The Palo Alto Networks Add-on for Splunk is the add-on that went missing in our case. This add-on requires that data be ingested with a very specific sourcetype. As the data passes through the indexing tier, it is broken out into different sourcetypes for analysis in Splunk. For version 5.x or later of this add-on, the incoming syslog must be configured to use the sourcetype pan:log. The add-on will automatically break this up into different sourcetypes, such as pan:config, pan:traffic, and pan:threat.
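As a sketch of what "the right sourcetype" looks like in practice, a syslog input on a forwarder might be tagged like this (the port and index name here are illustrative assumptions, not the customer's actual config):

```
# inputs.conf -- tag incoming Palo Alto syslog with the sourcetype the add-on expects
[udp://514]
sourcetype = pan:log
index = pan
no_appending_timestamp = true
```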
Unfortunately, if you don’t follow the add-on installation instructions and pick a different sourcetype, none of this magic will happen. Additionally, if you do pick the right sourcetype, but don’t have the add-on present (which is what happened in this case), you will wind up with a bunch of data that you can’t effectively use in Splunk. Garbage in, garbage out.
There is another issue that customers typically have with their data that made this problem truly difficult to tackle: how to store backups of syslog data. Unfortunately for the customer we were working with, the Palo Alto logs made up approximately 80-90% of their 300GB/day license. It’s fairly unreasonable to expect a customer to pay for storage not only to index all of that data to meet their retention policies, but also to store a separate raw copy of the data. That requirement is especially ludicrous because Splunk already stores your data in its raw format in a default field called _raw – so there really is no need.
So, what do you do if you wind up in a similarly all-around bad situation? You find a way to export that data, despite the fact that you may have been told “it’s not possible” by some fairly reputable sources.
Fortunately, Splunk has several mechanisms available to return the raw events from a search. For a small dataset, this can be done through SplunkWeb when viewing the search results. For a larger dataset, this often requires the search to be run a second time (even if it already completed) in order to ensure all the events are returned properly. When dealing with exports containing millions of events or hundreds of gigabytes of data (where a search to export this data could conceivably take days to run), this approach isn’t all that practical.
To address this issue, there are other methods available to export data from Splunk, the full list of which can be referenced here. According to Splunk, “for large exports, the most stable method of search data retrieval is the Command Line Interface (CLI)”. Since we were facing what appeared to be over a terabyte of exported events, this seemed to fit into the “large exports” category. When working with a Splunk Cloud deployment, you don’t technically have CLI access. But this is Splunk – there has to be a way to get this to work.
Utilizing the Splunk CLI
One of the most powerful Splunk features is the Splunk CLI. While this is not available locally for Splunk Cloud, you can request access to the Splunk Cloud management port, which we do for all of our customers. With that level of access, we have the ability to run search commands on a remote Splunk instance – which means that the CLI export method was now available with Splunk Cloud.
Estimating storage size
We started by spinning up a VM in the lab with a lot of disk space. For this example, we calculated approximately 300GB of raw syslog per day over the course of a 4-day window, which comes to roughly 1.2TB of space. If you need to do a similar calculation, you can use the following search:
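The original search isn't reproduced here, but a search along these lines measures the raw size of the events you'd be exporting (the index name pan is an illustrative assumption):

```
index=pan sourcetype=pan:log
| eval raw_bytes=len(_raw)
| stats sum(raw_bytes) AS total_bytes
| eval total_gb=round(total_bytes/1024/1024/1024, 2)
```

Run it over the time range you plan to export, then add some headroom for filesystem overhead when sizing the disk.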
Getting Splunk running for the data export
In this lab instance, we installed a copy of Splunk to match the customer’s Splunk Cloud instance, which, at the time of this writing, was 6.4.x. (Note: we initially tried this with a 6.5 instance, but due to some SSL changes in that version, it was unsuccessful. Rather than troubleshooting that issue, matching the version was the simplest solution). From this system, we were able to test running searches against the remote Splunk instance using a sample command such as the following:
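A sample command along these lines exercises the remote management port (the Splunk Cloud hostname and credentials below are placeholders, and the index name pan is an assumption):

```
/opt/splunk/bin/splunk search 'index=pan sourcetype=pan:log' \
    -uri https://yourstack.splunkcloud.com:8089 \
    -auth admin:changeme \
    -maxout 0 -output rawdata
```

The -uri flag points the local CLI at the remote instance, -maxout 0 removes the default cap on returned events, and -output rawdata returns the original event text rather than a formatted table.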
You can run this search to verify your basic connectivity and confirm you are getting data returned. However, you probably don’t want to run this as-is because by default it will return everything in that sourcetype/index combination. In the case of a massive amount of data, timeouts and bandwidth will be your enemy.
Fortunately, we have scripting to the rescue!
Once we had our CLI access and a healthy amount of storage, we had just about everything needed to get off the ground and start exporting our data. The last piece was a little bit of Python, which can be found below.
Note that the below script requires the Splunk SDK as well as progressbar.
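The script itself isn't reproduced here, but the sketch below captures the approach, assuming the Splunk Python SDK (splunklib), a config.conf with hypothetical section/key names, and an index named pan. The key idea is chunking the export into fixed time windows so no single request runs long enough to hit the timeouts mentioned above; progress reporting via progressbar is omitted for brevity.

```python
# Illustrative sketch, not the original script: chunk a large raw export into
# fixed time windows so each CLI/REST export stays small enough to avoid
# timeouts. Hostname, credentials, and index name are placeholder assumptions.
import gzip
import configparser
from datetime import datetime, timedelta

TIME_FMT = "%Y-%m-%dT%H:%M:%S"

def time_windows(start, end, hours=24):
    """Yield (earliest, latest) datetime pairs covering [start, end)."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(hours=hours), end)
        yield cur, nxt
        cur = nxt

def export_window(service, query, earliest, latest, out_path):
    """Stream one time window's raw events into a gzipped file."""
    # jobs.export() streams results without creating a persistent search job;
    # output_mode="raw" returns the original _raw event text.
    stream = service.jobs.export(
        query,
        earliest_time=earliest.strftime(TIME_FMT),
        latest_time=latest.strftime(TIME_FMT),
        output_mode="raw",
    )
    with gzip.open(out_path, "wb") as out:
        for chunk in iter(lambda: stream.read(65536), b""):
            out.write(chunk)

def main():
    # Deferred import so the helpers above work without the SDK installed.
    import splunklib.client as client

    cfg = configparser.ConfigParser()
    cfg.read("config.conf")
    service = client.connect(
        host=cfg["splunk"]["host"],       # e.g. yourstack.splunkcloud.com
        port=int(cfg["splunk"]["port"]),  # management port, typically 8089
        username=cfg["splunk"]["username"],
        password=cfg["splunk"]["password"],
    )
    query = "search index=pan sourcetype=pan:log"  # hypothetical index name
    start = datetime(2017, 1, 1)  # adjust to the range you need to export
    end = datetime(2017, 1, 5)
    for earliest, latest in time_windows(start, end, hours=24):
        out_file = "pan_export_%s.log.gz" % earliest.strftime("%Y%m%d%H")
        export_window(service, query, earliest, latest, out_file)

if __name__ == "__main__":
    main()
```

If a 24-hour window is still too large for your event volume, shrink the hours parameter until individual exports complete reliably.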
You will also need to set up the following config.conf file before using the script. Modify the values to fit your specific needs:
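The original config.conf isn't shown here; a minimal version might look like the following, with section and key names that are illustrative assumptions (match whatever your copy of the script actually reads):

```
[splunk]
host = yourstack.splunkcloud.com
port = 8089
username = admin
password = changeme
```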
This situation was made extra exciting because we ended up running two Splunk instances on the data exfiltration node at the same time: one outputting data to files, and one re-indexing the data as those files were written. We don’t recommend this arrangement long term, but it helped us set everything up over a weekend, and when we came back in on Monday the problem was solved. Theoretically, if you had everything in place right away, you could do this with one UF or one HF. We were figuring this out in stages, however, so we set up a UF to start downloading the data first, since we knew it would take a while to export the quantity of data we needed. We used a second Splunk instance to start testing ingesting the data into a test index, to make sure it was working as expected and to fine-tune some hostname props/transforms settings. This was probably overkill in hindsight, so if you go to implement this, keep in mind you may be able to streamline things even more.
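The re-ingest side can be as simple as a monitor input watching the export directory, tagged with the sourcetype the add-on expects (the path and index name here are placeholder assumptions):

```
# inputs.conf on the instance re-indexing the exported files
[monitor:///data/pan_export/*.log.gz]
sourcetype = pan:log
index = pan
disabled = false
```

Splunk decompresses monitored .gz files automatically, so the exported archives can be indexed as-is.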
The final process