The Best Guide for Exporting Massive Amounts of Data From Splunk

By Tom Kopchak | Published On: July 11th, 2017

In the Splunk world, it’s normal to find yourself dealing with massive amounts of data – that’s what Splunk was designed for after all. While Splunk is well equipped for ingesting large quantities of data without issue, it can be significantly more difficult to extract the original raw data from Splunk (if you ever need to).

In many respects, this makes sense. Splunk is primarily designed to be a log archive and analysis platform. The true power of Splunk comes from being able to return the needle in the haystack with some cool visualizations along the way. But, what if you find yourself needing an inordinate amount of hay? Can Splunk be coerced to export a massive amount of data?

You might be asking yourself, “Hey Tom, why would you ever want to do such a thing?” Well, we’re glad you asked. This blog is actually based on a true story where a very important Technology Add-on (TA) went missing from an Index Cluster. That’s not great in general, but it is especially bad when you think about the Splunk data pipeline. In our specific case, this missing TA equated to 4 days’ worth of data not being extracted properly. That resulted in 4 days’ worth of data that wasn’t CIM compliant, and subsequently 4 days of data that wasn’t populating a customer’s Splunk App for Enterprise Security.

Exporting Massive Amounts of Data From Splunk

The Palo Alto Networks Add-on for Splunk is the add-on that went missing in our case. This add-on requires that data is ingested via a very specific sourcetype. When it passes through the Indexing tier, it is broken out into different sourcetypes to be analyzed in Splunk. For version 5.x or later of this add-on, the incoming syslog must be configured to use the sourcetype pan:log. The add-on will automatically break this up into different sourcetypes, such as pan:config, pan:traffic, and pan:threat.

Unfortunately, if you don’t follow the add-on installation instructions and pick a different sourcetype, none of this magic will happen. Additionally, if you do pick the right sourcetype, but don’t have the add-on present (which is what happened in this case), you will wind up with a bunch of data that you can’t effectively use in Splunk. Garbage in, garbage out.

There is another issue customers typically have with their data that made this problem truly difficult to tackle: how to store backups of syslog data. Unfortunately for the customer we were working with, the Palo Alto logs made up approximately 80-90% of their 300 GB daily license. It’s fairly unreasonable to expect a customer to pay for storage not only to index all of that data to meet their retention policies, but also to store a raw copy of the data separately. One reason that would be so ludicrous is that Splunk technically already stores your data in its raw format in a default field called _raw, so there really is no need.

So, what do you do if you wind up in a similarly all-around bad situation? You find a way to export that data, despite the fact that you may have been told “it’s not possible” by some fairly reputable sources.

Approach

Fortunately, Splunk has several mechanisms available to return the raw events from a search. For a small dataset, this can be done through SplunkWeb when viewing the search results. For a larger dataset, this often will require the search to be run a second time (even if it was already completed), in order to ensure all the events are returned properly. When dealing with exports containing millions of events or hundreds of gigabytes of data (where a search to export this data could conceivably take days to run), this approach isn’t all that practical.

To address this issue, there are other methods available to export data from Splunk, the full list of which can be referenced here. According to Splunk, “for large exports, the most stable method of search data retrieval is the Command Line Interface (CLI)”. Since we were facing what appeared to be over a terabyte of exported events, this seemed to fit into the “large exports” category. When working with a Splunk Cloud deployment, you don’t technically have CLI access. But this is Splunk – there has to be a way to get this to work.

Utilizing the Splunk CLI

One of the most powerful Splunk features is the Splunk CLI. While this is not available locally for Splunk Cloud, you can request access to the Splunk Cloud management port, which we do for all of our customers. With that level of access, we have the ability to run search commands on a remote Splunk instance, which means that the CLI export method was now available with Splunk Cloud.

Estimating storage size

We started by spinning up a VM in the lab with a lot of disk space. For this example, we estimated approximately 300 GB of raw syslog per day over a 4-day window, which comes to roughly 1.2 TB of space. If you need to do a similar calculation, you can use the following search:
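As a quick sanity check on that sizing, the arithmetic can be sketched in a few lines of Python. This is only an illustration of the estimate above, not the search used in the original post; the function name is made up for this example, and it assumes decimal units (1 TB = 1000 GB).

```python
def estimate_export_size_tb(daily_ingest_gb: int, days: int) -> float:
    """Return the approximate raw export size in TB (decimal, 1 TB = 1000 GB)."""
    return daily_ingest_gb * days / 1000


# Figures from this post: roughly 300 GB of syslog per day across a 4-day window
print(f"{estimate_export_size_tb(300, 4)} TB")
```

Leave yourself some headroom on top of this number, since the exported files and any re-indexing work will share the same disk.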


Getting Splunk running for the data export

In this lab instance, we installed a copy of Splunk to match the customer’s Splunk Cloud instance, which, at the time of this writing, was 6.4.x. (Note: we initially tried this with a 6.5 instance, but due to some SSL changes in that version, it was unsuccessful. Rather than troubleshooting that issue, matching the version was the simplest solution). From this system, we were able to test running searches against the remote Splunk instance using a sample command such as the following:


You can run this search to verify your basic connectivity and confirm you are getting data returned. However, you probably don’t want to run this as-is because by default it will return everything in that sourcetype/index combination. In the case of a massive amount of data, timeouts and bandwidth will be your enemy.
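For illustration, a remote CLI search of this kind can be assembled as below. This is a hedged sketch rather than the exact command we ran: the hostname is a placeholder, the index and sourcetype are the ones discussed in this post, and the helper name is hypothetical. It assumes the standard `splunk search` options `-uri`, `-output`, and `-maxout`, and adds a short time window so the test does not pull back everything.

```python
import shlex


def build_export_command(uri, search, output_mode="rawdata", max_events=0, splunk_bin="splunk"):
    """Build the argument list for a remote `splunk search` export.

    Credentials are deliberately omitted so the CLI prompts for them.
    A -maxout of 0 is assumed to mean "no limit on returned events".
    """
    return [
        splunk_bin, "search", search,
        "-uri", uri,
        "-output", output_mode,
        "-maxout", str(max_events),
    ]


# Placeholder management URI; port 8089 is the Splunk management port
cmd = build_export_command(
    "https://splunk.example.com:8089",
    "index=pan_logs sourcetype=pan:log earliest=-15m",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell on the export VM, or the list can be handed to `subprocess.run` with stdout redirected to a file.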

Fortunately, we have scripting to the rescue!

Once we had our CLI access and a healthy amount of storage, that was just about all we needed to get off the ground and start exporting our data. All we needed was a little bit of Python, which can be found below.

Note that the below script requires the Splunk SDK as well as progressbar.

You will also need to set up the following config.conf file before using the script. Modify the values to fit your specific needs:
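Judging by the keys the script reads, config.conf needs an [options] section with hostname, username, password, search_string, earliest_time, and latest_time. A minimal sketch that generates such a template is shown here; every value is a placeholder, and the ISO-8601 timestamps are an assumption about the time format you want to pass to the search job.

```python
import configparser
import io

REQUIRED_KEYS = ("hostname", "username", "password",
                 "search_string", "earliest_time", "latest_time")


def make_config_template():
    """Return the text of a config.conf template with placeholder values."""
    config = configparser.RawConfigParser()
    config["options"] = {
        "hostname": "splunk.example.com",      # placeholder host
        "username": "export_user",             # placeholder account
        "password": "changeme",                # placeholder secret
        "search_string": "index::pan_logs sourcetype::pan:log",
        "earliest_time": "2017-06-01T09:30:00",
        "latest_time": "2017-06-04T10:15:00",
    }
    buf = io.StringIO()
    config.write(buf)
    return buf.getvalue()


print(make_config_template())
```

Save the printed text as config.conf in the same directory as the script, then replace the placeholders with your own values.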


#!/usr/bin/env python3
"""Run a search on a remote Splunk instance and save the results to output.csv."""
import configparser
import os
import sys
from time import sleep

import progressbar
import splunklib.client as client
from splunklib.binding import AuthenticationError

PAGE_SIZE = 1000  # Number of events requested per call to job.results()


def run():
    # Read connection and search settings from config.conf next to this script
    dir_name = os.path.dirname(os.path.abspath(__file__))
    config = configparser.RawConfigParser()
    with open(os.path.join(dir_name, "config.conf"), "r") as f:
        config.read_file(f)
    try:
        hostname = config.get("options", "hostname")
        username = config.get("options", "username")
        password = config.get("options", "password")
        search_string = config.get("options", "search_string")
        earliest_time = config.get("options", "earliest_time")
        latest_time = config.get("options", "latest_time")
    except (configparser.NoSectionError, configparser.NoOptionError):
        sys.exit("Error in configuration file.")

    if "" in (hostname, username, password, search_string, earliest_time, latest_time):
        sys.exit("Error: missing config.conf value")

    # Connect to the Splunk management port
    try:
        service = client.connect(host=hostname, port=8089, username=username, password=password)
    except AuthenticationError:
        sys.exit("Error: incorrect credentials")
    print("Successfully connected to Splunk")

    # Start the search as a normal (asynchronous) job
    kwargs = {"exec_mode": "normal",
              "earliest_time": earliest_time,
              "latest_time": latest_time}
    print("Running search")
    job = service.jobs.create("search " + search_string, **kwargs)
    print("Search job created with SID %s" % job.sid)

    # Progress bar fanciness (search progress in tenths of a percent)
    widgets = [progressbar.Percentage(), progressbar.Bar()]
    bar = progressbar.ProgressBar(widgets=widgets, max_value=1000).start()

    # Wait for job to complete
    while True:
        while not job.is_ready():
            sleep(0.5)  # Avoid a busy loop while the job is still being set up
        if job["isDone"] == "1":
            bar.finish()
            print("\nJob completed")
            break
        progress_percent = round(float(job["doneProgress"]) * 100, 1)
        bar.update(int(progress_percent * 10))
        sleep(2)

    event_count = int(job["eventCount"])
    print("\nDownloading and writing results to file")

    # Progress bar fanciness round 2 (download progress in events)
    widgets = [progressbar.Percentage(), progressbar.Bar()]
    bar = progressbar.ProgressBar(widgets=widgets, max_value=max(event_count, 1)).start()

    # Read results one page at a time and write them to file.
    # Each page is a separate CSV response, so the header row may repeat.
    offset = 0
    with open(os.path.join(dir_name, "output.csv"), "wb") as out_f:
        while offset < event_count:
            try:
                job_results = job.results(output_mode="csv", count=PAGE_SIZE, offset=offset)
            except AuthenticationError:
                print("Session timed out. Reauthenticating")
                service = client.connect(host=hostname, port=8089, username=username, password=password)
                job = service.job(job.sid)  # Rebind the job to the new session
                job_results = job.results(output_mode="csv", count=PAGE_SIZE, offset=offset)
            for result in job_results:
                out_f.write(result)
            offset += PAGE_SIZE
            bar.update(min(offset, event_count))
    bar.finish()
    print("\nDone!")


if __name__ == "__main__":
    run()

This situation was made extra exciting because we ended up running two Splunk instances on the data exfiltration node at the same time: one for outputting data to files and one for re-indexing the data as those files were written. We don’t recommend this scenario long term, but it helped us set this up over a weekend, and when we came back in on Monday, everything was solved. Theoretically, if you had everything in place right away, you could do this with one Universal Forwarder (UF) or one Heavy Forwarder (HF). We were figuring this out in stages, however, so we set up a UF to start downloading the data, as we knew it would take a while to export the quantity of data we needed. We used a second Splunk instance to start testing ingestion into a test index, both to make sure it was working as expected and to fine-tune some hostname props/transforms settings. This was probably overkill in hindsight, so if you go to implement this, keep in mind you may be able to streamline things even more.

The final process

1.) This code runs a search for the data that was indexed incorrectly. In our case, anything that was indexed as “pan:log” was actually unusable. Data should have ended up in the indexes as pan:threat, pan:traffic, etc. So to identify the “bad” data that we needed to re-index, we set our search_string to “index::pan_logs sourcetype::pan:log”. The code also has two variables for “Start” and “End” times, so you can say “I want to look for this data over this span of time”. Both times should fall on 15-minute boundaries. If your most recent “bad” event is at June 4, 2017 at 10:01AM, you’ll set your start time to June 4, 2017 at 10:15AM. If the oldest bad event is June 1, 2017 at 9:37AM, you’ll set your end time to June 1, 2017 at 9:30AM. This way you make sure to capture all bad events.

2.) Next, when you run the script, it will export data in 15-minute chunks and leave them in the same folder that you ran the script in. This helps alleviate the timeout issues that can result from trying to download 1.2 TB of data all at once. (Especially if you work from home and are plagued with a terrible internet connection.)

3.) You’ll need to decide how to re-index the data. For this script, you’ll set up a File Input to read in the files from whatever directory you exported them to. Ours looked like the following:


4.) Lastly, you’ll need to decide whether a Heavy Forwarder or a Universal Forwarder will do the work. We used a Heavy Forwarder for some very specific reasons that I won’t get into here, especially because Heavy Forwarders are, in general, not recommended. The point being, we displayed a Heavy Forwarder in this diagram, but you could also do this with a single Universal Forwarder if you didn’t need any custom props/transforms (we did in this case).
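The 15-minute chunking described in steps 1 and 2 can be sketched as follows. This is an illustrative helper, not the code we ran: the function names are hypothetical, and it assumes you know the timestamps of the oldest and newest bad events. It rounds the oldest event down and the newest event up to 15-minute boundaries, then yields one (earliest, latest) pair per export.

```python
from datetime import datetime, timedelta

CHUNK = timedelta(minutes=15)


def floor_to_chunk(ts):
    """Round a timestamp down to the previous 15-minute boundary."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def chunk_windows(oldest_bad_event, newest_bad_event):
    """Yield (earliest, latest) pairs that cover every bad event in 15-minute steps."""
    start = floor_to_chunk(oldest_bad_event)
    # Always extend past the newest event so it falls inside the final window
    end = floor_to_chunk(newest_bad_event) + CHUNK
    while start < end:
        yield start, start + CHUNK
        start += CHUNK


# Timestamps from the example in step 1
windows = list(chunk_windows(datetime(2017, 6, 1, 9, 37), datetime(2017, 6, 4, 10, 1)))
print(len(windows), "windows from", windows[0][0], "to", windows[-1][1])
```

Each pair can be passed as the earliest and latest times of one export job and written to its own file, so a timeout only costs you a single 15-minute window rather than the whole export.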

Checking our work

Last but not least, we should at least show the results from our efforts. Below is a look at the Palo Alto logs from the 4 days in question:

Since the logs that we exported and re-indexed had a different source than the logs previously indexed, we could look across all Palo Alto logs and use a timechart command with a count to create a nice visualization. The search we used and graph are included here. The only sourcetype we filtered out was “pan:log” since we didn’t care about those events. You’ll notice that the blue line shows a count of events that came from this script. The yellow line is the count of events that came from the standard process. So what we can see is that we successfully backfilled this timeframe with all 1.2TB of data.


Hope this helps!

Hopefully this helps someone else in the future who ends up in a “worst case scenario” after accidentally ingesting an excessive amount of bad data that they don’t have a backup of. This is the first time we had to tackle this problem, and although it seemed possible in theory, it’s always nice to see theory meet practice. If anyone has tackled this problem before, finds this useful, or has any comments or questions, feel free to comment below.

About Hurricane Labs

Hurricane Labs is a dynamic Managed Services Provider that unlocks the potential of Splunk and security for diverse enterprises across the United States. With a dedicated, Splunk-focused team and an emphasis on humanity and collaboration, we provide the skills, resources, and results to help make our customers’ lives easier.

For more information, visit www.hurricanelabs.com and follow us on Twitter @hurricanelabs.