Venturing back to our primal roots…
There are plenty of tutorials out there that explain how to optimize your Splunk search, and for the most part they do a really good job. However, as with any situation, there are edge cases… Cases where you need to search through 13 months of WinEventLog data, totalling over 14TB. Cases where even the most seasoned of SPL (Search Processing Language) authors run screaming. It’s times like this when it becomes important to harken back to your primal roots and forgo modern conveniences like field extractions and tags, in an attempt to get back to the basics of searching. But, what do these primal instincts look like? I’ve boiled them down into a few simple rules for turning an “All Time” search over index=wineventlog from a “Nightmare on SPL Street” into “Done in 5600 seconds” (are these movie puns doing anything for you? We should hang out more).
As I go through these rules, think about how you would apply them to this search in order to improve its performance. As it stands, this search will basically run forever when subjected to “All Time” on 13 months / 14TB of data:
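The original search isn’t reproduced here, so for the purposes of this walkthrough, picture something shaped roughly like the sketch below. Every specific in it (the host pattern, the lookup, the eval logic, the EventCodes) is hypothetical, pieced together from the clean-up steps later in this post:

```
tag=authentication eventtype=windows_logon_success sourcetype="WinEventLog:Security" host=dc*
    (EventCode=4624 OR EventCode=4768 OR EventCode=4776)
| lookup logon_types.csv Logon_Type OUTPUT logon_description
| eval my_action=if(logon_description=="Network", "remote", "local")
| search my_action="remote"
| stats count by user, host
```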
Rule #1: Limit the data Splunk touches:
Adding “index=” to your search is the single best thing you can do to improve its performance — I was able to convert a never-ending search to one that completed in less than 60 seconds by simply adding an index restriction. Other useful fields are “sourcetype”, “source” and “host” — these are “indexed” fields, meaning they are actually stored to disk when the data is received, rather than being calculated at search time. Depending on the source, other fields may be indexed as well (structured data such as CSV or JSON is often configured with indexed fields). Generally, though, most fields are “search time” fields. Think of indexed fields like the fields in an “index” on a database table — using them is a really fast way to access your data.
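For example (the index and sourcetype names here are hypothetical), a bare search like:

```
EventCode=4624
```

forces Splunk to consider every index you have access to, while:

```
index=wineventlog sourcetype="WinEventLog:Security" EventCode=4624
```

lets it skip straight to the data that could possibly contain a match.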
Rule #2: Avoid wildcards like the plague:
Splunk lets you use wildcards, but it doesn’t use them very efficiently. “sourcetype=rsa*” doesn’t mean “look at my list of sourcetypes, get the ones that start with rsa, and search for those”. Unfortunately, it means “look at all of my events and discard the ones that don’t start with rsa”. This is disk I/O that could be better spent on just about anything else.
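Illustrating with made-up sourcetype names: rather than

```
sourcetype=rsa*
```

spell out the sourcetypes you actually want:

```
(sourcetype=rsa:syslog OR sourcetype=rsa:audit)
```

It’s more typing, but the explicit list lets Splunk match against values it already knows, instead of reading events just to throw them away.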
Rule #3: Beware your SPL ordering:
Splunk commands come in lots of shapes and sizes. Two of those shapes are “centralized streaming” and “transforming”. These commands are basically the arch nemesis of MapReduce. Splunk achieves a good deal of its performance by distributing part of the search out to the indexers, and then aggregating the results on the search head. Centralized streaming and transforming commands run in the reduce part of the process (on the search head), and force any SPL that comes after them in the pipeline to do the same. The more work you can push out to the indexers, the better the search performance, because it’s distributed amongst more systems. Some examples of these commands include dedup, table, and stats. This may seem counter-intuitive, as “dedup” and “stats” sound like excellent ways to limit the amount of data the rest of your search needs to process. In practice, however, due to their implementation, these commands end up being a bottleneck in the search.
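As a contrived illustration (index and field names are hypothetical), these two searches produce the same per-host counts. The first hauls every event back to the search head before filtering, because the filter sits after the transforming stats command; the second lets the indexers discard non-matching events before anything crosses the wire:

```
index=wineventlog | stats count by host, EventCode | search EventCode=4624

index=wineventlog EventCode=4624 | stats count by host
```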
Rule #4: Knowledge objects are the devil, part one:
I like to refer to this as Splunk’s “big ugly elephant in the room”. Just like wildcards, you should avoid both tags and eventtypes like the plague. Under the hood, when you search for “tag=authentication”, what Splunk actually does is go grab every search string that has that tag, and joins them together with an “OR”, until you end up with a MASSIVE list of search parameters. This is especially bad when you haven’t followed rule #1 and narrowed your search to a single index (let’s face it, sometimes that’s just not an option). Now, Splunk is forced to look at even MORE events that you don’t care about, to see if the field extractions (that it’s applying just-in-time as you search) match your filter. Using tags and/or eventtypes in a search, especially one that’s long-running, is easily the biggest thing that could negatively impact search performance — and it’s also the most inconvenient thing to have to live without.
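To make that concrete, here is a completely hypothetical expansion; real ones, especially on an Enterprise Security install, can run to hundreds of OR clauses. A search for:

```
tag=authentication
```

becomes, under the hood, something like:

```
(sourcetype=linux_secure "Accepted password")
OR (sourcetype="WinEventLog:Security" EventCode=4624)
OR (sourcetype=cisco:asa "%ASA-6-605005")
```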
Rule #5: Knowledge objects are the devil, part two:
Just like with part one, this is another thing you just don’t talk about at parties. The more data you’re looking at, the less you should be relying on field extractions. In order to filter data on these fields, Splunk has to actually read EVERY event that matches the indexed fields, apply the regular expressions or eval statements or lookups, and then determine if it matches your filter. This is where your primal instincts will come into play. Splunk is actually REALLY good at doing simple text searching. So, if you provide a bunch of text, that appears in the events you’re interested in, you’re going to limit the events in a much more efficient manner — sure, Splunk still has some disk I/O to do, but it’s not also applying field extractions to the data to tell if it matches. Much, much faster.
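A hypothetical before-and-after (the account name and literal strings are made up, and the literals must actually appear in your raw events for this to work). Instead of:

```
index=wineventlog user=svc_backup action=success
```

try:

```
index=wineventlog "svc_backup" "Audit Success"
```

The first version applies the user and action extractions to every event in the index just to test the filter; the second is a raw text match that never touches the extraction pipeline.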
Rule #6: Always look at the job inspector:
This is less of a tip for how to optimize searches, and more of a tip for how to get better at it. Next time you run a search, look under the search box — you’ll see a link that says “Job”; when you click that, you’ll get an option that says “Inspect Job”. The job inspector provides a wealth of information about your search, including where it spent the most time and how it parsed your search. Some things of particular interest:
- “normalizedSearch” – This is the result of Splunk compiling your search into the internal syntax it uses to filter data
- “remoteSearch” – This is the part of the search that will run on the indexers
- “reportSearch” – This is the part of the search that will run on the search head
Remember, you want to force as much into “remoteSearch” as you can. Try searching for a common tag, like authentication or network, on an Enterprise Security search head, and see what pops up in normalizedSearch.
How do you think you can improve this search?
Let’s go step by step through each of the rules to see how we can get that search to return data. In the end, we were able to shrink its runtime down to 5,518 seconds.
1.) First, let’s restrict the search to only the index we care about. Let’s add “index=wineventlog” to the search. Now we have:
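As an illustration (using a hypothetical search whose host pattern, lookup, eval, and EventCodes are all made up), the change is just the first token:

```
index=wineventlog tag=authentication eventtype=windows_logon_success sourcetype="WinEventLog:Security" host=dc*
    (EventCode=4624 OR EventCode=4768 OR EventCode=4776)
| lookup logon_types.csv Logon_Type OUTPUT logon_description
| eval my_action=if(logon_description=="Network", "remote", "local")
| search my_action="remote"
| stats count by user, host
```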
2.) Now, let’s remove wildcards. This is easy, since the only wildcard we have is in the “host” field. See below:
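In a hypothetical version of the search (the host names are made up), that means swapping the wildcard for an explicit OR of the hosts we actually care about:

```
index=wineventlog tag=authentication eventtype=windows_logon_success sourcetype="WinEventLog:Security"
    (host=dc01 OR host=dc02) (EventCode=4624 OR EventCode=4768 OR EventCode=4776)
| lookup logon_types.csv Logon_Type OUTPUT logon_description
| eval my_action=if(logon_description=="Network", "remote", "local")
| search my_action="remote"
| stats count by user, host
```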
3.) Reordering the SPL is going to totally change this search around. This one is tricky — we removed the “logon_types.csv” lookup, the eval for my_action, and the second search command, and replaced them with restrictions in the first search:
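Sketching it with hypothetical field values: if the lookup/eval/search trio only existed to keep network logons, the same filter can be expressed directly in the first search (Windows happens to use logon type 3 for network logons), so the indexers can discard everything else before the pipeline even starts:

```
index=wineventlog tag=authentication eventtype=windows_logon_success sourcetype="WinEventLog:Security"
    (host=dc01 OR host=dc02) (EventCode=4624 OR EventCode=4768 OR EventCode=4776) Logon_Type=3
| stats count by user, host
```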
4.) Now, let’s get rid of tags and eventtypes as much as possible. The “windows_logon_success” eventtype comes from the Splunk Add-on for Windows, and expands to “sourcetype=*:Security (signature_id=4624 OR signature_id=528 OR signature_id=540)”. We’re going to clean up the search while we’re at it — two of the EventCodes in the original search can never match the eventtype’s filter, so we can remove them, and there’s also a duplicate sourcetype match to drop, as shown below:
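Substituting that expansion into our hypothetical sketch (the hosts and Logon_Type filter are still made up): the tag goes away along with the eventtype, EventCodes 4768 and 4776 can never match the signature_id list so they’re gone, and the expansion’s wildcarded sourcetype duplicates the explicit one already in the search, so in this sketch we keep the explicit one:

```
index=wineventlog sourcetype="WinEventLog:Security" (host=dc01 OR host=dc02)
    (signature_id=4624 OR signature_id=528 OR signature_id=540) Logon_Type=3
| stats count by user, host
```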
5.) Lastly, let’s replace field extractions and see the final result. Each match on a search time field extraction was replaced with whatever that value would appear as, literally, in the event:
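In the running hypothetical, signature_id and Logon_Type are both search-time extractions, so they get swapped for the literal strings a matching raw event would contain (the exact text, including spacing, depends entirely on your event format):

```
index=wineventlog sourcetype="WinEventLog:Security" (host=dc01 OR host=dc02)
    ("EventCode=4624" OR "EventCode=528" OR "EventCode=540") "Logon Type: 3"
| stats count by user, host
```

The stats at the end still relies on field extraction, but only for the handful of events that survive the filter, which is cheap.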
So… how’d you do?
Don’t be disappointed if you weren’t able to simplify that search down to the basics on the first try – even our expert search developers were surprised by the results. With practice and discipline, though, you’ll be eating Splunk searches just like your caveman ancestors.