The flexibility of Splunk’s search language can lead to poor search performance when searches are badly written. It’s common to see Splunk searches that use a wildcard (*) to represent multiple possible values. However, wildcards can significantly affect search performance (as well as the accuracy of search results) and should be used carefully. This is especially true of leading wildcards in search terms.
I’ve put together this quick demonstration of how wildcards impact search performance.
As you can see in this video, search performance varies significantly based on how wildcards are used. In this example, a trailing wildcard (e.g., Acc*) performs about as well as a search without any wildcards, whereas a leading wildcard (e.g., *ccept) quadruples the execution time of the search.
If you simply want to improve your search performance, you can stop here. The next section covers why this occurs in more technical detail, but I’ll warn you, it’s a bit complicated.
Why do leading wildcards impact search performance?
This behavior is related to how Splunk handles data processing under the hood. I’ll attempt to explain this in the simplest way possible to avoid confusion. Please note that the actual behavior is a bit more complex than described here.
When Splunk ingests data, it breaks events into searchable segments, which are stored in files that Splunk accesses when a search is run (these are the TSIDX files, the index files commonly estimated at about 35% of the original data size and multiplied by the search factor in a disk space calculation).
Segments are what allow Splunk to search for individual terms. In general, spaces and most special characters determine where segmentation happens, and Splunk examines the segments created by these characters when a search is run. The characters that create segments are called breakers, which are either major or minor. The config file that controls this behavior is segmenters.conf.
For example, here is a syslog message:
By default, the space and brackets in this message are major breakers, and the periods, colons, and dashes are minor breakers. For this example, I’m only going to focus on the major breakers.
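To make the idea of major breakers concrete, here is a minimal Python sketch. It is only an illustration: the sample message is hypothetical (it is not the message from this article), the breaker set is reduced to spaces and square brackets, and lowercasing the terms is a simplification. Splunk’s real segmentation is driven by segmenters.conf and is more involved.

```python
import re

# Hypothetical syslog-style line, made up for illustration only.
EVENT = "Jan 12 10:15:32 host01 [sshd] Accepted password for admin from 10.1.2.3"

# Simplifying assumption: only spaces and square brackets are major breakers.
MAJOR_BREAKERS = re.compile(r"[ \[\]]+")


def major_segments(event):
    """Split an event on the simplified major breakers and return sorted unique terms."""
    tokens = (token.lower() for token in MAJOR_BREAKERS.split(event) if token)
    return sorted(set(tokens))


terms = major_segments(EVENT)
for term in terms:
    print(term)
```

Note that values such as 10:15:32 and 10.1.2.3 stay whole here, because colons and periods are minor breakers and this sketch only splits on the major ones.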
This means when a search is run, Splunk looks through a list of terms that looks something like this:
Let’s say I’m searching for “admin.” I can quickly find it in this sorted list. Now, consider the impact of using wildcards with this list. With a trailing wildcard, the behavior isn’t much different from what happens when an exact term is specified. If I’m searching for adm*, I’ll first jump to the part of the list where terms start with adm, and then only look at those entries to see if any need to be discarded.
Things are significantly different when a leading wildcard is in use. Instead of quickly finding what I’m looking for in the list, I now need to look at every single line and check whether it is a match. Try it yourself by scanning the list for *dmin. I guarantee that it’ll take you longer than just looking for admin, and the same thing happens for Splunk. Essentially, you end up having to look at everything in the list to see if it’s a match.
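The trailing-wildcard lookup can be sketched in Python with a binary search over a sorted list. This is an analogy under stated assumptions, not Splunk’s actual TSIDX implementation: the term list and the `prefix_matches` helper are hypothetical names made up for this example.

```python
from bisect import bisect_left

# Hypothetical sorted term list, standing in for the terms Splunk would look through.
TERMS = sorted([
    "accepted", "admin", "administrator", "failed", "for",
    "from", "host01", "password", "root", "sshd",
])


def prefix_matches(terms, prefix):
    """Return all terms starting with prefix, reading only the matching run of a sorted list."""
    start = bisect_left(terms, prefix)  # jump straight to the first possible match
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means nothing later can match
        matches.append(term)
    return matches


print(prefix_matches(TERMS, "adm"))
```

Because the list is sorted, everything that starts with adm sits together, so the search can stop as soon as it sees the first term that no longer matches.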
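The same point can be shown with a small Python sketch that counts how many terms have to be examined for a leading wildcard. Again, this is only an analogy: the term list and the `leading_wildcard_matches` helper are hypothetical, and Splunk’s internals are more complex than a plain list scan.

```python
from fnmatch import fnmatchcase

# Hypothetical sorted term list, made up for illustration.
TERMS = sorted([
    "accepted", "admin", "administrator", "failed", "for",
    "from", "host01", "password", "root", "sshd",
])


def leading_wildcard_matches(terms, pattern):
    """Return (matches, terms_checked) for a pattern such as '*dmin'."""
    checked = 0
    matches = []
    for term in terms:
        checked += 1  # sorted order does not help: every term must be compared
        if fnmatchcase(term, pattern):
            matches.append(term)
    return matches, checked


matches, checked = leading_wildcard_matches(TERMS, "*dmin")
print(matches, "after checking", checked, "of", len(TERMS), "terms")
```

With only ten terms the cost is trivial, but the number of comparisons grows with the size of the list, which is why the effect is noticeable on a real index.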
If you want to read more about this, here are some references from the Splunk documentation to consider:
If you’re a Splunk user, hopefully the first half of this tutorial helped you see the impact of wildcards on your searches. If you’re a Splunk administrator, hopefully this article is helpful when you need to explain this to your users. And if, like me, you love digging into the technical details of things, hopefully you have a better understanding of what Splunk is actually doing to make all your data searchable.