Identifying Splunk Search Duplicates with jellyfish and Jaro-Winkler

Published on: January 9th, 2020

Sniffing out a way to improve ES performance

In large Splunk environments with many users, it becomes more likely that your search infrastructure is working double time running duplicate searches. This can degrade performance, especially on Enterprise Security (ES) search heads, which carry the added overhead of the extra tools ES includes.

To counteract this problem, we developed a script to help sniff out searches that may be duplicated across search heads.

A word about jellyfish and Jaro-Winkler

To compare two searches, we need a metric that makes the comparison easy to interpret. Enter: Jaro-Winkler.

Jaro-Winkler is an algorithm that takes two strings and returns a value between 0 and 1 representing how similar they are, with 1 being an exact match. It is a modification of the Jaro algorithm that gives greater weight to matches at the beginning of the strings. This is a good fit for Splunk searches: a search is executed in order, so two searches that start the same way also start out doing the same work.
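To make the scoring concrete, here is a minimal pure-Python sketch of Jaro-Winkler (Python 3). It is only an illustration of the algorithm, not the code used in our script. The function names are my own, and the prefix scale of 0.1 and maximum prefix length of 4 are the commonly used defaults, which I am assuming here.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Two equal characters only count as a match if their positions
    # are no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters of both strings in order; each pair
    # that disagrees is half a transposition.
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions // 2

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro score boosted by the length of the shared prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


# Classic textbook pair: the score is roughly 0.961.
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))
```

Because of the prefix boost, two searches that share their first few characters never score lower than their plain Jaro similarity, which is the behavior that makes the metric attractive for comparing search strings.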

Instead of implementing this algorithm ourselves, we can use the Python library jellyfish, which implements several string comparison algorithms, including Jaro-Winkler.
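As a rough sketch of what calling the library looks like: recent jellyfish releases expose the score as jaro_winkler_similarity, while older releases named it jaro_winkler, so the hypothetical helper below (search_similarity is my name, not part of jellyfish) looks up whichever one is available. The two example searches are made up for illustration.

```python
try:
    import jellyfish
except ImportError:
    # The library is not installed; `pip install jellyfish` provides it.
    jellyfish = None


def search_similarity(search_a, search_b):
    """Return the Jaro-Winkler score (0 to 1) for two search strings."""
    if jellyfish is None:
        raise RuntimeError("jellyfish is not installed")
    # Newer releases use jaro_winkler_similarity; older ones used jaro_winkler.
    scorer = getattr(jellyfish, "jaro_winkler_similarity", None)
    if scorer is None:
        scorer = getattr(jellyfish, "jaro_winkler")
    return scorer(search_a, search_b)


if jellyfish is not None:
    # Hypothetical searches that differ only in the final field name.
    a = "index=main sourcetype=syslog error | stats count by host"
    b = "index=main sourcetype=syslog error | stats count by source"
    print(search_similarity(a, b))
```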

Note: Please make sure you are downloading the correct library. Recently, a malicious library named “jeIlyfish” (a capital “I” replaces the first “l”) was found to be stealing the SSH and GPG keys of its users. This shouldn’t be a problem if you use pip install (as a typo is unlikely), but if you regularly set up new Python libraries directly, take special care.

The Script

Python 2

Python 3

To run this script, your account must be able to execute remote REST searches. To use it, run the script from the command line and fill in the requested information. By default, output is written to the file output.txt in the current directory.
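For readers who have not run a REST search remotely before, here is one way such a request could be put together with only the Python 3 standard library. This is a sketch under assumptions, not an excerpt from the script: the helper name build_export_request, the host splunk.example.com, the credentials, and the exact SPL are all illustrative. It posts a search to the search/jobs/export endpoint on the management port and uses the rest command to list saved searches.

```python
import base64
import urllib.parse
import urllib.request

# Assumed SPL: list saved searches along with the server that owns them.
SAVED_SEARCHES_SPL = (
    "| rest /servicesNS/-/-/saved/searches splunk_server=* "
    "| fields splunk_server title search"
)


def build_export_request(base_url, username, password, spl):
    """Build (but do not send) a POST request that runs `spl` via the REST API."""
    url = base_url.rstrip("/") + "/services/search/jobs/export"
    body = urllib.parse.urlencode(
        {"search": spl, "output_mode": "json"}
    ).encode("utf-8")
    # HTTP basic authentication against the management port.
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", "Basic " + token)
    return request


# Hypothetical host and account, for illustration only.
request = build_export_request(
    "https://splunk.example.com:8089", "analyst", "changeme", SAVED_SEARCHES_SPL
)
print(request.get_method(), request.full_url)
```

Sending the request (for example with urllib.request.urlopen) is left out on purpose, since certificate handling and authentication differ between environments.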

The format of the output is as follows:

In testing, I found that a similarity score above 0.9 should prompt serious consideration of whether you need to run the search on both search heads.
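The 0.9 rule of thumb can be applied mechanically. The sketch below is my own illustration rather than the script itself: flag_possible_duplicates is a hypothetical helper that compares every search on one search head with every search on another and keeps the pairs scoring above the threshold. The scorer argument is any function that returns a similarity between 0 and 1, such as the Jaro-Winkler function from jellyfish.

```python
from itertools import product


def flag_possible_duplicates(searches_a, searches_b, scorer, threshold=0.9):
    """Return (name_a, name_b, score) for cross-head pairs scoring above threshold.

    searches_a and searches_b map a saved search name to its search string.
    """
    flagged = []
    for (name_a, spl_a), (name_b, spl_b) in product(
        searches_a.items(), searches_b.items()
    ):
        score = scorer(spl_a, spl_b)
        if score > threshold:
            flagged.append((name_a, name_b, score))
    # Most similar pairs first, so the likeliest duplicates are reviewed first.
    flagged.sort(key=lambda row: row[2], reverse=True)
    return flagged
```

Note that the number of comparisons grows with the product of the two search counts, so on very large environments it may be worth filtering out disabled or unscheduled searches before scoring.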

Happy Splunking!

Hopefully, this script will help you eliminate any parts of your Splunk search infrastructure that are working unneeded overtime, thereby improving your performance.

About Hurricane Labs

Hurricane Labs is a dynamic Managed Services Provider that unlocks the potential of Splunk and security for diverse enterprises across the United States. With a dedicated, Splunk-focused team and an emphasis on humanity and collaboration, we provide the skills, resources, and results to help make our customers’ lives easier.

For more information, visit www.hurricanelabs.com and follow us on Twitter @hurricanelabs.