Fixing high CPU usage in Splunk Stream
If you’re like me, you love Splunk. It’s an amazing tool for monitoring and troubleshooting your systems. But there’s one thing that can drive a Splunk sysadmin crazy: high CPU usage. With that in mind, I’m going to show you how to identify and reduce high CPU usage in Splunk Stream.
Let’s get started!
Introducing Splunk Stream
First, a little about the Splunk App for Stream.
The advantage of this Splunk app is that it collects wire data from many different source types that would otherwise be difficult to capture in Splunk.
As a matter of fact, at Hurricane Labs we use Splunk Stream to collect DNS event data as part of our comprehensive security alerting services. However, despite being admirers of the app’s capabilities, deploying it has been known to take a toll on CPU utilization of the server managing distributed Stream forwarders.
If you’ve had this happen to you, just know, you’re not alone!
Troubleshooting symptoms of high CPU usage in a distributed Splunk Stream deployment
In a distributed deployment, a Splunk Enterprise instance functions as the management node for the Universal Forwarders that collect data. The Splunk Enterprise host runs the splunk_app_stream app, and the Universal Forwarders (UFs) run the Splunk_TA_stream app.
Get an informative overview of this type of deployment with Splunk’s helpful diagram.
Now, once you’ve configured a distributed Splunk Stream deployment, you may see high CPU utilization on the Splunk Enterprise instance where splunk_app_stream is configured. This is often due to the overhead of a large number of Stream UFs sending traffic to the ping endpoint on that host, https://<your_splunk_server>/en-us/custom/splunk_app_stream/ping. By default, this ping happens every 5 seconds.
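To see why this adds up, here’s a back-of-the-envelope calculation (the fleet size below is hypothetical, purely for illustration):

```shell
# Hypothetical fleet size: every Stream UF pings the server once per interval.
FORWARDERS=600
PING_INTERVAL_SECONDS=5   # the default described above

# Average requests per second hitting the ping endpoint
echo $(( FORWARDERS / PING_INTERVAL_SECONDS ))  # 600 / 5 = 120 requests/sec
```

At over a hundred requests per second, each handled by its own short-lived process on the server, the CPU cost becomes easy to see.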
To confirm, tail the splunkd_ui_access.log and watch the incoming requests. If requests to the ping endpoint are flying by, there’s a good chance this is the cause of the high CPU utilization you’re seeing. In the sample logs from my environment, all of these events arrived within about 3 seconds.
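As a quick sketch, a small helper can count how many log lines hit the ping endpoint; the log path in the usage comment assumes a default /opt/splunk install, so adjust it for your environment:

```shell
# Reads access-log lines on stdin and prints how many hit the Stream ping
# endpoint. Works with tail, cat, or any other source of log lines.
count_stream_pings() {
  grep -c 'splunk_app_stream/ping' || true  # grep -c exits 1 when the count is 0
}

# Usage (path assumes a default /opt/splunk install):
#   tail -n 10000 /opt/splunk/var/log/splunk/splunkd_ui_access.log | count_stream_pings
```

If that count is a large fraction of your recent log volume, the pings are the likely culprit.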
You can further confirm the issue by logging into the Splunk Enterprise instance, opening top in your terminal, and pressing the c key to view the command associated with each process. In this example, you’ll see a large number of Python processes related to splunk_app_stream servicing those ping requests.
The exact number will vary with how many clients are checking in, but whenever I see this issue, a batch of these processes shows up consistently as the process list refreshes.
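If you prefer a one-shot view over watching top refresh, a standard Linux ps invocation (my suggestion here, not a step from the original procedure) sorts processes by CPU with their full command lines:

```shell
# List the 15 heaviest processes by CPU with full command lines, so the
# Python processes spawned for the ping endpoint stand out.
ps -eo pcpu,pid,args --sort=-pcpu | head -n 15
```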
Now that you’ve identified this problem, let’s fix it.
The solution to the high CPU utilization problem
Remember when I mentioned that the default ping interval in Stream is every 5 seconds? As you can imagine, a large fleet of forwarders on that interval generates a significant number of incoming connections to your Stream server, which in turn drives up CPU utilization.
To fix this issue, you’ll need to apply a streamfwd.conf to the Splunk_TA_stream app that you push out to the Universal Forwarders running Stream and collecting data. You’ll most likely do this via the Deployment Server, or whatever other configuration management mechanism your environment uses to deliver apps to UFs.
Word of caution: this change MUST be made in the app where the streamfwd binary is running, typically Splunk_TA_stream. If the streamfwd.conf file exists elsewhere, it will not apply. This is different from how you would typically expect Splunk configuration precedence to work.
On my demo system, the streamfwd.conf in the app on the deployment server looks like this:
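As an illustrative sketch, assuming pingInterval is specified in milliseconds (with the default of 5000 matching the 5-second behavior described above), an override that slows the check-in to once per minute would look like the following; confirm the units against the streamfwd.conf spec for your Stream version:

```ini
# Splunk_TA_stream/local/streamfwd.conf
[streamfwd]
# Assumed to be milliseconds (default 5000 = 5 seconds); ping once per minute
pingInterval = 60000
```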
Once this change is made and the UFs pull down the new config, you should see CPU utilization on the Splunk Stream server drop significantly. In larger environments, you may want to increase the pingInterval even further, depending on how many UFs are checking in.
In conclusion, this is a pretty common issue I’ve helped resolve for a number of Hurricane Labs Splunk clients. If you’ve experienced it in your Splunk environment, I hope this tutorial helps you resolve it!
About Hurricane Labs
Hurricane Labs is a dynamic Managed Services Provider that unlocks the potential of Splunk and security for diverse enterprises across the United States. With a dedicated, Splunk-focused team and an emphasis on humanity and collaboration, we provide the skills, resources, and results to help make our customers’ lives easier.
For more information, visit www.hurricanelabs.com and follow us on Twitter @hurricanelabs.