Splunk is robust, scalable software that lets you comb through your diverse machine data, or "big data," far more efficiently than a traditional structured database. However, it's no secret that the more data you feed into Splunk, the more resources are required to power the beast. It's great when Splunk's potential can be maximized, but there are a number of considerations to take into account, particularly when it comes to performing real-time searches.
Knowing about issues before they blow up is pretty important, right? In certain cases, such as when your user-facing environment is going down, you want to know immediately so you can address the problem. If you aren't skimming through your system's logs 24/7 and something like this happens, having to file a postmortem report and praying that you don't lose your job becomes a very real possibility…
Splunk provides the ability to see something as it's happening with its real-time search capability. There's a special feeling production support teams get when a giant screen in the office flashes red like all hell has broken loose, alerting them that something needs to be fixed ASAP – before the boss calls asking what the heck is going on. Or, during the already crazy time of Black Friday, for example, it's good to know immediately that your ecommerce site is fully taxed before it crashes and shoppers move to the next retailer.
For some environments and IT teams, viewing a real-time dashboard and being able to see events and correlations as they’re happening is ideal and may help to mitigate risk.
“The Great” (for Admins)
One great use for real-time searching is for administrators attempting to troubleshoot a data source, or confirm that data is coming in. Running an "all-time real-time" search is extremely helpful when looking to verify new or troubled data sources. This is not a scheduled search, but rather an ad-hoc search that an administrator can write on the fly when they need to verify that data sources are behaving as expected.
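For example, an admin verifying a new data source might run something like the following ad-hoc search with the time range picker set to "All time (real-time)." The index and field names here are hypothetical placeholders – substitute your own:

```
index=my_new_index
| stats count by host, sourcetype, source
```

If events start accumulating in the results table as they arrive, the new input is flowing; once you've confirmed it, stop the search so it releases its resources.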
Although there are good uses for real-time searches, you should still be wary, as real-time searching is taxing to the system. Real-time searches constantly consume your resources (CPU, memory, slots from the maximum number of concurrent searches allowed, etc.). While a real-time search may not require all of the system's resources, it permanently takes away from the resources available to everything else.
Note: If you are looking to use real-time searching, it’s best to include this requirement when building out your infrastructure. Architects and admins can make real-time searching work more efficiently by including more servers or CPU, if they know what you are looking to achieve.
Splunk searches operate "per core." This means that each running search reserves its own CPU core. If a real-time search is always running, it never releases that core for use by other searches. For example, let's say you have a 16-core search head. Using default settings, this search head will allow 22 concurrent searches (# of cores + 6). If 4 real-time searches are running, never releasing their cores, only 18 slots remain – so if 20 scheduled searches fire every hour, at least two of them will be skipped each hour.
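The arithmetic behind that example can be sketched out as follows. The defaults shown correspond to Splunk's `limits.conf` settings `max_searches_per_cpu` and `base_max_searches`; the scenario numbers (16 cores, 4 real-time searches, 20 hourly scheduled searches) are just the figures from the example above:

```python
# Back-of-the-envelope sketch of Splunk's default search-concurrency math.
# Default formula: max_searches_per_cpu * number_of_cpus + base_max_searches
cores = 16
max_searches_per_cpu = 1   # limits.conf default
base_max_searches = 6      # limits.conf default

max_concurrent = max_searches_per_cpu * cores + base_max_searches  # 22

realtime_searches = 4      # always running, never release their slots
scheduled_per_hour = 20    # scheduled searches all firing at the top of the hour

available = max_concurrent - realtime_searches          # 18 slots left
skipped = max(0, scheduled_per_hour - available)        # 2 searches skipped

print(f"limit={max_concurrent}, free={available}, skipped={skipped}")
```

This is a simplification – in practice the scheduler also enforces its own quota as a fraction of total concurrency – but it illustrates why permanently-held real-time slots translate directly into skipped scheduled searches.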
Scheduling these searches to run in short increments can resolve the potential “constant lack of resource” issue, or “core-hogging,” while still having reports available within minutes. The question you need to ask yourself is: “how critical is having data in real-time as opposed to near-real-time?” To answer this, let’s take a look at a scenario.
Real-Time vs. Near-Real-Time Scenario
In our experience with large customers that use Splunk as their SIEM, incoming alerts need to be addressed the way a proper SOC would: immediately. But between engineering work and handling other alerts, it may not be feasible to respond to every alert in real time.
So, what’s the next best option? Responding in near-real-time. I am defining this as “alert me every 5, 10, or 15 minutes if something is blowing up.” The more critical the alert, the more often it runs.
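A near-real-time alert is just a scheduled saved search with a tight cron interval. A hypothetical `savedsearches.conf` stanza for a critical alert running every 5 minutes might look like this (the search string, index, and stanza name are illustrative placeholders):

```
# savedsearches.conf -- hypothetical near-real-time alert, every 5 minutes
[Critical Errors - Near-Real-Time]
search = index=app_logs log_level=ERROR | stats count by host
enableSched = 1
cron_schedule = */5 * * * *
dispatch.earliest_time = -5m
dispatch.latest_time = now
alert_type = number of events
alert_comparator = greater than
alert_threshold = 0
```

Each run covers only the last 5 minutes, completes quickly, and releases its search slot – unlike a real-time search, which would hold that slot forever.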
The benefits of this type of engineering are substantial, such as:
Reducing the amount of time your team spends chasing false positives.
Allowing scheduled searches to cover longer periods of time enables Splunk to report on more events at once. This way, a single search can report on multiple events or incidents, as opposed to alerting per event.
Allowing Splunk to catch up on resources.
A search over a 15-minute window typically completes in about 30 seconds or less (depending on the complexity of the search and the amount of data in that span), which greatly reduces the resources required from the Splunk server.
Creating trends with your data.
Running these searches at intervals over a set timeframe is essential if you're attempting to do anything sophisticated with your data, whereas a real-time search would just tell you "oh look, this event just happened."
For complex work such as trending searches, doing it in real time would probably not create the best end-user experience, since so many resources are already dedicated to the real-time searches themselves. This is why near-real-time scenarios can offer significant benefits over real-time ones.
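As a hypothetical illustration of the trending point, a scheduled search like the following (index and field names are placeholders) buckets a day's worth of errors into 15-minute intervals – something a real-time search, which only surfaces events as they arrive, can't give you:

```
index=app_logs log_level=ERROR earliest=-24h
| timechart span=15m count by host
```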
Want to keep increasing the value of this tool while decreasing your headaches? Bottom line: if Splunk is kept as efficient as possible, your search capability and speed will be optimal. While real-time searches can prove beneficial in some use cases, scheduling saved searches in lieu of real-time searching can get you the results you're looking for nearly as fast – while allocating resources appropriately.