If there aren’t any specific system requirements (such as the aforementioned 16×32 for ES indexers and search heads), the hardware specifications listed here are what a standard Splunk system should look like. For resource-constrained Splunk systems, you may want to start with the Minimum Recommendations and grow from there.
For a smaller deployment, this level of resources per system may not be something that can be practically virtualized. In a lot of cases, it makes sense to dedicate physical hardware to roles that require it (indexers and, depending on your environment, search heads as well) and supplement the other components of a distributed environment with virtual machines. If we have a limited number of resources in the virtual environment that can be dedicated to Splunk, how do we decide how to allocate those resources without impacting the Splunk experience?
Making the right decision
In a resource-constrained virtualization environment, there may be plenty of pressure not to assign the recommended resources to any given Splunk instance. Fortunately, some Splunk systems will generally show lower utilization than others, but there are caveats associated with reducing the resources allocated to each.
In general, we do not recommend reducing the system resources of any Splunk instance below the recommended reference hardware. In reality, most of the systems that fall into this category have usage patterns that are bursty rather than a constant load. So even though a server may show low usage at times, the potential for bursts and growth in a deployment is very real.
Ad hoc search heads
The ad-hoc search head (or ad-hoc search head cluster) consists of search heads not running a premium application such as Splunk Enterprise Security (ES), Splunk App for PCI Compliance, or Splunk IT Service Intelligence (ITSI). This is where any ad-hoc reports or searches will run. In the case of a search head cluster, members work together to handle scheduled searches, whereas an ad-hoc search is executed locally by whichever member the user is directed to by your load balancer. Overall, the search load on either a single search head or a search head cluster varies based on scheduled search load and user volume. Generally speaking, the load on an ad-hoc search head is lower than on search heads running premium apps.
One of the key factors to consider is the scheduler. The search head scheduler is the backbone of your search head, as it prioritizes concurrently running jobs. If you saturate your Splunk system with searches, you’ll typically run into issues like searches being skipped or not completing. This is the primary reason it is recommended to have separate search heads – one for ad-hoc searches, and one for premium apps. Premium apps inherently come with a healthy number of scheduled searches, so it’s best to split those two apart.
If a search head or search head cluster is consistently seeing skipped searches, something is wrong. There are many possible reasons, including inefficiently configured searches (real-time searches are the biggest culprit) and undersized hardware. Always look for skipped searches when troubleshooting any search performance related issue. We have a fairly comprehensive read on this topic that can be found here.
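The scheduler records every skipped search in the _internal index, so you can check for yourself. Below is a minimal sketch of such a search; the field names match what scheduler.log normally emits, but verify them against your own environment and adjust the time range as needed:

```
index=_internal source=*scheduler.log* status=skipped
| stats count BY savedsearch_name, reason
| sort - count
```

This breaks skipped searches down by saved search name and the reason the scheduler gave, which usually points straight at either an inefficient search or a concurrency ceiling.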
That said, falling short of the Splunk reference hardware for a search head by allocating too few resources will decrease the number of concurrent searches that can run on that search head. This is due to how Splunk calculates limits for the scheduler: the number of CPU cores is used to calculate the number of historical searches that can run at once, as well as several other limits.
Reducing the CPU cores would significantly reduce the search capacity and potentially result in skipped searches, but the risk would be localized to whatever search head had reduced CPU allocations.
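To put rough numbers on that, here is a sketch of the relevant limits.conf settings and the arithmetic Splunk applies to them. The values shown are the usual shipped defaults, so treat them as illustrative and confirm them against your own version’s limits.conf:

```
# limits.conf – shipped defaults shown for illustration
[search]
base_max_searches    = 6    # flat baseline added to the CPU-derived value
max_searches_per_cpu = 1    # concurrent historical searches allowed per core
max_searches_perc    = 50   # share of total concurrency the scheduler may use

# Total concurrent historical searches:
#   max_searches_per_cpu * number_of_cpus + base_max_searches
#
#   16 cores: 1 * 16 + 6 = 22 total, ~11 available to the scheduler
#    8 cores: 1 *  8 + 6 = 14 total,  ~7 available to the scheduler
```

Cutting a search head from 16 cores to 8 therefore removes roughly a third of its scheduled-search capacity before you even account for the searches themselves running slower.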
If you are considering running a search head below the reference hardware recommendations, keep a few things in mind.
- Ensure your search requirements are very light and data volume to be searched is low
- Don’t have a lot of Splunk users that need to use the system concurrently
- Don’t have a ton of saved searches
- Don’t run a premium app (ES, ITSI, etc.) on a search head with resources below the Reference Hardware
- Accept the risk that search performance may suffer and scheduled searches may be skipped
With these limitations, it may not make sense to jeopardize the usability of a Splunk instance by reducing the resources of a search head.
Cluster Master Node
The master node is responsible for directing replication and search traffic in the environment. It is a critical system in terms of maintaining environment stability and redundancy. Realistically, the load on this server is generally going to be pretty light except when it is performing bucket fix-up tasks across the indexers. We would not recommend skimping on resources for this box even if it appears to have low utilization, since its role is inherently important for ensuring the integrity of the data in Splunk across the cluster.
Deployment Server
The deployment server is a common system that we see undersized. Ryan O’Connor has a blog post on our site that covers this in pretty good depth – Splunk Answers: Dealing with undersized deployment servers.
To summarize – the performance of the DS is highly dependent on the number of clients you have checking in, and the resources allocated to it affect your ability to make configuration changes to Universal Forwarders. Starving this system of resources can result in issues with pushing out configuration changes to forwarders (I have seen undersized systems become unusable in this situation). Increasing the phoneHomeIntervalInSecs setting can be an effective way to better spread out the load, but it does delay the propagation of changes to UFs.
As an example, with ~500 clients, a deployment isn’t at the 2,000-client threshold that Splunk uses for its 12c/12gb recommendation, but this is still a sizable number. If resources are reduced on the deployment server (which should never drop below 4c/4gb), we would want to pair that with an increase in the phone home interval – to something significantly higher, like 5-10 minutes.
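For reference, the phone home interval lives in deploymentclient.conf on the forwarders themselves (usually pushed out as an app before you shrink the DS). The snippet below is a sketch – the target URI is a placeholder, and the 600-second interval is simply the 10-minute end of the range mentioned above:

```
# deploymentclient.conf on each forwarder
[deployment-client]
# default is 60 seconds; 600 spreads check-ins out to every 10 minutes
phoneHomeIntervalInSecs = 600

[target-broker:deploymentServer]
# placeholder – point this at your own deployment server
targetUri = deploymentserver.example.com:8089
```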
Heavy Forwarders
Heavy forwarders are systems that are primarily used for tasks such as receiving syslog or calling APIs. In the case of syslog, two components generally work together to ingest logs – syslog-ng and Splunk (either a UF or an HF). As a result, system load ends up being dependent on the volume of syslog and on the choice you make between HF and UF (https://www.splunk.com/blog/2016/12/12/universal-or-heavy-that-is-the-question.html). There is an inherent risk of data loss if a Heavy Forwarder is starved for resources and unable to consume all incoming syslog. This is due to the connectionless nature of UDP syslog (as opposed to TCP), and also to the fact that Splunk may not be able to process the data as fast as it arrives, resulting in blocked queues.
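As a sketch of that division of labor, a common pattern is to have syslog-ng write incoming messages to per-host files and let the co-located forwarder simply monitor that directory. The path, index, and sourcetype below are placeholders – adjust them to whatever your syslog-ng configuration actually writes:

```
# inputs.conf on the HF/UF sitting next to syslog-ng
# assumes syslog-ng writes incoming messages to /var/log/remote/<host>/syslog.log
[monitor:///var/log/remote/*/syslog.log]
sourcetype = syslog
index = network
# pull the originating host from the 4th path segment (/var/log/remote/<host>/...)
host_segment = 4
```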
So how can you be more lean with a Heavy Forwarder? With a lower volume of syslog, and with your Universal Forwarders sending logs directly to your indexers (rather than through a Heavy Forwarder), we have seen systems smaller than 8c/8gb work successfully at some clients. Typically these are environments with licenses below 50GB/day, where syslog data makes up a small percentage of the overall data ingested.
We certainly would not recommend anything smaller than 4c/4gb for a Heavy Forwarder, and you must understand that that specification is well below the Splunk reference hardware, so your results may vary. Running at this size would require some potentially advanced management and fine-tuning of data ingestion. One thing that has helped systems sustain proper data ingestion while cutting down on resources is the approach from the aforementioned blog post: using a Universal Forwarder in place of a Heavy Forwarder. It’s a detailed topic, but we included it because it lets you move a lot of the heavy lifting to your indexers and allows for a leaner syslog collector.
Where not to cut corners
Some Splunk systems should always meet or exceed the minimum system requirements – this is especially important for indexers (don’t ignore the 800-1,200 IOPS requirement; SSDs are your friends here) and search heads (especially those running premium apps such as Splunk Enterprise Security). While it’s tempting to think your smaller environment will do fine on a lower-spec indexer or a search head with fewer than 16 cores, ES still runs the same type of searches regardless of your license size. Out of the box, it even runs some searches in real time. The number one reason we see for poor ES performance (and even a desire to abandon the product entirely) is a lack of sufficient resources on the indexers and search heads powering ES.
Hopefully this helps you make an informed decision on what changes to make to the resources allocated to your Splunk systems. If you have any specific questions about your deployment situation, don’t hesitate to reach out. Thanks!
P.S. Part 2 is now available too! Check it out!