Using AWS Auto Scaling Groups with Splunk

By Tom Kopchak | Published on: November 19th, 2021

AWS Auto Scaling groups allow you to dynamically allocate resources for different usage scenarios. This approach can be very effective for letting an application scale with an unpredictable, varying number of users and the compute resources they need. However, for auto scaling to work properly, the application must be designed with this type of technology in mind.

How do I leverage AWS Auto Scaling groups for my core Splunk Infrastructure?

You don’t!

Can you elaborate?

Sure! Traditional Splunk infrastructure assumes that the components of an environment persist for as long as the environment is running. Splunk Enterprise was designed for a conventional deployment strategy: the software runs on servers that each have a specific role in the environment. It has since been adapted to be more cloud-centric, but even today, for most Splunk infrastructure components, that hasn’t changed. Splunk Enterprise running in the cloud is still deployed using an infrastructure as a service (IaaS) model.

When a search is executed, that search will be distributed across all of the active indexers in order to retrieve data stored on each host. The disappearance of an indexer when a search is being run can lead to incomplete results or errors in the search execution.

Additionally, there is significant overhead for data replication when an indexer becomes unavailable in an indexer replication cluster. By default, an indexer replication cluster will always work to ensure that the search and replication factors for your data are met. This means that within a few minutes of an indexer going down, all of the remaining indexers will be working to make up for the missing node by copying, resyncing, and rebuilding the bucket copies that were lost with it. This can significantly change the amount of data stored on each indexer, and it adds compute overhead for rebuilding searchable copies wherever the search factor is no longer met.
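To get a feel for how much work a single lost indexer can trigger, here is a rough Python sketch. It is a toy model, not how Splunk actually places or repairs buckets: it assumes bucket copies are spread uniformly at random across peers, and the function names and peer names (idx1 through idx6) are made up for illustration.

```python
import random


def place_buckets(num_buckets, peers, replication_factor, seed=0):
    """Toy placement: each bucket gets copies on `replication_factor` random peers."""
    rng = random.Random(seed)
    return {
        bucket: set(rng.sample(peers, replication_factor))
        for bucket in range(num_buckets)
    }


def buckets_needing_fixup(placement, lost_peer):
    """Buckets that lost one copy and need a new copy on a surviving peer."""
    return [bucket for bucket, holders in placement.items() if lost_peer in holders]


if __name__ == "__main__":
    peers = [f"idx{n}" for n in range(1, 7)]  # six hypothetical indexers
    placement = place_buckets(num_buckets=10_000, peers=peers, replication_factor=3)
    affected = buckets_needing_fixup(placement, lost_peer="idx1")
    # With uniform placement, roughly replication_factor / len(peers) of all
    # buckets held a copy on the lost peer.
    print(f"{len(affected)} of {len(placement)} buckets need a new copy")
```

Under these assumptions, about half of all buckets need a replacement copy after one of six indexers disappears, which is why an Auto Scaling group that regularly terminates indexers keeps the cluster in a near-constant repair cycle.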

This also becomes a problem for search heads when Search Head Clustering (SHC) is in place. There are two types of replication that occur within a search head cluster: configuration replication and KVStore replication, both of which happen independently.

While the configuration replication can often recover from an extended outage without too much trouble (a destructive resync may still be necessary on the recovered member), it’s likely that the KVStore will become stale and require a complete resync in order to function properly after an extended outage of the node.

I don’t know about you, but fixing KVStore replication issues ranks pretty low on the list of things I enjoy doing when working with Splunk.

A separate issue with search head clustering is the captain election process. Due to how the Raft protocol functions, a search head cluster requires a quorum of more than 50% of the total number of configured nodes (not just the currently active ones) in order to elect a captain. If half or more of the nodes are unavailable, captain election won’t happen. Without a captain, a search head cluster won’t run scheduled searches until someone intervenes manually.
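The quorum math is easy to get wrong, so here is a small Python sketch of it. This is only the majority arithmetic described above, not Splunk’s election code, and the function names are hypothetical.

```python
def votes_needed(configured_members: int) -> int:
    """Smallest strict majority of the configured member count."""
    return configured_members // 2 + 1


def can_elect_captain(configured_members: int, available_members: int) -> bool:
    """True when the members that are up form a strict majority."""
    return available_members >= votes_needed(configured_members)


if __name__ == "__main__":
    for configured, available in [(3, 2), (4, 2), (5, 3), (5, 2)]:
        ok = can_elect_captain(configured, available)
        print(f"{configured} configured, {available} up -> election possible: {ok}")
```

Note the four-member case: with two members down, the remaining two are exactly half of the configured count, which is not a majority, so no captain can be elected. An Auto Scaling group that scales in aggressively can hit this condition without any member actually failing.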

Does SmartStore help with this?

Yes, to some degree. SmartStore removes several of the data replication challenges associated with the scaling of indexers by decoupling the storage from the compute hardware. Not having to re-replicate all of the data on an indexer significantly reduces the overhead associated with bringing new indexers online.

That said, there are still some important issues to consider:

There is overhead associated with filling the SmartStore cache from S3. Ideally, the majority of your searches in SmartStore will be served from the cache, which should live on NVMe instance storage on an i3en or similar instance. Having to download buckets from S3 significantly increases search times, as the data is not immediately available for use.
Hot buckets are still replicated when SmartStore is in use, so there is some overhead associated with re-replicating this data once an indexer goes offline.
Buckets on an indexer are NOT uploaded to remote storage upon instance shutdown.
The loss of an indexer during a search will result in errors about a search peer becoming unavailable. Search results during this failure may be unpredictable, and searches may completely fail to execute properly.
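The cache point in the first item can be made concrete with a back-of-the-envelope model in Python. Everything here is an assumption for illustration: the function name is hypothetical, and the timings are placeholders, not measurements from any Splunk or S3 environment.

```python
def expected_bucket_seconds(cache_hit_ratio, local_read_s, remote_fetch_s):
    """Average time to make one bucket available to a search.

    cache_hit_ratio: fraction of needed buckets already in the local cache.
    local_read_s:    placeholder cost of using a cached bucket.
    remote_fetch_s:  placeholder cost of fetching a bucket from remote storage.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return cache_hit_ratio * local_read_s + (1.0 - cache_hit_ratio) * remote_fetch_s


if __name__ == "__main__":
    # Placeholder timings only; substitute numbers measured in your environment.
    for ratio in (1.0, 0.5, 0.0):
        seconds = expected_bucket_seconds(ratio, local_read_s=0.5, remote_fetch_s=4.0)
        print(f"cache hit ratio {ratio:.0%}: {seconds:.2f}s per bucket on average")
```

A freshly launched indexer starts with an empty cache, so it sits at the 0% end of this range until enough searches have pulled its buckets down from S3.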

Could this advice change in the future?

Absolutely. As Splunk becomes more cloud-centric, it’s likely that there will be improvements and changes aimed at making the infrastructure more flexible. There are also several projects for running Splunk with Docker or Kubernetes that may add support for this type of feature in the future.

Other Notes on Auto Scaling

Auto Scaling groups won’t (currently) work for a Splunk SOAR/Splunk Phantom cluster either. All of the nodes in one of those clusters need to be up and running all the time, and downtime of an individual cluster member can result in members quickly going out of sync and needing manual intervention to restore clustering functionality.

Conclusion

Hopefully, this guide helps you make the right decisions for designing scaling for your Splunk environment–and keeps you from needing to troubleshoot another broken search head cluster. If you have any questions about Splunk deployment best practices, reach out to us!
