It was an otherwise uneventful Thursday afternoon. I was going about my day, helping Splunk clients onboard data and putting some finishing touches on my .conf22 session presentation, when I got a dreaded email from AWS:
Without any fanfare, and in a tone not much different from a notice that your order of beef jerky had been canceled, AWS was letting me know that nearly 15 TB of my data was now gone and unrecoverable.
Fortunately, this volume was attached to a Splunk indexer that was part of a Splunk indexer cluster. With proper configuration and planning, the failure of one (or possibly more) indexers doesn’t impact the operation of the cluster or cause any loss of data. In fact, the most obvious sign that there was a problem was a set of errors in the search UI indicating that one of the indexers wasn’t able to return data (which, given the massive failure that occurred, is expected).
While the cluster can continue operating in this state, it’s important to take action to resolve the issue before it turns into a cascading failure.
While Splunk can continue to run in a cluster with a failed indexer, the cluster manager node will immediately start replicating data across the remaining peers to account for the failure and restore the desired state. This can pose a problem if the surviving indexers don’t have enough free disk space to absorb the additional copies, since the same data now has to fit on one less node. It’s a particular concern in smaller clusters, where losing a single member can take 25%-33% of the total storage offline at once.
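As a rough illustration of why that headroom matters, here’s a minimal back-of-the-envelope sketch. The cluster size, data volume, and even-distribution assumption are all hypothetical (Splunk doesn’t report numbers this way), but the arithmetic shows how quickly per-peer storage grows when a node drops out:

```python
# Back-of-the-envelope estimate (illustrative assumptions, not Splunk output):
# how much extra storage each surviving indexer absorbs when one peer in an
# evenly balanced cluster fails and the cluster manager re-replicates its data.

def storage_per_peer(total_indexed_tb, replication_factor, peer_count):
    """Storage each peer holds, assuming copies are spread evenly."""
    return total_indexed_tb * replication_factor / peer_count

# Hypothetical example: 4 indexers, 30 TB of unique data, replication factor 3.
before = storage_per_peer(30, 3, peer_count=4)
after = storage_per_peer(30, 3, peer_count=3)   # one peer lost
print(f"Per-peer storage: {before:.1f} TB -> {after:.1f} TB "
      f"(+{after - before:.1f} TB each)")
```

In that hypothetical, each surviving indexer suddenly needs roughly 7.5 TB of additional free space, which is exactly the kind of pressure that can cascade if it isn’t there. With that in mind, here’s how I handled the failure: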
1. Don’t panic. You’ve hopefully planned for this scenario and are ready to show off your Splunk administration ninja skills. Or maybe an indexer just failed and Google landed you on this page. Either way, you’ll get this handled.
2. Let Splunk continue to operate. Unless you’re low on disk space, allow the cluster manager to continue replicating data and fixing up the cluster; it’s working behind the scenes to make additional copies of your data to make up for the lost peer node.
3. Resolve the storage issue. In my case, that meant creating a new EBS volume in AWS, attaching it to the correct instance, and configuring it in the operating system; a rough sketch of the AWS portion follows this list. It was a quick process, and one of the benefits of using a cloud provider like AWS. If the indexer is a physical host, you may instead need to acquire replacement hardware or a replacement disk.
4. Get Splunk back up and running. If an indexer has lost all of its indexed data, the instance will rejoin the cluster and immediately begin accepting and replicating data, and the cluster manager will work to restore the copies that were lost in the storage failure.
5. Monitor the recovery process. Replication will take some time to complete, and while it runs you’ll see the cluster manager UI update to reflect the fixup activity; a small monitoring sketch also follows this list. It’s normal for the replication factor to be met before the search factor.
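For step 3, here’s a minimal sketch of the volume-replacement portion using boto3. The region, availability zone, size, instance ID, and device name are all illustrative placeholders rather than values from my environment, and the filesystem and mount work still has to happen on the host afterwards:

```python
# Hypothetical sketch of step 3: replace a failed EBS volume with boto3.
# Region, AZ, size, instance ID, and device name are placeholders --
# substitute the values for your own indexer.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

# Create a replacement volume sized for the indexer's lost storage.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",   # must match the indexer's AZ
    Size=16000,                      # GiB; roughly the size of the lost volume
    VolumeType="gp3",
    Encrypted=True,
    TagSpecifications=[{
        "ResourceType": "volume",
        "Tags": [{"Key": "Name", "Value": "splunk-idx-replacement"}],
    }],
)
volume_id = volume["VolumeId"]

# Wait until the new volume is available, then attach it to the indexer.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
ec2.attach_volume(
    VolumeId=volume_id,
    InstanceId="i-0123456789abcdef0",  # placeholder instance ID
    Device="/dev/sdf",                 # shows up as /dev/nvme*n1 on Nitro instances
)

# From here, the OS-level work (create a filesystem, mount it at the Splunk
# index path, fix ownership) happens on the instance itself.
```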
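And for step 5, a small monitoring sketch against the cluster manager’s REST API. The hostname and credentials are placeholders, and the endpoint and field names (`cluster/master/generation`, `replication_factor_met`, `search_factor_met`) are my assumptions about the clustering API; verify them against the REST API reference for your Splunk version, or simply watch the Indexer Clustering dashboard in the UI.

```python
# Hypothetical sketch of step 5: poll the cluster manager's REST API until the
# replication and search factors are met again. Endpoint and field names are
# assumptions and may differ by Splunk version -- check the REST API reference.
import time
import requests

MANAGER = "https://cluster-manager.example.com:8089"  # placeholder hostname
AUTH = ("admin", "changeme")                          # placeholder credentials

def factor_met(value):
    # Splunk may return these flags as "0"/"1" strings or booleans.
    return str(value).lower() in ("1", "true")

def fixup_status():
    """Return (replication_factor_met, search_factor_met) from the manager."""
    resp = requests.get(
        f"{MANAGER}/services/cluster/master/generation",
        params={"output_mode": "json"},
        auth=AUTH,
        verify=False,  # self-signed certificates are common on port 8089
    )
    resp.raise_for_status()
    content = resp.json()["entry"][0]["content"]
    return (factor_met(content["replication_factor_met"]),
            factor_met(content["search_factor_met"]))

while True:
    rf_met, sf_met = fixup_status()
    print(f"Replication factor met: {rf_met}  Search factor met: {sf_met}")
    if rf_met and sf_met:
        break
    time.sleep(60)  # fixup for multi-TB indexes can take hours
```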