It was an otherwise uneventful Thursday afternoon. I was going about my day, helping Splunk clients onboard data and putting some finishing touches on my .conf22 session presentation, when I got a dreaded email from AWS:
Without any fanfare, and in a tone not much different from a notice that your order of beef jerky had been canceled, AWS was letting me know that nearly 15 TB of my data was now gone and unrecoverable.
Fortunately, this volume was attached to a Splunk indexer that was part of a Splunk indexer cluster. With proper configuration and planning, the failure of one (or possibly more) indexers doesn’t impact the operation of the cluster or cause any loss of data. In fact, the most obvious sign that there was a problem was a set of errors in the search UI indicating that one of the indexers wasn’t able to return data (which, given the massive failure that occurred, is expected).
While the cluster can continue operating in this state, it’s important to take action to resolve the issue before it turns into a cascading failure.
While Splunk can continue to run in a cluster with a failed indexer, the cluster manager node will immediately start replicating data across the remaining peers to account for the failure and restore the desired state. This can pose a problem if the surviving indexers don’t have enough free disk space to absorb the additional copies, since the same data now has to fit on one less node. It’s a particular concern in smaller clusters, where losing a single member can take 25%-33% of the total storage offline at once.
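As a rough illustration of why that headroom matters, here’s a minimal back-of-the-envelope sketch. The cluster size, data volume, and even-distribution assumption are all hypothetical (Splunk doesn’t report numbers this way), but the arithmetic shows how quickly per-peer storage grows when a node drops out:

```python
# Back-of-the-envelope estimate (illustrative assumptions, not Splunk output):
# how much extra storage each surviving indexer absorbs when one peer in an
# evenly balanced cluster fails and the cluster manager re-replicates its data.

def storage_per_peer(total_indexed_tb, replication_factor, peer_count):
    """Storage each peer holds, assuming copies are spread evenly."""
    return total_indexed_tb * replication_factor / peer_count

# Hypothetical example: 4 indexers, 30 TB of unique data, replication factor 3.
before = storage_per_peer(30, 3, peer_count=4)
after = storage_per_peer(30, 3, peer_count=3)   # one peer lost
print(f"Per-peer storage: {before:.1f} TB -> {after:.1f} TB "
      f"(+{after - before:.1f} TB each)")
```

In that hypothetical, each surviving indexer suddenly needs roughly 7.5 TB of additional free space, which is exactly the kind of pressure that can cascade if it isn’t there. With that in mind, here’s how I handled the failure: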
1. Don’t panic. You’ve hopefully planned for this scenario and are ready to show off your Splunk administration ninja skills. Or maybe an indexer just failed and Google landed you on this page. Either way, you’ll get this handled.
2. Let Splunk continue to operate. Unless you’re low on disk space, allow the cluster manager to continue replicating data and fixing up the cluster; it’s working behind the scenes to make additional copies of your data to make up for the lost peer node.
3. Resolve the storage issue. In my case, that meant creating a new EBS volume in AWS, attaching it to the correct instance, and configuring it in the operating system; a rough sketch of the AWS portion follows this list. It was a quick process, and one of the benefits of using a cloud provider like AWS. If the indexer is a physical host, you may instead need to acquire replacement hardware or a replacement disk.
4. Get Splunk back up and running. If an indexer has lost all of its indexed data, the instance will rejoin the cluster and immediately begin accepting and replicating data, and the cluster manager will work to restore the copies that were lost in the storage failure.
5. Monitor the recovery process. Replication will take some time to complete, and while it runs you’ll see the cluster manager UI update to reflect the fixup activity; a small monitoring sketch also follows this list. It’s normal for the replication factor to be met before the search factor.
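For step 3, here’s a minimal sketch of the volume-replacement portion using boto3. The region, availability zone, size, instance ID, and device name are all illustrative placeholders rather than values from my environment, and the filesystem and mount work still has to happen on the host afterwards:

```python
# Hypothetical sketch of step 3: replace a failed EBS volume with boto3.
# Region, AZ, size, instance ID, and device name are placeholders --
# substitute the values for your own indexer.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

# Create a replacement volume sized for the indexer's lost storage.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",   # must match the indexer's AZ
    Size=16000,                      # GiB; roughly the size of the lost volume
    VolumeType="gp3",
    Encrypted=True,
    TagSpecifications=[{
        "ResourceType": "volume",
        "Tags": [{"Key": "Name", "Value": "splunk-idx-replacement"}],
    }],
)
volume_id = volume["VolumeId"]

# Wait until the new volume is available, then attach it to the indexer.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
ec2.attach_volume(
    VolumeId=volume_id,
    InstanceId="i-0123456789abcdef0",  # placeholder instance ID
    Device="/dev/sdf",                 # shows up as /dev/nvme*n1 on Nitro instances
)

# From here, the OS-level work (create a filesystem, mount it at the Splunk
# index path, fix ownership) happens on the instance itself.
```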
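And for step 5, a small monitoring sketch against the cluster manager’s REST API. The hostname and credentials are placeholders, and the endpoint and field names (`cluster/master/generation`, `replication_factor_met`, `search_factor_met`) are my assumptions about the clustering API; verify them against the REST API reference for your Splunk version, or simply watch the Indexer Clustering dashboard in the UI.

```python
# Hypothetical sketch of step 5: poll the cluster manager's REST API until the
# replication and search factors are met again. Endpoint and field names are
# assumptions and may differ by Splunk version -- check the REST API reference.
import time
import requests

MANAGER = "https://cluster-manager.example.com:8089"  # placeholder hostname
AUTH = ("admin", "changeme")                          # placeholder credentials

def factor_met(value):
    # Splunk may return these flags as "0"/"1" strings or booleans.
    return str(value).lower() in ("1", "true")

def fixup_status():
    """Return (replication_factor_met, search_factor_met) from the manager."""
    resp = requests.get(
        f"{MANAGER}/services/cluster/master/generation",
        params={"output_mode": "json"},
        auth=AUTH,
        verify=False,  # self-signed certificates are common on port 8089
    )
    resp.raise_for_status()
    content = resp.json()["entry"][0]["content"]
    return (factor_met(content["replication_factor_met"]),
            factor_met(content["search_factor_met"]))

while True:
    rf_met, sf_met = fixup_status()
    print(f"Replication factor met: {rf_met}  Search factor met: {sf_met}")
    if rf_met and sf_met:
        break
    time.sleep(60)  # fixup for multi-TB indexes can take hours
```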