A three member search head cluster can tolerate a single member failure. There, I said it. Maybe in some future version of Splunk this functionality will change – but in Splunk 6.6.3, this is a fact. For some reason, the myth that it can't seems to persist throughout the Splunk community, so I wanted to clear it up once and for all. But not without backing up my claim with supporting evidence and Internet research, of course.
For those of you who aren’t familiar with search head clustering, the idea is that instead of having a single search head handle the scheduling and execution of jobs, you can utilize multiple search heads that stay in sync with each other. One of the members of your search head cluster will have the role of search head cluster captain and coordinate the scheduling of jobs. This captain role is set dynamically via an election process. The way this technology is designed gets you three key benefits: horizontal scaling, high availability, and no single point of failure. This is where we first encounter some evidence that a three member search head cluster has the potential to survive one member going down, and it’s found at the very start of Splunk’s documentation on search head clustering.
If Splunk wasn’t able to tolerate a single member failure, then we’d have to assume that it didn’t provide one of the key benefits: “high availability”. As a result, instead of proving that Splunk can’t tolerate a single member failure, I’d like to take the approach of proving that Splunk can, in fact, tolerate a single member failure. So let’s look at the evidence.
The explanation I usually get for the myth that a three member search head cluster can't tolerate a single member failure is something along the lines of:
“If the captain goes down, only 2 members remain. Both members will be greedy and constantly vote for themselves since there is no majority. As a result, a new captain can not be elected.”
There are two things that need to be broken down to understand why the above is a myth:
1. The understanding of how many members are needed to elect a captain.
2. The captain election process.
How many members are needed for election
If one search head cluster member were to go down in a 3 member cluster, 2 members would remain. That would mean that roughly 66% of members remained and, according to the docs, enough members exist for a new and successful election. Now, if 2 members went down, only about 33% of members would remain, and yes – the failure would not be tolerated. My source for this statement is this excerpt from Splunk’s very own documentation on Captain Election:
“To become captain, a member needs to win a majority vote of all members. For example, in a seven-member cluster, election requires four votes. Similarly, a six-member cluster also requires four votes. The majority must be a majority of all members, not just of the members currently running. So, if four members of a seven-member cluster fail, the cluster cannot elect a new captain, because the remaining three members are fewer than the required majority of four.”
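Since captain election requires a majority (more than half) of all members of the cluster, a 3 member search head cluster needs 2 votes, and the 2 members that remain after 1 failure can supply them. So we can rule out point number 1 and know that in a 3 member search head cluster, 1 failure would be tolerated for this criterion.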
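The majority rule quoted above can be checked with a few lines of arithmetic. This is only a sketch of the rule as the documentation excerpt states it, not Splunk code; the function names votes_needed and can_elect_captain are made up for illustration.

```python
def votes_needed(total_members: int) -> int:
    """Votes required to win: a strict majority of ALL members,
    whether or not they are currently running."""
    return total_members // 2 + 1


def can_elect_captain(total_members: int, running_members: int) -> bool:
    """True if the running members alone could supply a majority vote."""
    return running_members >= votes_needed(total_members)


if __name__ == "__main__":
    # The examples from the documentation excerpt, plus our 3 member cluster.
    for total, running in [(7, 3), (6, 4), (3, 2), (3, 1)]:
        print(total, "members,", running, "running ->",
              "election possible" if can_elect_captain(total, running)
              else "no election possible")
```

With 3 members the required majority is 2, so losing one member still leaves enough voters, while losing two does not.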
The captain election process
In order to elect a new captain, Splunk has to go through an election process. In our three member example, we want to first make sure we have enough members for a successful election. So, with our 66% we should be fine.
In order to elect a new captain, Splunk considers a couple of factors for all running members. For starters, the cluster will want to elect a preferred member and one that is in-sync. We’re going to assume that all of our cluster members are synchronizing and preferred. If you have a cluster where members aren’t synchronizing, then please see a different blog because that is likely a whole host of other issues. If you don’t know what I’m talking about when I say “preferred captain”, then very likely this setting is default in your environment and all members are preferred. There is typically no need to change this setting.
In our specific case, according to the first criterion for captain election, each of our remaining members has the potential to become captain when one member fails because they are all preferred and all in sync. But there is a second criterion here, and one that will differentiate these two remaining members and allow one to become the almighty new captain in the case of a failure. That criterion is driven by the configuration setting election_timeout_ms.
I’ve never seen an instance in my Splunk career where I’ve had to alter this setting, but it’s the crux of how a three member search head cluster will tolerate one member going down. Search head clustering uses something called the Raft consensus algorithm. If you’re unfamiliar with this, and you’d like to know more, then I’d suggest checking out this really handy visualization on The Secret Lives of Raft. In short though, this election_timeout_ms is what is going to give preference to one of your two remaining members. In order to demonstrate this, I’ve given you an example of what a cluster would look like in three different states.
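To make the timeout idea a bit more concrete before looking at the cluster states, here is a toy sketch of how a Raft-style election breaks the "both members vote for themselves" deadlock. It is a simplification of the general Raft approach and an assumption about the mechanics, not Splunk's actual implementation: the function elect_captain, the member names other than ccnprodshc03, the base timeout value, and the uniform randomization range are all hypothetical.

```python
import random


def elect_captain(total_members, running, base_timeout_ms=1000, seed=None):
    """Toy Raft-style election among the members still running.

    Assumption: each member waits a randomized timeout in
    [base_timeout_ms, 2 * base_timeout_ms) before standing for election.
    Returns the winning member's name, or None if no majority is possible.
    """
    rng = random.Random(seed)
    majority = total_members // 2 + 1
    if len(running) < majority:
        return None  # not enough running members to ever reach a majority

    # Each running member draws its own timeout; they rarely coincide.
    timeouts = {m: rng.uniform(base_timeout_ms, 2 * base_timeout_ms)
                for m in running}

    # The member whose timeout expires first becomes the candidate
    # and votes for itself.
    candidate = min(running, key=lambda m: timeouts[m])
    votes = 1

    # The other members have not yet timed out or voted in this term,
    # so they grant their vote to the first candidate that asks.
    for member in running:
        if member != candidate:
            votes += 1

    return candidate if votes >= majority else None


if __name__ == "__main__":
    # ccnprodshc03 (the old captain) is down; two hypothetical peers remain.
    survivors = ["ccnprodshc01", "ccnprodshc02"]
    print("New captain:", elect_captain(3, survivors))
    print("With only one survivor:", elect_captain(3, ["ccnprodshc01"]))
```

The point of the sketch is that the two survivors do not start their campaigns at the same instant. Whichever one times out first asks for a vote before the other has voted for itself, collects 2 of 3 votes, and wins.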
Breakdown of a Captain Failure
This is our perfect working cluster. Everything is synchronized and happy. You’ll notice that ccnprodshc03 is the captain. So to simulate an outage, I will shut off Splunk on ccnprodshc03.