Once upon a time there was a Splunk engineer who wasn’t quite sure how something worked, so he wanted to test it out. He also was a good Splunk engineer and decided to test it in the lab instead of production. The following tale chronicles his efforts to better understand the inner workings of search head cluster replication.
For whatever reason, I’ve been seeing a number of issues crop up related to search head clustering recently. I’m not going to imply that this technology is any less reliable than others, but it does introduce some management complexity to your environment, especially when it comes to troubleshooting. One of these issues involved a search head cluster whose replication was out of sync and needed to be fixed.
During normal operation, replication in a search head cluster is pretty straightforward. Apps get pushed from the deployer to the search head cluster members. As part of that push, the deployer merges any configuration from an app’s local directory into its default directory, so everything arrives in default on the members. This means that the app directories on your deployer will look different from what ends up on the members, and that’s expected behavior for a search head cluster.
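For reference, the push itself is done with the apply shcluster-bundle command on the deployer. Here is a minimal sketch of wrapping it. It assumes Splunk is installed in /opt/splunk unless SPLUNK_HOME says otherwise. The function name and the example host are made up for illustration.

```shell
# Assumption: Splunk lives in /opt/splunk unless SPLUNK_HOME is already set.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"

# Illustrative wrapper, meant to be run on the deployer.
# $1 is the management URI of any one cluster member, for example
# https://sh1.example.com:8089 (a hypothetical host name).
push_shcluster_bundle() {
  # Pushes the apps staged on the deployer out to the cluster members.
  # The CLI may ask for credentials and for confirmation.
  "$SPLUNK_HOME/bin/splunk" apply shcluster-bundle -target "$1"
}

# Usage, on the deployer:
#   push_shcluster_bundle https://sh1.example.com:8089
```

After a push like this, comparing an app’s directory on the deployer with the same app on a member is a quick way to see the local-into-default merge for yourself.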
While Splunk is running, users will also make changes of their own. These changes end up in the local directory of the apps on the search head cluster members, just like they would on a standalone Splunk instance. These changes, known as replicated changes, are automatically replicated across the cluster. The process is coordinated by the captain, the member responsible for coordinating the operation of the search head cluster.
This works great until it doesn’t, and an admin needs to intervene to fix the cluster. What do you do then?
What gets replicated?
By default, not every config file in a search head cluster gets replicated. Runtime changes are replicated only when they are made through Splunk Web, the Splunk CLI, or the REST API. Most notably, the cluster does not replicate changes made manually, such as direct edits to configuration files.
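If you’re curious which configuration file types a member is set to replicate, one way to check is to look at the effective [shclustering] settings from server.conf with btool and filter for the conf_replication_include entries. This is a rough sketch rather than an official procedure. It assumes /opt/splunk as the install path when SPLUNK_HOME is unset, and the function name is my own.

```shell
# Assumption: Splunk lives in /opt/splunk unless SPLUNK_HOME is already set.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"

# Illustrative helper: print the conf file types flagged for replication.
# btool prints the merged, effective settings, so this reflects both the
# shipped defaults and any local overrides in server.conf.
list_replicated_conf_types() {
  "$SPLUNK_HOME/bin/splunk" btool server list shclustering \
    | grep '^conf_replication_include\.'
}

# Usage, on a search head cluster member:
#   list_replicated_conf_types
```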
This doesn’t, however, mean that all of your admins will make changes in a way that gets replicated properly, especially if they have operating system-level access to the Splunk instance. You might also run into a scenario where you need to restore from a backup, or where a search head cluster member has failed and you want to ensure everything is back in sync. What do you do?
Fixing replication issues
Fortunately, Splunk has mechanisms available for dealing with these sorts of problems. Before trying any of these steps, be sure to understand what these commands are doing, as one mistake could result in the loss of configuration (backups are always a good idea, too). Additionally, for this specific example, I’m assuming that the search head cluster is otherwise in a healthy state.
First, it’s important to determine which member of the search head cluster is the captain. The captain controls replication, and it’s the source of truth for the local files on search head cluster members. This means that if the captain’s configuration is incorrect and you force a resync, all the other members will end up incorrectly configured too.
To determine which member is the captain, run the $SPLUNK_HOME/bin/splunk show shcluster-status command on any member of the cluster.
In this example cluster, the output identifies ccnprodshc03 as the captain.
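If you only want the captain’s name, a small filter over that command’s output can help. The sketch below makes an assumption about the output layout: a "Captain:" section containing a "label : <name>" line, followed by a "Members:" section. Adjust the awk patterns if your version prints something different. The function name is mine, and /opt/splunk is an assumed default.

```shell
# Assumption: Splunk lives in /opt/splunk unless SPLUNK_HOME is already set.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"

# Illustrative helper: print only the captain's label.
# Assumes the status output has a "Captain:" section with a
# "label : <name>" line, followed by a "Members:" section.
captain_label() {
  "$SPLUNK_HOME/bin/splunk" show shcluster-status \
    | awk '/Captain:/ { in_captain = 1; next }
           /Members:/ { in_captain = 0 }
           in_captain && $1 == "label" { print $3 }'
}

# Usage, on any cluster member (the CLI may prompt for credentials):
#   captain_label
```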
Next, for the sake of example, we’ll do something we shouldn’t do under normal search head cluster operation: create a local file by hand within an app on the search head cluster captain. I picked the TA-eset app since it didn’t have any existing local configuration, which makes the change easy to spot.
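Here is roughly what that manual change could look like as a shell helper. Which file gets created is my assumption, since any conf file name would do for the demonstration. The file name is passed in as an argument, and the helper refuses to overwrite anything that already exists. The function name and the /opt/splunk default are also illustrative.

```shell
# Assumption: Splunk lives in /opt/splunk unless SPLUNK_HOME is already set.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"

# Illustrative helper: create a comment-only conf file in an app's local
# directory. $1 is the app name, $2 is the conf file name (your choice).
create_local_test_file() {
  app_local="$SPLUNK_HOME/etc/apps/$1/local"
  mkdir -p "$app_local" || return 1
  # Never clobber real configuration.
  if [ -e "$app_local/$2" ]; then
    echo "refusing to overwrite $app_local/$2" >&2
    return 1
  fi
  printf '# created by hand on the captain for a resync test\n' > "$app_local/$2"
}

# Usage, on the captain, in a lab only:
#   create_local_test_file TA-eset props.conf
```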
Next, we’ll confirm on the other search head cluster members that this local change was not replicated. On those members, TA-eset should still have no local configuration.
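A quick way to do that check on each member is to look at the app’s local directory. The helper below is a small sketch with a made-up name. It assumes the usual etc/apps layout under SPLUNK_HOME, defaulting to /opt/splunk.

```shell
# Assumption: Splunk lives in /opt/splunk unless SPLUNK_HOME is already set.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"

# Illustrative helper: report whether an app has anything in local/.
# $1 is the app name.
check_app_local() {
  app_local="$SPLUNK_HOME/etc/apps/$1/local"
  if [ -d "$app_local" ] && [ -n "$(ls -A "$app_local")" ]; then
    echo "local configuration found in $1:"
    ls -l "$app_local"
  else
    echo "no local configuration in $1"
  fi
}

# Usage, on each non-captain member:
#   check_app_local TA-eset
```

Before the resync, the non-captain members should report no local configuration for the app. Running the same check afterwards shows whether the captain’s file has arrived.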
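Finally, we’ll perform a resync of the replicated config by running splunk resync shcluster-replicated-config on each of the other members, which pulls in our local change from the captain. Keep in mind that the resync replaces the member’s replicated configuration with the captain’s, so anything that exists only on that member is lost.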
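Since backups are always a good idea, here is a sketch that archives the member’s etc/apps and etc/users directories before kicking off the resync. The backup location (BACKUP_DIR, defaulting to /tmp), the archive name, and the function name are all my own assumptions. Only the resync command itself comes from Splunk.

```shell
# Assumption: Splunk lives in /opt/splunk unless SPLUNK_HOME is already set.
SPLUNK_HOME="${SPLUNK_HOME:-/opt/splunk}"

# Illustrative helper: back up etc/apps and etc/users, then resync.
# Run it on the out-of-sync member, not on the captain.
backup_and_resync() {
  # BACKUP_DIR is an assumed setting; point it wherever you keep backups.
  backup_dir="${BACKUP_DIR:-/tmp}"
  archive="$backup_dir/shc-member-etc-$(date +%Y%m%d%H%M%S).tgz"
  # Stop here if the backup fails, so we never resync without one.
  tar -czf "$archive" -C "$SPLUNK_HOME/etc" apps users || return 1
  echo "backup written to $archive"
  "$SPLUNK_HOME/bin/splunk" resync shcluster-replicated-config
}

# Usage, on each member that needs to be brought back in line:
#   BACKUP_DIR=/var/backups backup_and_resync
```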
Hopefully you won’t ever need to deal with search head cluster replication issues. However, if you do need to resync the configuration, it’s good to understand what exactly will be impacted so you’re not learning on a production environment. As you can see from this demonstration, running the splunk resync shcluster-replicated-config command is an effective method to get your search head cluster members back in line in the event of any replication issues or local configuration changes.