Control Your DNS: Using Splunk to See Suspicious DNS Activity
Building on the stock ES “Excessive DNS Queries” to look for suspicious volumes of DNS traffic
Starting from the assumption that a host suddenly spewing a ton of DNS queries can be an indicator of either compromise or misconfiguration, we need a mechanism to tell us when this happens. Here we will look at a method to find suspicious volumes of DNS activity while trying to account for normal activity.
Splunk ES comes with an “Excessive DNS Queries” search out of the box, and it’s a good starting point. However, the stock search only looks for hosts making more than 100 queries in an hour. This presents a couple of problems. For most large organizations with busy users, 100 DNS queries in an hour is an easy threshold to break. Throw in some server systems doing backups of remote hosts or MFPs trying to send scans to user machines, and we suddenly have thousands of machines breaking that 100/hour limit for DNS activity. This makes the search not only excessively noisy, but also very time-consuming to tune into something an analyst can act on or even want to look at.
What we really need is a way to look at how a machine typically behaves during its normal activity, and then alert us when that machine suddenly deviates from its average. The solution here is actually pretty simple once it’s written out in Splunk SPL (Search Processing Language). We need to do a few things:
- Determine a working time window to use in calculating the average
- Establish a baseline for a machine’s DNS activity
- Compare the established average against individual slices of the whole window
Let’s Look at a Real Life Example
Take an average workday of 8 hours. Oddly enough, there are also three 8-hour chunks in a day, so this is a safe window to use as a first draft. This window can be adjusted to better suit the needs of any specific environment. Make it wider for a less sensitive alert; make it narrower for a more sensitive alert. What we need to do is look at that eight-hour span and get a count of DNS events per host, per hour.
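To make the "count of DNS events per host, per hour" idea concrete outside of Splunk, here is a minimal Python sketch. It is an illustration of the counting logic only, not the SPL used in this article; the input shape (a list of timestamp and source-host pairs) and the function name `hourly_counts` are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(events):
    """Count DNS events per (src, hour_start) pair.

    `events` is an iterable of (timestamp, src) tuples. This input
    shape is an assumption for the sketch, not a Splunk format.
    """
    counts = Counter()
    for ts, src in events:
        # Floor the timestamp to the start of its hour.
        hour_start = ts.replace(minute=0, second=0, microsecond=0)
        counts[(src, hour_start)] += 1
    return counts


# Made-up sample events for illustration.
events = [
    (datetime(2018, 4, 30, 9, 5), "10.0.0.5"),
    (datetime(2018, 4, 30, 9, 40), "10.0.0.5"),
    (datetime(2018, 4, 30, 10, 1), "10.0.0.5"),
    (datetime(2018, 4, 30, 9, 15), "10.0.0.9"),
]
counts = hourly_counts(events)
```

Each key in `counts` corresponds to one line item of the kind the search produces: one host, one hour, one total.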
Search Part 1: Pulling The Basic Building Blocks
In this first part of our search, we are pulling our basic building blocks, including:
- Hosts making the DNS queries
- Original sourcetype of the DNS events (useful for later drilldown searching)
- Starting timestamp of each hour-window
- DNS server(s) handling the queries
- Total count for that query src within that hour
This will appear as shown below:
Search Part 2: Starting to Clean Up Results
From here, we need to do a couple things to clean up our results. The Network Resolution Data Model includes all DNS traffic that it sees, so if your infrastructure is properly set up, that can easily be millions of events a day. Most of that is expected, so we don’t need to care about it in this search. Let’s drop those out:
This takes our results from 2906 down to 2791. Not a huge improvement, but it does drop the things we know for sure are expected and not worth an alert trigger.
Next, we’ll constrain the search to hosts that have made more than 100 queries in any hour (like the original ES search does), and we’ll also drop out a common noisy host that does lots of DNS lookups – the backup server. This takes us again from 2791 down to 2315. A bit better, but still a big number. Notice that most of the counts on same-host line items are very similar, often identical. That is what will become our average.
In a later step, I’ll create a macro that we can use to drop out known query sources and another to drop out known domain lookups so that these exclusions don’t have to be contained in the search itself.
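The two filters described above (a minimum hourly count and an exclusion for a known noisy host) can be sketched in Python as follows. This is a hedged illustration, not the article's SPL: the row layout, the function name `filter_rows`, and the sample addresses are all hypothetical, and the threshold is interpreted here as a per-row filter on the hourly count.

```python
def filter_rows(rows, min_count=100, excluded_hosts=frozenset()):
    """Keep hourly rows above min_count whose src is not excluded.

    rows: list of (src, hour_label, count) tuples (assumed layout).
    """
    return [
        (src, hour, count)
        for src, hour, count in rows
        if count > min_count and src not in excluded_hosts
    ]


# Made-up rows; 10.0.0.20 stands in for the backup server.
rows = [
    ("10.0.0.5", "09:00", 2124),
    ("10.0.0.5", "10:00", 80),
    ("10.0.0.20", "09:00", 5000),
    ("10.0.0.9", "09:00", 150),
]
kept = filter_rows(rows, min_count=100, excluded_hosts={"10.0.0.20"})
```

Keeping the exclusion list as a parameter mirrors the idea of moving exclusions out of the search body, which the macros later in the article take care of.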
Search Part 3: Starting to Look at Averages
Now that we have our query counts broken out per host, per hour, we want a wider window over which to average them, which brings us back to the eight-hour timeframe I mentioned earlier. Let’s add that:
We use the bucket command here to give us an eight-hour window for each line item. From here, it’s an easy average function to see whether that host has deviated from its established pattern. We’ll use the eventstats command to compute an average per src and per bucket:
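For readers who think more easily in a general-purpose language, the bucket-then-average step can be sketched in Python. This only mirrors the idea (group hourly counts into eight-hour windows, then attach the per-host, per-window average to every row); it is not the article's SPL. The function names are hypothetical, and aligning windows to 00:00, 08:00, and 16:00 is an assumption of the sketch.

```python
from collections import defaultdict
from datetime import datetime


def window_start(hour_start, span_hours=8):
    """Floor a timestamp to the start of its span_hours window."""
    floored_hour = hour_start.hour - hour_start.hour % span_hours
    return hour_start.replace(hour=floored_hour, minute=0, second=0, microsecond=0)


def add_window_average(rows, span_hours=8):
    """Attach the per-host, per-window average count to each hourly row.

    rows: list of (src, hour_start, count) tuples (assumed layout).
    Returns a list of (src, hour_start, count, avg) tuples.
    """
    grouped = defaultdict(list)
    for src, hour_start, count in rows:
        grouped[(src, window_start(hour_start, span_hours))].append(count)
    averages = {key: sum(vals) / len(vals) for key, vals in grouped.items()}
    return [
        (src, hour_start, count, averages[(src, window_start(hour_start, span_hours))])
        for src, hour_start, count in rows
    ]


# Made-up hourly counts for one host.
rows = [
    ("10.0.0.5", datetime(2018, 4, 30, 8), 700),
    ("10.0.0.5", datetime(2018, 4, 30, 9), 750),
    ("10.0.0.5", datetime(2018, 4, 30, 10), 800),
    ("10.0.0.5", datetime(2018, 4, 30, 16), 100),
]
with_avg = add_window_average(rows)
```

Every original row is kept and simply gains an average column, so the hourly count and its baseline sit side by side for the comparison that follows.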
Search Part 4: Finding The Anomaly
And that brings us to the grand finale. We want to look for moments when a host has far outstripped its average. So let’s compare:
So here we see this one host in the 9am window of 04/30/2018 made 2124 DNS queries, whereas this machine’s average is only 775. What is this spike in activity? If we then add a field to display the queries made by the host, we can see that this machine belongs to a user who is in need of an ad-blocker. In this case, it shows us that the typical morning web browsing also indirectly pulls in thousands of other DNS queries in order to satisfy all the ad-network traffic that’s hosted on an average website. So, small side note: a corporate-wide ad-blocker policy not only will keep your user machines safer, it will lessen the load on your DNS infrastructure and make it easier to pinpoint these anomalies. I would not recommend adding this display function to an alert search, as it will make the notable event output /very/ large, and also slow the search down significantly. Use it as a visibility tool /after/ the alert trips.
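The comparison at the heart of this step is small enough to sketch in a few lines of Python. How far above its average a host must go before it counts as "far outstripped" is a tuning decision; the multiplier of 2 below is an assumption for illustration, as is the function name `is_anomalous`.

```python
def is_anomalous(count, average, multiplier=2.0):
    """Return True when an hourly count exceeds multiplier times the average.

    The default multiplier is an assumed value, not one taken from the
    article; tune it for your environment.
    """
    return average > 0 and count > multiplier * average


# The host discussed above: 2124 queries in one hour against an average of 775.
spike = is_anomalous(2124, 775)
# A count close to the average should not trip.
normal = is_anomalous(800, 775)
```

With these numbers, 2124 is about 2.7 times the average of 775, so it trips at a multiplier of 2 but would not at a multiplier of 3.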
Search Part 5: Adding Mechanisms for Ease of Maintenance
Now let’s add those macros I mentioned to help us clean up the search and more easily manage any necessary exclusions we want to account for. In this search, we have two things we want to exclude: known URL queries, and heavy query sources that we know are not an issue. To do this, we’ll create two macros and drop them into the tstats portion of our search:
The way these macros work is to take the value supplied in the config and substitute it into the field you specify in the macro parentheses when you call it in a search:
So we see in the above code that we’ve dropped out all the DNS queries and query-source hosts that we’d previously had to exclude directly in the search. We can also see below that in the macro config, we can easily add future entries that we may want to drop out of our results. Say we add a new primary DNS host or a new mail server. Just drop the IP into a new line item in the “known_dns_src” macro, and your correlation search is automagically updated. I’ve used this same approach to easily drop RFC1918 addresses out of searches when I’m looking for external address activity in a log type or datamodel. This method also carries the added benefit that it works in tstats searches as well as normal searches, so you’re less likely to trip up on the very specific logic formatting in tstats functions.
If you want to do something similar, that macro is:
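For anyone prototyping the same RFC1918 exclusion outside of Splunk, the check can be sketched in Python with the standard `ipaddress` module. This is not the Splunk macro, only an illustration of the logic it expresses; the three private ranges come from RFC 1918, and the function names are hypothetical.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = (
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
)


def is_rfc1918(address):
    """Return True if the address string falls inside an RFC 1918 range."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in RFC1918_NETWORKS)


def external_only(addresses):
    """Drop RFC 1918 addresses, keeping only external ones."""
    return [a for a in addresses if not is_rfc1918(a)]


# Made-up sample addresses.
external = external_only(["10.1.2.3", "8.8.8.8", "192.168.1.10", "172.32.0.1"])
```

Note that 172.32.0.1 is kept: the 172.16.0.0/12 range ends at 172.31.255.255, which is an easy boundary to get wrong when writing the exclusion by hand.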
Search Part 6: Looking Further Into Our Results
Looking back at our DNS traffic from this host, though, we see all that ad-network traffic. It’s not an emergency, but it is something we’ll want to look at cleaning up. But how can we parse through all that ad-network noise and see if there’s something more malicious hiding in there? Well, we often see machine-generated domains (MGDs) pop up in DNS traffic when something’s not right on a host. These MGDs are also typically quite long, so we can parse the queries here and look for long domain segments.
Take our existing search, and add some functions to the bottom:
What we do there is an mvexpand to split our previous multi-value query field into one line item per query. Then we use the truncate_domain macro to get a clean query domain without the URL characters. We also drop out Amazon and the Alexa top sites list to lessen the typical noise. The more you can lower your noise floor, the more you can focus on the signals. Now, we split the query up at the dots so we can see each URL segment. Stay with me. And /then/ we look for any segment that’s over 25 characters. Finally, we drop back in all of our fields, and we see that there were two domains out of the ~2100 results from this host that meet the long domain segment criteria. You can of course adjust that threshold to something wider or narrower to fit your particular environment.
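The segment-length check described above is easy to prototype in Python: split each queried domain at the dots and keep any segment longer than the threshold. This is a sketch of the idea only, not the SPL from the search; the function name `long_segments` and the sample domains are made up for the example.

```python
def long_segments(domain, max_len=25):
    """Return the dot-separated segments of a domain longer than max_len."""
    segments = domain.lower().rstrip(".").split(".")
    return [seg for seg in segments if len(seg) > max_len]


def flag_long_domains(domains, max_len=25):
    """Keep only the domains that contain at least one over-long segment."""
    return [d for d in domains if long_segments(d, max_len)]


# Made-up queries: one ordinary, one with a 30-character segment.
queries = [
    "www.example.com",
    "a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5.example.net",
]
flagged = flag_long_domains(queries)
```

Raising or lowering `max_len` plays the same role as adjusting the 25-character threshold in the search.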
Just a final detail on the search – as this uses tstats out of the datamodel, then does rather basic manipulations at search time, it’s a fairly lightweight cost to your ES environment. In our final example, my search log shows:
This makes it not only useful as an alert, but also cheap enough to run as needed for in-depth investigations.
So now you should be able to easily spot suddenly voluminous DNS traffic from your internal hosts. Go give ’em a query response.
A tiny bit of background info:
About Hurricane Labs
Hurricane Labs is a dynamic Managed Services Provider that unlocks the potential of Splunk and security for diverse enterprises across the United States. With a dedicated, Splunk-focused team and an emphasis on humanity and collaboration, we provide the skills, resources, and results to help make our customers’ lives easier.
For more information, visit www.hurricanelabs.com and follow us on Twitter @hurricanelabs.