Cloud services have made some of these Splunk queries a little more unwieldy, but given time and enough filtering of benign results, they still have promise.
Visualization to detect higher than average number of DNS queries
By and large, my results were inconclusive. How so? Well, let’s start with the easier queries to pick apart: the Splunk queries that simply sum up the number of DNS queries (in total and by type).
The following is a DNS query that will count out the number of queries for the time period you are querying, and time-chart it.
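The original query isn’t reproduced here, but a minimal sketch of it looks something like the following. The `dns` index name and `query` field are assumptions for illustration; substitute whatever index, sourcetype, and field names your environment actually uses.

```
index=dns
| timechart span=10s count AS dns_query_count
```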
The purpose of this query is to determine times of the day in which there is a noticeable uptick in DNS queries being made. This could be indicative of DNS tunneling, or other malware infections, and probably warrants investigation.
Issue 1: Scale
The problem with attempting to use this query on an enterprise network is scale: Steve’s lab environment, impressive though it may be, is nowhere near the size of an enterprise network, so you have much more data to sift through. So much so that the original query from Steve’s paper, with the span option set to 1s, causes the timechart to fail for the default query period of 24 hours because there is too much data to plot. I modified the command to use a span of 10s, but that may or may not be optimal.
I’m not a Splunk expert, I just play one on TV. Also, enterprise networks vary in size and complexity so your mileage will definitely vary there.
Issue 2: Baselines
The other issue you run into is that this type of query assumes you have baselines for what a normal number of DNS queries looks like at various times of the day, which is almost never the case. So for this query to be of any practical use in the real world, you need to run it multiple times over multiple days (or multiples of whatever time period you ran the query for) to establish that baseline, then compare and contrast.
The first time you run this query you might see a humongous spike at say, 8am, or 1pm, or 5pm. These are normally times in which people are browsing the internet before work, at lunch, or at the end of the workday.
Are they normal? We have no idea, because we need a baseline or multiple results to compare and contrast to. What about huge spikes that occur after work hours? Are they necessarily bad? No idea. It could be automated patching or other maintenance jobs running or it could be bad traffic. We have no frame of reference without a proper baseline.
Detection of DNS queries with high levels of entropy (indication of DGA)
Here is the other statistical query from Steve’s research paper I utilized:
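That query isn’t shown here either, but a sketch of its general shape, again assuming a hypothetical `dns` index and a `record_type` field holding the query type, would be:

```
index=dns
| stats count BY record_type
| sort -count
```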
The purpose of this query is to show you a breakdown count of DNS queries by type (e.g. A, AAAA, NULL, TXT, etc.) for a particular time frame. The goal would be to detect malicious DNS and/or DNS tunneling by noticing a large spike in uncommon DNS query type(s).
This query has the same problem as the previous one in that it assumes you have baselines established and know what looks abnormal on your network. Without a baseline, you have no frame of reference. On the plus side, however, these two queries can be used in conjunction pretty effectively.
As an example, let’s say you ran the previous query and noticed a spike of DNS queries at 7pm, well after everyone normally leaves the office. You could then run the query above with a focused timeframe of, say, 6:50pm to 7:10pm to get a breakdown of DNS queries by type for that window and see if anything stands out. A large number of MX record queries when there is nobody around to be sending email would be pretty fishy and warrant further investigation, right?
Tuning for better results
Now that I’ve talked about some of the statistical analysis queries, let’s look at one of the queries that analyzes DNS queries and sorts them by an entropy threshold:
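A sketch of what such a query looks like, assuming the URL Toolbox app is installed (it provides the `ut_shannon` macro, which writes its score to a field named `ut_shannon`) and the same hypothetical `dns` index and `query` field as before:

```
index=dns
| `ut_shannon(query)`
| where ut_shannon >= 2.5
| table _time, query, ut_shannon
| sort -ut_shannon
```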
The goal of this query is to analyze DNS queries using the URL toolbox Shannon Entropy calculator to determine a given query’s entropy scoring. From there, we display results with an entropy score of 2.5 or higher.
The higher the entropy score, the more likely a given domain name was algorithmically generated. The algorithms that produce these computer-generated domain names are referred to as Domain Generation Algorithms (DGAs), and some (a lot) of malware strains use DGAs as either a primary or fallback method for reaching attacker-controlled command-and-control servers to receive instructions/commands.
With instant gratification and the bottom line being a huge driving force in IT departments worldwide, cloud services have become extremely popular in recent years. Most major websites utilize cloud services to some extent; if not to host something then for redundancy via CDNs (content delivery networks).
The problem we run into in attempting to hunt for malicious domain names by their Shannon Entropy score is that a lot of these cloud services are generating domain names that have a high entropy score as well. This causes the query above to end up with a massive amount of benign results.
In addition to high-entropy domain names being “the new normal” thanks to cloud services, something the original research paper didn’t account for was the sheer volume of data that would need to be sifted through. The number of DNS queries a given network makes on a daily basis varies greatly depending on a number of factors, but to say that you’d have to sift through billions of DNS queries for a single 24-hour period is not at all outlandish. This is a huge amount of data for a single analyst.
The only viable solution I can see for making this query usable is to reduce the amount of data it returns. You can do this by reducing the time frame for a query (e.g. looking at 8-hour chunks of DNS data as opposed to an entire 24 hours at a time), and/or filtering out the domains, subdomains, or TLDs you do not want to see results from.
Here is a modified query you might want to consider instead:
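A sketch of such a modified query, with a placeholder exclusion list (the patterns shown are examples only) on top of the same assumed index and field names:

```
index=dns NOT (query IN ("*.local", "*.internal_domain", "*.cdn_provider.*"))
| dedup query
| `ut_shannon(query)`
| where ut_shannon >= 2.5
| table _time, query, ut_shannon
| sort -ut_shannon
```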
The query above differs in that it includes a portion where we can provide a comma-separated list of domains, TLDs, and/or substrings to filter the results we get back. Additionally, we use the dedup command to reduce the number of duplicate results.
Let’s look at an example:
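The same sketch with the filter list filled in (`.corporate_domain` is a stand-in for your organization’s own domain):

```
index=dns NOT (query IN ("*.LOCAL", "*.localdomain", "*.corporate_domain", "*.cloudflare.*"))
| dedup query
| `ut_shannon(query)`
| where ut_shannon >= 2.5
| table _time, query, ut_shannon
```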
This query filters out all results that end in .LOCAL, .localdomain, or .corporate_domain, and queries containing “.cloudflare.”. It can easily be modified to ONLY look at queries for a particular domain or containing a particular string by removing the “NOT” modifier and inputting the domains/subdomains/substrings you want to search for.
While the modified queries and query ideas I have presented will result in some of the data not being analyzed, it makes the query usable, a bit more scalable, and allows for filtering out benign data that analysts aren’t interested in. Given the choice between a query that returns unmanageable amounts of data, and one that returns something analysts can sift through in a more reasonable amount of time, I decided that the lesser of two evils is usable data. This leads us to our next query, which is somewhat related.
Detection of DNS queries with abnormally long length (3 or more times the average length)
The purpose of this query is to collect DNS queries, evaluate the length of each query, determine the average length, and return any results that are three or more standard deviations above that average. Much like the previous query measuring entropy, this Splunk query suffers from the fact that obnoxiously long domain names for cloud services and CDNs are the new normal, and from the sheer volume of data that has to be sifted through as a result.
Much like the query before it, while it has weaknesses due to the make-up of internet traffic today looking like a complete and utter mess by default (old man shaking fist at cloud), you can limit the scope of the mess you are looking at by reducing the timeframe in which you are searching for data and/or adding in a filter clause like so:
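A sketch of the length-based query with a filter clause bolted on (the index, field names, and filter values are again placeholders to adapt to your environment):

```
index=dns NOT (query IN ("*.LOCAL", "*.localdomain", "*.corporate_domain"))
| eval query_length=len(query)
| eventstats avg(query_length) AS avg_length, stdev(query_length) AS stdev_length
| where query_length > (avg_length + (stdev_length * 3))
| table _time, query, query_length
| sort -query_length
```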
Tuning for better results
This allows you to filter out particular domains, subdomains, and/or TLDs from the query. Additionally, just like the previous query, we can invert the logic by removing the “NOT” operator and focus only on domains containing a given string and/or from a given TLD.
That brings us to a final query that I’ve modified.
Destination DNS servers in the “outside” firewall zone
The purpose of this query is to gather a collection of IP addresses to which your firewalls have allowed traffic on port 53, TCP or UDP (remember that DNS uses both). We want to sort these IP addresses, count how many times the firewall has allowed communication to a given IP address, and show all results with a count greater than 100.
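A sketch of such a firewall query, assuming an index named `firewall` and field names like `action`, `dest_zone`, `dest_port`, and `dest_ip` (all assumptions; map these to your firewall vendor’s actual fields):

```
index=firewall action=allowed dest_zone=outside dest_port=53
| stats count BY dest_ip
| where count > 100
| sort -count
```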
Due to the nature of DNS tunneling, you’ll easily see hundreds or thousands of DNS requests for a DNS tunneling endpoint.
Tuning for better results
Of course, you’ll probably see hundreds or thousands of hits for DNS requests to your company’s primary DNS resolvers, and if your firewall zones are… unique in that the “outside” zone (or whatever the zone for your network’s perimeter is named) includes DNS traffic to internal IP addresses, those internal resolvers will clutter the results as well.
This can be resolved by modifying the query ever so slightly:
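One way to do that, sketched with the same assumed field names, is to use `cidrmatch()` to drop RFC 1918 (internal) destination addresses:

```
index=firewall action=allowed dest_zone=outside dest_port=53
| where NOT (cidrmatch("10.0.0.0/8", dest_ip) OR cidrmatch("172.16.0.0/12", dest_ip) OR cidrmatch("192.168.0.0/16", dest_ip))
| stats count BY dest_ip
| where count > 100
| sort -count
```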
You may also want to change the “count>100” portion to a larger number, depending on how large your network is, to filter out some of the results. The resulting list of IP addresses could be exported and fed to DNS enumeration tools and/or threat intelligence sources to determine whether the communication was potentially malicious.
Until next time
This is all I have for now. I’d like to thank SANS and Steve Jaworski for writing, producing, and hosting the original work this post’s research was based on, and for making that report freely available to security analysts everywhere.