How to Use the 2019 CPTC Security Dataset in Splunk

Published on: January 13th, 2020

In my previous blog post series, I shared my experiences helping lead the National Collegiate Penetration Testing Competition (National CPTC). One of the most exciting contributions of this event, beyond the real-world experiences of the competitors, is the data that we make available for research.

Since we’re focused on Splunk here at Hurricane Labs, we leverage this tool to collect as much data as we can to provide an insight into what teams are doing. Now we’re making all of the data we’ve collected available so that you can do interesting things with it!

For more information about the 2019 CPTC competition environment, please review the following: CPTC Review Part 1 and CPTC 2019–Finals Review.

This data is made available as frozen Splunk buckets, which can be imported into a Splunk instance and searched. We’ve tried to break this data up by log type wherever possible to help you pinpoint the data you’re interested in exploring without needing to download unnecessary data.

Getting the Data

We’ve made the dataset available on RIT’s mirrors, http://mirrors.rit.edu/cptc/2019/. This data is generally made available for both the regional and national events, and is organized by the type of data contained in the buckets (the Splunk sourcetype).

The data is organized into two sets, Regionals and Nationals. Each directory contains a number of tarballs, each associated with an index in the CPTC Splunk environment. See the list at the bottom of this post to get an idea of what sourcetypes are included in each index.

Using the Data

Once you’ve identified the indexes you would like to download, you will need to import these into your own Splunk instance for searching. This import procedure is referred to as “thawing” the frozen buckets.

For this example, I will use the linux_bashhistory index from the National competition. This is a good index to start with, since it’s quite small in size.

Start by installing Splunk on a Linux machine or using an existing Splunk instance. If you’re looking for a little extra help with this, you can check out the quick tutorial I’ve built out that walks you through the steps of installing Splunk on a Linux VM.

You can also deploy one of Splunk’s docker images or install Splunk on your PC or Mac, but the following steps will likely need some modification to work in those cases.

Create an Index

In your Splunk instance, create an index. This can be done in the webui by going to Settings -> Indexes -> New Index. Be sure that the index path specified has enough space to store the data you will be thawing. In this example, I’ll use the defaults, which will result in the thawed path being located at $SPLUNK_DB/INDEX_NAME/thaweddb. On a typical Splunk installation on Linux for this data, this will correspond to /opt/splunk/var/lib/splunk/linux_bashhistory/thaweddb.
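If you prefer the CLI to the web UI, the index can also be created from the command line. This is a sketch assuming a default Linux install under /opt/splunk and the linux_bashhistory index used in this example:

```shell
# Create the index (equivalent to Settings -> Indexes -> New Index in the web UI).
/opt/splunk/bin/splunk add index linux_bashhistory

# With the defaults, the thawed path for this index will be:
#   /opt/splunk/var/lib/splunk/linux_bashhistory/thaweddb
```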

Place the Data in the Thawed Directory

First, transfer the .tgz file downloaded from mirrors.rit.edu onto your Splunk system and untar it. Each of these files will untar into a folder with the index name.


You can use the tar -zxvf linux_bashhistory.tgz command to untar this file in the current directory. Once this is done, you'll see a linux_bashhistory directory containing the individual bucket directories.
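As a sketch, the untar and verification steps look like this (the exact bucket directory names in your download will vary):

```shell
# Untar the downloaded archive in the current directory.
tar -zxvf linux_bashhistory.tgz

# The extracted folder contains the frozen buckets. Splunk names these
# directories db_<latestTime>_<earliestTime>_<id> and rb_<latestTime>_<earliestTime>_<id>.
ls linux_bashhistory/
```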

The db_ and rb_ directories represent the individual buckets containing data that will need to be thawed. You will want to move all of these into the thawed path you defined earlier when you created the index in Splunk.

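A sketch of the move, assuming the default thawed path from the index created earlier (adjust the paths to match your installation):

```shell
# Move every db_* and rb_* bucket into the index's thawed path.
mv linux_bashhistory/db_* linux_bashhistory/rb_* \
   /opt/splunk/var/lib/splunk/linux_bashhistory/thaweddb/
```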

Thawing the Data

Next, we need to tell Splunk to re-index, or thaw, the data you just copied over; this is not an automatic process. Since there are multiple buckets, it is best to handle this with a script, such as the one available on Splunk Answers.

Note that the index name can be specified explicitly in that script's _thawedDBPath variable if you're only thawing a single index.

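The core of the thaw procedure is running splunk rebuild against each bucket in the thawed directory. Here is a minimal sketch of such a script (the filename and paths are assumptions; adapt them to your environment):

```shell
#!/bin/bash
# Sketch of a thaw script: rebuild every frozen bucket that was copied
# into an index's thaweddb directory so Splunk can search it.
SPLUNK_HOME=/opt/splunk
THAWED_DB_PATH="$SPLUNK_HOME/var/lib/splunk/linux_bashhistory/thaweddb"

for bucket in "$THAWED_DB_PATH"/db_* "$THAWED_DB_PATH"/rb_*; do
    [ -d "$bucket" ] || continue          # skip non-matching globs
    echo "Rebuilding $bucket"
    "$SPLUNK_HOME/bin/splunk" rebuild "$bucket"
done
```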

Once this script is in place, run it on the Splunk server.

Once this process is finished, you must restart Splunk in order for the data to be searchable.
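The restart itself is the standard Splunk restart (path assumes a default Linux install):

```shell
# Restart Splunk so the newly rebuilt buckets become searchable.
/opt/splunk/bin/splunk restart
```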

Searching the Data

Once this process is completed and Splunk has been restarted, your data will now be searchable in Splunk. Be sure to pick a time range that covers the periods associated with the event. For this example, searching All time works well since there are not too many events.
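For instance, a quick sanity check against the thawed index (using the bash_history sourcetype listed at the bottom of this post) might look like:

```
index=linux_bashhistory sourcetype=bash_history
| stats count by host
| sort - count
```

This shows which team hosts generated the most bash history events.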

Splunk Apps

In order to get the most value from this Splunk data, you may need to install Splunk apps into your Splunk instance. Both the Splunk Add-on for Microsoft Windows and the Splunk Add-on for Unix and Linux are helpful to have installed when working with this data. Other data sources will require other apps to have working field extractions.

How the Data is Structured

In the National CPTC events, each team is provided with an identical, dedicated environment. During the 2019 season, these were separate instances in Google Cloud.

Team Assignments

Teams competed in two events: regionals (October 12 & 13, 2019) and nationals (November 22-24, 2019). There were six regional events: North Eastern, South Eastern, New England, Central, Western, and International. Each regional event had up to 10 teams. The winning teams from each region (6 total) were invited to compete in the National competition, along with the next 4 highest-ranked teams at large.

Event Ranges

Events were exported for the time periods of the regional and national competitions. The following epoch timestamps were used for the data export:

Nationals:

  • _earliestTime=1574463600
  • _latestTime=1574582400

Regionals:

  • _earliestTime=1570842000
  • _latestTime=1570964400

Because of limitations in how Splunk exports frozen data, you may see events that occur outside of these time windows. These should be ignored, as they are not necessarily complete.
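If you'd like to enforce these windows in your own searches, the epoch values can be passed directly as time modifiers. For example, to restrict a search to the Nationals export window:

```
index=linux_bashhistory earliest=1574463600 latest=1574582400
| timechart span=1h count
```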

Hostnames

All hostnames have a prefix and team number. The prefix represents the region, and the team number represents an individual team in that region. A few examples of hostnames:

  • nationals-t2-vdi-kali01: A Linux VDI instance for Team 2 in the Nationals event
  • newengland-t6-vdi-kali01: A Linux VDI instance for Team 6 in the New England regional
  • nationals-t1-bank-core-01: The DinoBank core application host for Nationals Team 1

The host field in Splunk is set when the data is collected, based on the region and team number. There is no local indication on the hosts themselves about their region. This means that a host showing up as host=nationals-t1-bank-core-01 in Splunk is actually running with the host name on the local operating system set to bank-core-01.

This naming structure allows you to search the data in various ways, depending on if you are looking for everything from a certain region, team, or type of host. Here are some sample searches you may want to try:

  • host=nationals-t* – data from all teams that participated in Nationals
  • host=nationals-t1-* – data from all of the hosts for Nationals Team 1 (note the trailing dash, t1* would include both Team 1 and Team 10)
  • host=*kali0* – data from all the Kali VDI instances for all teams (note, Splunk searches with a leading * are not very efficient, so avoid them if possible)

You may also note other hosts that do not follow these naming conventions. Many of these represent build and test environments which were not used by students. The following prefixes are the only ones that will contain student competitor data:

Regional host prefixes:

  • northeastern
  • southeastern
  • newengland
  • central
  • western
  • international

National event prefixes:

  • nationals

Team numbers range from 0 to 10. For regionals, not all regions have 10 teams.

What We Collected

We attempted to collect data from as many locations as possible across all of the student systems (both VDI/sources and targets) throughout the environment. The types of data available will vary based on the operating systems and roles of each host.

OS-Independent Sources

The following logs are generally available for both Windows and Linux hosts:

  • IDS events (ids index): suricata:http, suricata:stats, suricata:alert, suricata:tls, suricata:ssh
  • Splunk App for Stream events (stream index): stream:dns, stream:udp, stream:tcp, stream:http

Windows Sources

Nearly all data types available for collection by the Splunk Add-on for Microsoft Windows (https://splunkbase.splunk.com/app/742/) were collected, with collection intervals tuned to provide a higher sampling rate. All of the Windows-specific indexes are prefixed with index=win*, and Windows Event Log data sources log to index=winevent*.

The Windows data available generally includes the following:

  • Windows Perfmon data
  • Windows Event Log (Security, System, Application, etc.)
  • WinNetMon
  • WinHostMon
  • WinRegistry
  • Windows Sysmon
  • Powershell Transcript Logs

Linux Sources

Nearly all data types available for collection by the Splunk Add-on for Unix and Linux were collected, with collection intervals tuned to provide a higher sampling rate. All of the Linux-specific indexes are prefixed with index=linux*.

The Linux data available generally includes the following:

  • Bash history
  • Common diagnostic tools: df, ps, top, lsof, netstat, vmstat, who
  • Open ports
  • Network and interface information
  • Package information
  • Contents of the /etc and /tmp directories

Sourcetypes by Index

The following sourcetypes are available in each of the indexes available for download in this dataset:

  • ids: suricata:alert, suricata:http, suricata:smtp, suricata:ssh, suricata:stats, suricata:tls
  • linux_bandwidth: bandwidth
  • linux_bashhistory: bash_history
  • linux_cpu: cpu
  • linux_df: df
  • linux_fim: linux:fim:etc, linux:fim:tmp
  • linux_hardware: hardware
  • linux_interfaces: interfaces
  • linux_iostat: iostat
  • linux_last: lastlog
  • linux_lsof: lsof
  • linux_netstat: netstat
  • linux_packages: package
  • linux_passwd: Unix:UserAccounts
  • linux_ports: Unix:ListeningPorts, openPorts
  • linux_protocol: protocol
  • linux_ps: ps
  • linux_services: Unix:Service
  • linux_sshd: Unix:SSHDConfig
  • linux_time: time
  • linux_top: top
  • linux_uptime: Unix:Uptime
  • linux_version: Unix:Version
  • linux_vmstat: vmstat
  • linux_vsftp: Unix:VSFTPDConfig
  • linux_who: usersWithLoginPrivs, who
  • stream: stream:dns, stream:http, stream:tcp, stream:udp
  • summary: stash
  • threat_activity: stash
  • windows_activedirectory: MSAD:NT6:Health, MSAD:NT6:Replication, MSAD:NT6:SiteInfo, Powershell:ScriptExecutionErrorRecord, Powershell:ScriptExecutionSummary
  • windows_admon: ActiveDirectory
  • windows_cmdhist: XmlWinEventLog, powershell:transcript
  • windows_hostmon: WinHostMon
  • windows_network: WinNetMon
  • windows_perfmon: PerfmonMk:CPU, PerfmonMk:DFS_Replicated_Folders, PerfmonMk:DNS, PerfmonMk:LogicalDisk, PerfmonMk:Memory, PerfmonMk:NTDS, PerfmonMk:Network, PerfmonMk:Network_Interface, PerfmonMk:PhysicalDisk, PerfmonMk:Process, PerfmonMk:Processor, PerfmonMk:ProcessorInformation, PerfmonMk:System
  • windows_print: WinPrintMon
  • windows_regmon: WinRegistry
  • windows_sysmon: XmlWinEventLog:Microsoft-Windows-Sysmon/Operational
  • winevent_application: WinEventLog
  • winevent_dfsreplication: WinEventLog
  • winevent_directoryservice: WinEventLog
  • winevent_dns: WinEventLog
  • winevent_security: WinEventLog
  • winevent_system: WinEventLog

Questions and Feedback

Please reach out to the CPTC research distribution list (research@nationalcptc.org) for further information about this dataset.

Licensing

This dataset is being made freely available to support various educational and research initiatives. While you are free to use this data for your own purposes, we ask that this dataset be attributed to the National Collegiate Penetration Testing Competition (National CPTC) in any publications or references.

About Hurricane Labs

Hurricane Labs is a dynamic Managed Services Provider that unlocks the potential of Splunk and security for diverse enterprises across the United States. With a dedicated, Splunk-focused team and an emphasis on humanity and collaboration, we provide the skills, resources, and results to help make our customers’ lives easier.

For more information, visit www.hurricanelabs.com and follow us on Twitter @hurricanelabs.