How to Use the 2019 CPTC Security Dataset in Splunk
In my previous blog post series, I shared my experiences helping lead the National Collegiate Penetration Testing Competition (National CPTC). One of the most exciting contributions of this event, beyond the real-world experience it gives competitors, is the data that we make available for research.
Since we're focused on Splunk here at Hurricane Labs, we leverage this tool to collect as much data as we can to provide insight into what teams are doing. Now we're making all of the data we've collected available so that you can do interesting things with it!
For more information about the 2019 CPTC competition environment, please review the following: CPTC Review Part 1 and CPTC 2019–Finals Review.
This data is made available as frozen Splunk buckets, which can be imported into a Splunk instance and searched. We’ve tried to break this data up by log type wherever possible to help you pinpoint the data you’re interested in exploring without needing to download unnecessary data.
Getting the Data
We’ve made the dataset available on RIT’s mirrors, http://mirrors.rit.edu/cptc/2019/. This data is generally made available for both the regional and national events, and is organized by the type of data contained in the buckets (the Splunk sourcetype).
The data is organized into two sets, Regionals and Nationals. Each directory contains a number of tarballs, each associated with an index in the CPTC Splunk environment. See the list at the bottom of this post to get an idea of what sourcetypes are included in each index.
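For example, downloading the Nationals bash history tarball with wget might look like the sketch below; the exact directory and file names are assumptions here, so browse the mirror listing first to confirm them.

```bash
# hypothetical path; verify the real directory layout at mirrors.rit.edu first
wget http://mirrors.rit.edu/cptc/2019/nationals/linux_bashhistory.tgz
```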
Using the Data
Once you’ve identified the indexes you would like to download, you will need to import these into your own Splunk instance for searching. This import procedure is referred to as “thawing” the frozen buckets.
For this example, I will use the linux_bashhistory index from the National competition. This is a good index to start with, since it’s quite small in size.
Start by installing Splunk on a Linux machine or using an existing Splunk instance. If you’re looking for a little extra help with this, you can check out the quick tutorial I’ve built out that walks you through the steps of installing Splunk on a Linux VM.
You can also deploy one of Splunk’s docker images or install Splunk on your PC or Mac, but the following steps will likely need some modification to work in those cases.
Create an Index
In your Splunk instance, create an index. This can be done in the web UI by going to Settings -> Indexes -> New Index. Be sure that the index path specified has enough space to store the data you will be thawing. In this example, I'll use the defaults, which will result in the thawed path being located at $SPLUNK_DB/INDEX_NAME/thaweddb. On a typical Linux installation of Splunk, this will correspond to /opt/splunk/var/lib/splunk/linux_bashhistory/thaweddb.
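If you prefer the command line, a minimal sketch of the same step using the Splunk CLI (assuming a default /opt/splunk install):

```bash
# create the index from the CLI instead of the web UI
/opt/splunk/bin/splunk add index linux_bashhistory
```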

Place the Data in the Thawed Directory
First, transfer the .tgz file downloaded from mirrors.rit.edu onto your Splunk system and untar it; each of these files will untar into a folder named after the index. You can use the tar -zxvf linux_bashhistory.tgz command to untar this file in the current directory.
Inside the extracted folder, the db_ and rb_ directories represent the individual buckets containing data that will need to be thawed. Move all of these into the thawed path you defined earlier when you created the index in Splunk, as shown below.
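A minimal sketch of these two steps, assuming the default paths used above:

```bash
# extract the tarball into the current directory
tar -zxvf linux_bashhistory.tgz

# move the buckets into the index's thaweddb path
# (drop the rb_ glob if your tarball contains only db_ buckets)
mv linux_bashhistory/db_* linux_bashhistory/rb_* \
   /opt/splunk/var/lib/splunk/linux_bashhistory/thaweddb/
```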
Thawing the Data
Next, we need to tell Splunk to re-index, or thaw, the data you just copied over–this is not an automatic process. Since there are multiple buckets, it is best to handle this with a script, such as the one available here on Splunk Answers.
The general approach is sketched below for reference; note that the index name can be specified explicitly in the thawed path if you're only thawing a single index.
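This is a minimal sketch rather than the exact script from Splunk Answers; the paths and index name are assumptions for this example.

```bash
#!/bin/bash
# Sketch of a bulk thaw: run "splunk rebuild" on every bucket in thaweddb.
# SPLUNK_HOME and INDEX are assumptions; adjust for your environment.
SPLUNK_HOME="/opt/splunk"
INDEX="linux_bashhistory"
THAWED_DB_PATH="$SPLUNK_HOME/var/lib/splunk/$INDEX/thaweddb"

for bucket in "$THAWED_DB_PATH"/db_* "$THAWED_DB_PATH"/rb_*; do
    [ -d "$bucket" ] || continue
    # rebuild regenerates the metadata that makes a thawed bucket searchable
    "$SPLUNK_HOME/bin/splunk" rebuild "$bucket" "$INDEX"
done
```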
Once this script is in place, run it against your thawed buckets. When it finishes, you must restart Splunk in order for the data to be searchable.
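A quick sketch of those two steps (the script filename here is a placeholder):

```bash
# run the thaw script, then restart Splunk so the data becomes searchable
chmod +x thaw-buckets.sh
./thaw-buckets.sh
/opt/splunk/bin/splunk restart
```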
Searching the Data
Once this process is completed and Splunk has been restarted, your data will be searchable in Splunk. Be sure to pick a time range that covers the periods associated with the event; for this example, searching All time works well, since there are not too many events.
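If you'd rather sanity-check from the command line, a CLI search sketch (default install path assumed; earliest=0 gives All time):

```bash
# count thawed events per host across all time
/opt/splunk/bin/splunk search 'index=linux_bashhistory earliest=0 | stats count by host'
```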

Splunk Apps
In order to get the most value from this Splunk data, you may need to install Splunk apps into your Splunk instance. Both the Splunk Add-on for Microsoft Windows and the Splunk Add-on for Unix and Linux are helpful to have installed when working with this data; other data sources will require other apps for working field extractions.
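A sketch of installing both add-ons from their downloaded Splunkbase packages; the filenames are placeholders for whatever versions you download:

```bash
# install the add-ons from their Splunkbase tarballs, then restart
/opt/splunk/bin/splunk install app splunk-add-on-for-microsoft-windows_<version>.tgz
/opt/splunk/bin/splunk install app splunk-add-on-for-unix-and-linux_<version>.tgz
/opt/splunk/bin/splunk restart
```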
How the Data is Structured
In the National CPTC events, each team is provided with an identical, dedicated environment. During the 2019 season, these were separate instances in Google Cloud.
Team Assignments
Teams competed in two events: regionals (October 12 & 13, 2019) and nationals (November 22-24, 2019). There were six regional events: North Eastern, South Eastern, New England, Central, Western, and International, each with up to 10 teams. The winning team from each region (6 total) was invited to compete in the National competition, along with the next 4 highest-ranked teams at large.
Event Ranges
Events were exported for the time periods of the regional and national competitions. The following epoch timestamps were used for the data export:
Nationals:
- _earliestTime=1574463600
- _latestTime=1574582400
Regionals:
- _earliestTime=1570842000
- _latestTime=1570964400
Because frozen buckets can only be exported whole, you may see events in the data that fall outside of these time windows. These should be ignored, as they are not necessarily complete.
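To double-check what these windows correspond to, you can convert the epochs to human-readable UTC with GNU date:

```bash
# convert the export epochs to human-readable UTC timestamps
date -u -d @1574463600   # Nationals earliest
date -u -d @1574582400   # Nationals latest
date -u -d @1570842000   # Regionals earliest
date -u -d @1570964400   # Regionals latest
```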
Hostnames
All hostnames have a prefix and team number. The prefix represents the region, and the team number represents an individual team in that region. A few examples of hostnames:
- nationals-t2-vdi-kali01: A Linux VDI instance for Team 2 in the Nationals event
- newengland-t6-vdi-kali01: A Linux VDI instance for Team 6 in the New England regional
- nationals-t1-bank-core-01: The DinoBank core application host for Nationals Team 1
The host field in Splunk is set when the data is collected, based on the region and team number. There is no local indication on the hosts themselves about their region. This means that a host showing up as host=nationals-t1-bank-core-01 in Splunk is actually running with the host name on the local operating system set to bank-core-01.
This naming structure allows you to search the data in various ways, depending on if you are looking for everything from a certain region, team, or type of host. Here are some sample searches you may want to try:
- host=nationals-t* – data from all teams that participated in Nationals
- host=nationals-t1-* – data from all of the hosts for Nationals Team 1 (note the trailing dash, t1* would include both Team 1 and Team 10)
- host=*kali0* – data from all the Kali VDI instances for all teams (note, Splunk searches with a leading * are not very efficient, so avoid them if possible)
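Combining a host pattern with the export epochs from earlier keeps out-of-window events out of your results; for example, from the CLI (paths assumed as before):

```bash
# Nationals Team 1 bash history, bounded to the Nationals export window
/opt/splunk/bin/splunk search \
    'index=linux_bashhistory host=nationals-t1-* earliest=1574463600 latest=1574582400'
```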
You may also notice other hosts that do not follow these naming conventions. Many of these represent build and test environments which were not used by students. The following prefixes are the only ones that contain student competitor data:
Regional host prefixes:
- northeastern
- southeastern
- newengland
- central
- western
- international
National event prefixes:
- nationals
Team numbers range from 0 to 10; for regionals, not all regions had 10 teams.
What We Collected
We attempted to collect data from as many locations as possible across all of the student systems (both VDI/sources and targets) throughout the environment. The types of data available will vary based on the operating systems and roles of each host.
OS-Independent Sources
The following logs are generally available for both Windows and Linux hosts:
- IDS events (ids index): suricata:alert, suricata:http, suricata:smtp, suricata:ssh, suricata:stats, suricata:tls
- Splunk App for Stream events (stream index): stream:dns, stream:udp, stream:tcp, stream:http
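A quick way to confirm which of these sourcetypes made it into your instance (CLI sketch, same assumed paths):

```bash
# break down event counts by index and sourcetype for the OS-independent data
/opt/splunk/bin/splunk search '(index=ids OR index=stream) earliest=0 | stats count by index, sourcetype'
```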
Windows Sources
Nearly all data types available for collection by the Splunk Add-on for Microsoft Windows (https://splunkbase.splunk.com/app/742/) were collected, with collection intervals shortened to provide an increased sampling rate. All of the Windows-specific indexes are prefixed with index=win*, and Windows Event Log data sources log to index=winevent*.
The Windows data available generally includes the following:
- Windows Perfmon data
- Windows Event Log (Security, System, Application, etc.)
- WinNetMon
- WinHostMon
- WinRegistry
- Windows Sysmon
- PowerShell transcript logs
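As one example of putting this data to work, Sysmon process-creation events (EventCode 1) are a good starting point; this sketch assumes the Windows add-on is installed so the EventCode field is extracted:

```bash
# first 20 Sysmon process-creation events for Nationals Team 1
/opt/splunk/bin/splunk search \
    'index=windows_sysmon host=nationals-t1-* EventCode=1 earliest=0 | head 20'
```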
Linux Sources
Nearly all data types available for collection by the Splunk Add-on for Unix and Linux were collected, with collection intervals shortened to provide an increased sampling rate. All of the Linux-specific indexes are prefixed with index=linux*.
The Linux data available generally includes the following:
- Bash history
- Common diagnostic tools: df, ps, top, lsof, netstat, vmstat, who
- Open ports
- Network and interface information
- Package information
- Contents of the /etc and /tmp directories
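For example, to see the listening ports recorded on one team's bank core host (host name taken from the examples above; CLI sketch):

```bash
# first 20 listening-port snapshots from the DinoBank core host for Nationals Team 1
/opt/splunk/bin/splunk search \
    'index=linux_ports host=nationals-t1-bank-core-01 earliest=0 | head 20'
```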
Sourcetypes by Index
The following sourcetypes are found in each of the indexes available for download in this dataset:
| Index | Sourcetypes |
| --- | --- |
| ids | suricata:alert, suricata:http, suricata:smtp, suricata:ssh, suricata:stats, suricata:tls |
| linux_bandwidth | bandwidth |
| linux_bashhistory | bash_history |
| linux_cpu | cpu |
| linux_df | df |
| linux_fim | linux:fim:etc, linux:fim:tmp |
| linux_hardware | hardware |
| linux_interfaces | interfaces |
| linux_iostat | iostat |
| linux_last | lastlog |
| linux_lsof | lsof |
| linux_netstat | netstat |
| linux_packages | package |
| linux_passwd | Unix:UserAccounts |
| linux_ports | Unix:ListeningPorts, openPorts |
| linux_protocol | protocol |
| linux_ps | ps |
| linux_services | Unix:Service |
| linux_sshd | Unix:SSHDConfig |
| linux_time | time |
| linux_top | top |
| linux_uptime | Unix:Uptime |
| linux_version | Unix:Version |
| linux_vmstat | vmstat |
| linux_vsftp | Unix:VSFTPDConfig |
| linux_who | usersWithLoginPrivs, who |
| stream | stream:dns, stream:http, stream:tcp, stream:udp |
| summary | stash |
| threat_activity | stash |
| windows_activedirectory | MSAD:NT6:Health, MSAD:NT6:Replication, MSAD:NT6:SiteInfo, Powershell:ScriptExecutionErrorRecord, Powershell:ScriptExecutionSummary |
| windows_admon | ActiveDirectory |
| windows_cmdhist | XmlWinEventLog, powershell:transcript |
| windows_hostmon | WinHostMon |
| windows_network | WinNetMon |
| windows_perfmon | PerfmonMk:CPU, PerfmonMk:DFS_Replicated_Folders, PerfmonMk:DNS, PerfmonMk:LogicalDisk, PerfmonMk:Memory, PerfmonMk:NTDS, PerfmonMk:Network, PerfmonMk:Network_Interface, PerfmonMk:PhysicalDisk, PerfmonMk:Process, PerfmonMk:Processor, PerfmonMk:ProcessorInformation, PerfmonMk:System |
| windows_print | WinPrintMon |
| windows_regmon | WinRegistry |
| windows_sysmon | XmlWinEventLog:Microsoft-Windows-Sysmon/Operational |
| winevent_application | WinEventLog |
| winevent_dfsreplication | WinEventLog |
| winevent_directoryservice | WinEventLog |
| winevent_dns | WinEventLog |
| winevent_security | WinEventLog |
| winevent_system | WinEventLog |
Questions and Feedback
Please reach out to the CPTC research distribution list (research@nationalcptc.org) for further information about this dataset.
Licensing
This dataset is being made freely available to support various educational and research initiatives. While you are free to use this data for your own purposes, we ask that this dataset be attributed to the National Collegiate Penetration Testing Competition (National CPTC) in any publications or references.
About Hurricane Labs
Hurricane Labs is a dynamic Managed Services Provider that unlocks the potential of Splunk and security for diverse enterprises across the United States. With a dedicated, Splunk-focused team and an emphasis on humanity and collaboration, we provide the skills, resources, and results to help make our customers’ lives easier.
For more information, visit www.hurricanelabs.com and follow us on Twitter @hurricanelabs.
