How to Use the 2019 CPTC Security Dataset in Splunk

Published on: January 13th, 2020

In my previous blog post series, I shared my experiences helping lead the National Collegiate Penetration Testing Competition (National CPTC). One of the most exciting contributions of this event, beyond the real-world experiences of the competitors, is the data that we make available for research.

Since we’re focused on Splunk here at Hurricane Labs, we leverage this tool to collect as much data as we can to provide an insight into what teams are doing. Now we’re making all of the data we’ve collected available so that you can do interesting things with it!

For more information about the 2019 CPTC competition environment, please review the following: CPTC Review Part 1 and CPTC 2019–Finals Review.

This data is made available as frozen Splunk buckets, which can be imported into a Splunk instance and searched. We’ve tried to break this data up by log type wherever possible to help you pinpoint the data you’re interested in exploring without needing to download unnecessary data.

Getting the Data

We’ve made the dataset available on RIT’s mirrors, http://mirrors.rit.edu/cptc/2019/. This data is generally made available for both the regional and national events, and is organized by the type of data contained in the buckets (the Splunk sourcetype).

The data is organized into two sets, Regionals and Nationals. Each directory contains a number of tarballs, each associated with an index in the CPTC Splunk environment. See the list at the bottom of this post to get an idea of what sourcetypes are included in each index.

Using the Data

Once you’ve identified the indexes you would like to download, you will need to import these into your own Splunk instance for searching. This import procedure is referred to as “thawing” the frozen buckets.

For this example, I will use the linux_bashhistory index from the National competition. This is a good index to start with, since it’s quite small in size.

Start by installing Splunk on a Linux machine or using an existing Splunk instance. If you’re looking for a little extra help with this, you can check out the quick tutorial I’ve built out that walks you through the steps of installing Splunk on a Linux VM.

You can also deploy one of Splunk’s docker images or install Splunk on your PC or Mac, but the following steps will likely need some modification to work in those cases.

Create an Index

In your Splunk instance, create an index. This can be done in the webui by going to Settings -> Indexes -> New Index. Be sure that the index path specified has enough space to store the data you will be thawing. In this example, I’ll use the defaults, which will result in the thawed path being located at $SPLUNK_DB/INDEX_NAME/thaweddb. On a typical Splunk installation on Linux for this data, this will correspond to /opt/splunk/var/lib/splunk/linux_bashhistory/thaweddb.
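If you prefer the CLI to the web UI, the index can also be created from the command line. This is a sketch assuming a default Linux install under /opt/splunk and the linux_bashhistory index used in this example:

```shell
# Create the index (equivalent to Settings -> Indexes -> New Index in the web UI).
/opt/splunk/bin/splunk add index linux_bashhistory

# With the defaults, the thawed path for this index will be:
#   /opt/splunk/var/lib/splunk/linux_bashhistory/thaweddb
```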

Place the Data in the Thawed Directory

First, transfer the .tgz file downloaded from mirrors.rit.edu onto your Splunk system and untar it. Each of these files will untar into a folder with the index name.


You can use the tar -zxvf linux_bashhistory.tgz command to untar this file in the current directory. Once this is done, you'll see a linux_bashhistory directory containing the individual bucket directories.
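As a sketch, the untar and verification steps look like this (the exact bucket directory names in your download will vary):

```shell
# Untar the downloaded archive in the current directory.
tar -zxvf linux_bashhistory.tgz

# The extracted folder contains the frozen buckets. Splunk names these
# directories db_<latestTime>_<earliestTime>_<id> and rb_<latestTime>_<earliestTime>_<id>.
ls linux_bashhistory/
```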

The db_ and rb_ directories represent the individual buckets containing data that will need to be thawed. You will want to move all of these into the thawed path you defined earlier when you created the index in Splunk.

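A sketch of the move, assuming the default thawed path from the index created earlier (adjust the paths to match your installation):

```shell
# Move every db_* and rb_* bucket into the index's thawed path.
mv linux_bashhistory/db_* linux_bashhistory/rb_* \
   /opt/splunk/var/lib/splunk/linux_bashhistory/thaweddb/
```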

Thawing the Data

Next, we need to tell Splunk to re-index, or thaw, the data you just copied over; this is not an automatic process. Since there are multiple buckets, it is best to handle this with a script, such as the one available on Splunk Answers.

Note that the index name can be specified explicitly in that script's _thawedDBPath variable if you're only thawing a single index.

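The core of the thaw procedure is running splunk rebuild against each bucket in the thawed directory. Here is a minimal sketch of such a script (the filename and paths are assumptions; adapt them to your environment):

```shell
#!/bin/bash
# Sketch of a thaw script: rebuild every frozen bucket that was copied
# into an index's thaweddb directory so Splunk can search it.
SPLUNK_HOME=/opt/splunk
THAWED_DB_PATH="$SPLUNK_HOME/var/lib/splunk/linux_bashhistory/thaweddb"

for bucket in "$THAWED_DB_PATH"/db_* "$THAWED_DB_PATH"/rb_*; do
    [ -d "$bucket" ] || continue          # skip non-matching globs
    echo "Rebuilding $bucket"
    "$SPLUNK_HOME/bin/splunk" rebuild "$bucket"
done
```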

Once this script is in place, run it on the Splunk server.

Once this process is finished, you must restart Splunk in order for the data to be searchable.
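The restart itself is the standard Splunk restart (path assumes a default Linux install):

```shell
# Restart Splunk so the newly rebuilt buckets become searchable.
/opt/splunk/bin/splunk restart
```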

Searching the Data

Once this process is completed and Splunk has been restarted, your data will now be searchable in Splunk. Be sure to pick a time range that covers the periods associated with the event. For this example, searching All time works well since there are not too many events.
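For instance, a quick sanity check against the thawed index (using the bash_history sourcetype listed at the bottom of this post) might look like:

```
index=linux_bashhistory sourcetype=bash_history
| stats count by host
| sort - count
```

This shows which team hosts generated the most bash history events.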

Splunk Apps

In order to get the most value from this Splunk data, you may need to install Splunk apps into your Splunk instance. Both the Splunk Add-on for Microsoft Windows and the Splunk Add-on for Unix and Linux are helpful to have installed when working with this data. Other data sources will require other apps to have working field extractions.

How the Data is Structured

In the National CPTC events, each team is provided with an identical, dedicated environment. During the 2019 season, these were separate instances in Google Cloud.

Team Assignments

Teams competed in two events: regionals (October 12 & 13, 2019) and nationals (November 22-24, 2019). There were six regional events: North Eastern, South Eastern, New England, Central, Western, and International. Each regional event had up to 10 teams. The winning teams from each region (6 total) were invited to compete in the National competition, along with the next 4 highest-ranked teams at large.

Event Ranges

Events were exported for the time periods of the regional and national competitions. The following epoch timestamps were used for the data export:

Nationals:

  • _earliestTime=1574463600
  • _latestTime=1574582400

Regionals:

  • _earliestTime=1570842000
  • _latestTime=1570964400

Because of limitations in how Splunk exports frozen data, you may see events that occur outside of these time windows. These should be ignored, as they are not necessarily complete.
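If you'd like to enforce these windows in your own searches, the epoch values can be passed directly as time modifiers. For example, to restrict a search to the Nationals export window:

```
index=linux_bashhistory earliest=1574463600 latest=1574582400
| timechart span=1h count
```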

Hostnames

All hostnames have a prefix and team number. The prefix represents the region, and the team number represents an individual team in that region. A few examples of hostnames:

  • nationals-t2-vdi-kali01: A Linux VDI instance for Team 2 in the Nationals event
  • newengland-t6-vdi-kali01: A Linux VDI instance for Team 6 in the New England regional
  • nationals-t1-bank-core-01: The DinoBank core application host for Nationals Team 1

The host field in Splunk is set when the data is collected, based on the region and team number. There is no local indication on the hosts themselves about their region. This means that a host showing up as host=nationals-t1-bank-core-01 in Splunk is actually running with the host name on the local operating system set to bank-core-01.

This naming structure allows you to search the data in various ways, depending on if you are looking for everything from a certain region, team, or type of host. Here are some sample searches you may want to try:

  • host=nationals-t* – data from all teams that participated in Nationals
  • host=nationals-t1-* – data from all of the hosts for Nationals Team 1 (note the trailing dash, t1* would include both Team 1 and Team 10)
  • host=*kali0* – data from all the Kali VDI instances for all teams (note, Splunk searches with a leading * are not very efficient, so avoid them if possible)

You may also note other hosts that do not follow these naming conventions. Many of these represent build and test environments which were not used by students. The following prefixes are the only ones that will contain student competitor data:

Regional host prefixes:

  • northeastern
  • southeastern
  • newengland
  • central
  • western
  • international

National event prefixes:

  • nationals

Team numbers range from 0 to 10. For regionals, not all regions have 10 teams.

What We Collected

We attempted to collect data from as many locations as possible across all of the student systems (both VDI/sources and targets) throughout the environment. The types of data available will vary based on the operating systems and roles of each host.

OS-Independent Sources

The following logs are generally available for both Windows and Linux hosts:

  • IDS events (ids index): suricata:http, suricata:stats, suricata:alert, suricata:tls, suricata:ssh
  • Splunk App for Stream events (stream index): stream:dns, stream:udp, stream:tcp, stream:http

Windows Sources

Nearly all data types available for collection by the Splunk Add-on for Microsoft Windows (https://splunkbase.splunk.com/app/742/) were collected, with collection intervals tuned to provide a higher sampling rate. All of the Windows-specific indexes are prefixed with index=win*, and Windows Event Log data sources log to index=winevent*.

The Windows data available generally includes the following:

  • Windows Perfmon data
  • Windows Event Log (Security, System, Application, etc.)
  • WinNetMon
  • WinHostMon
  • WinRegistry
  • Windows Sysmon
  • Powershell Transcript Logs

Linux Sources

Nearly all data types available for collection by the Splunk Add-on for Unix and Linux were collected, with collection intervals tuned to provide a higher sampling rate. All of the Linux-specific indexes are prefixed with index=linux*.

The Linux data available generally includes the following:

  • Bash history
  • Common diagnostic tools: df, ps, top, lsof, netstat, vmstat, who
  • Open ports
  • Network and interface information
  • Package information
  • Contents of the /etc and /tmp directories

Sourcetypes by Index

The following sourcetypes are available in each of the indexes available for download in this dataset:

  • ids: suricata:alert, suricata:http, suricata:smtp, suricata:ssh, suricata:stats, suricata:tls
  • linux_bandwidth: bandwidth
  • linux_bashhistory: bash_history
  • linux_cpu: cpu
  • linux_df: df
  • linux_fim: linux:fim:etc, linux:fim:tmp
  • linux_hardware: hardware
  • linux_interfaces: interfaces
  • linux_iostat: iostat
  • linux_last: lastlog
  • linux_lsof: lsof
  • linux_netstat: netstat
  • linux_packages: package
  • linux_passwd: Unix:UserAccounts
  • linux_ports: Unix:ListeningPorts, openPorts
  • linux_protocol: protocol
  • linux_ps: ps
  • linux_services: Unix:Service
  • linux_sshd: Unix:SSHDConfig
  • linux_time: time
  • linux_top: top
  • linux_uptime: Unix:Uptime
  • linux_version: Unix:Version
  • linux_vmstat: vmstat
  • linux_vsftp: Unix:VSFTPDConfig
  • linux_who: usersWithLoginPrivs, who
  • stream: stream:dns, stream:http, stream:tcp, stream:udp
  • summary: stash
  • threat_activity: stash
  • windows_activedirectory: MSAD:NT6:Health, MSAD:NT6:Replication, MSAD:NT6:SiteInfo, Powershell:ScriptExecutionErrorRecord, Powershell:ScriptExecutionSummary
  • windows_admon: ActiveDirectory
  • windows_cmdhist: XmlWinEventLog, powershell:transcript
  • windows_hostmon: WinHostMon
  • windows_network: WinNetMon
  • windows_perfmon: PerfmonMk:CPU, PerfmonMk:DFS_Replicated_Folders, PerfmonMk:DNS, PerfmonMk:LogicalDisk, PerfmonMk:Memory, PerfmonMk:NTDS, PerfmonMk:Network, PerfmonMk:Network_Interface, PerfmonMk:PhysicalDisk, PerfmonMk:Process, PerfmonMk:Processor, PerfmonMk:ProcessorInformation, PerfmonMk:System
  • windows_print: WinPrintMon
  • windows_regmon: WinRegistry
  • windows_sysmon: XmlWinEventLog:Microsoft-Windows-Sysmon/Operational
  • winevent_application: WinEventLog
  • winevent_dfsreplication: WinEventLog
  • winevent_directoryservice: WinEventLog
  • winevent_dns: WinEventLog
  • winevent_security: WinEventLog
  • winevent_system: WinEventLog

Questions and Feedback

Please reach out to the CPTC research distribution list (research@nationalcptc.org) for further information about this dataset.

Licensing

This dataset is being made freely available to support various educational and research initiatives. While you are free to use this data for your own purposes, we ask that this dataset be attributed to the National Collegiate Penetration Testing Competition (National CPTC) in any publications or references.

About Hurricane Labs

Hurricane Labs is a dynamic Managed Services Provider that unlocks the potential of Splunk and security for diverse enterprises across the United States. With a dedicated, Splunk-focused team and an emphasis on humanity and collaboration, we provide the skills, resources, and results to help make our customers’ lives easier.

For more information, visit www.hurricanelabs.com and follow us on Twitter @hurricanelabs.