Dataset

The Stratosphere IPS Project has a sister project, the Malware Capture Facility Project, which is responsible for making the long-term captures. That project continually captures malware and normal data to feed the Stratosphere IPS.

Why we capture Normal Data

Machine learning algorithms need to be verified on real data to measure their actual performance. Good datasets are especially important in computer network security, because network data is effectively unbounded, varied, constantly changing, and subject to strong concept drift. These issues force us to obtain good datasets to train, verify and test the algorithms.

For a good verification we need three types of traffic: Malware, Normal and Background. The Malware traffic includes everything we want to detect, especially C&C (Command and Control) connections. The Normal traffic is essential for measuring the real performance of our algorithms, because it lets us compute the False Positives and True Negatives. The Background traffic is needed to saturate the algorithms, to verify their memory and speed performance, and to test whether they get confused by the data.
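As an illustration of how the three traffic types enter a verification, the following minimal sketch counts detections over labeled flows. It assumes each flow carries one of the labels "Malware", "Normal" or "Background" plus a boolean detector verdict; the function names and the choice to leave Background flows unscored are assumptions made for this example, not the project's own evaluation method.

```python
from collections import Counter


def confusion_counts(true_labels, predictions):
    """Count TP/FN on Malware flows and FP/TN on Normal flows.

    Background flows are skipped here because their ground truth is unknown
    (an assumption of this sketch).
    """
    counts = Counter()
    for truth, predicted_malicious in zip(true_labels, predictions):
        if truth == "Background":
            continue
        if truth == "Malware":
            counts["TP" if predicted_malicious else "FN"] += 1
        elif truth == "Normal":
            counts["FP" if predicted_malicious else "TN"] += 1
        else:
            raise ValueError(f"unexpected label: {truth!r}")
    return counts


def false_positive_rate(counts):
    """FP / (FP + TN), or 0.0 when there are no Normal flows."""
    negatives = counts["FP"] + counts["TN"]
    return counts["FP"] / negatives if negatives else 0.0


if __name__ == "__main__":
    labels = ["Malware", "Normal", "Normal", "Background", "Malware", "Normal"]
    verdicts = [True, False, True, True, False, False]
    counts = confusion_counts(labels, verdicts)
    print(dict(counts), false_positive_rate(counts))
```

Without Normal flows the FP and TN counts stay at zero, which is why the false positive rate cannot be estimated from malware captures alone.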

Special Dataset CTU-13

The CTU-13 dataset consists of 13 different malware captures made in a real network environment. The captures include Botnet, Normal and Background traffic. The Botnet traffic comes from the infected hosts, the Normal traffic comes from the verified normal hosts, and the Background traffic is all the remaining traffic, whose nature we do not know for sure. The dataset is labeled on a flow-by-flow basis, which makes it one of the largest and most thoroughly labeled botnet datasets available. The following files can be downloaded:

  • Binetflow files
    • For Botnet, Normal and Background traffic.
    • Text files with bidirectional flows generated by Argus.
  • Biargus files
    • For Botnet, Normal and Background traffic.
    • Binary files with bidirectional flows generated by Argus.
  • Complete Pcap files
    • For Botnet traffic.
    • Pcap files with all the payload data.
  • Truncated Pcap files
    • For Botnet, Normal and Background traffic.
    • Pcap files with header information only.
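
The flow-by-flow labels in the Binetflow text files can be tallied with a few lines of Python. The sketch below is illustrative only. It assumes the files are comma-separated, start with a header row, and have a `Label` column whose value contains the word Botnet, Normal or Background. The column names and the three sample rows are invented for the example and should be checked against a real downloaded file.

```python
import csv
import io
from collections import Counter


def traffic_class(label):
    """Map a raw flow label to one of the three traffic types.

    Assumption: the label text contains the traffic type as a substring.
    """
    for name in ("Botnet", "Normal", "Background"):
        if name in label:
            return name
    return "Unknown"


def count_classes(lines):
    """Count flows per traffic type from an iterable of binetflow text lines."""
    reader = csv.DictReader(lines)
    return Counter(traffic_class(row["Label"]) for row in reader)


# Invented sample data (documentation IP ranges), NOT taken from the dataset.
SAMPLE = """\
StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,TotPkts,TotBytes,Label
2000/01/01 00:00:01,1.5,udp,192.0.2.10,1025,<->,198.51.100.5,53,CON,2,150,flow=Background-UDP
2000/01/01 00:00:02,0.3,tcp,192.0.2.20,1026,->,198.51.100.6,80,FSPA,10,900,flow=From-Botnet-TCP
2000/01/01 00:00:03,0.1,tcp,192.0.2.30,1027,->,198.51.100.7,443,FSPA,8,700,flow=From-Normal-TCP
"""

if __name__ == "__main__":
    print(count_classes(io.StringIO(SAMPLE)))
```

For a real file, the same function can be given an open file handle, for example `count_classes(open(path, newline=""))`, where `path` points to a downloaded Binetflow file.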

The CTU-13 dataset is published under the Creative Commons CC-BY license.

The CTU-13 dataset can be downloaded from the following link:

  • CTU-13-Dataset (Large dataset of 13 captures with Malware, Normal and Background traffic)

An explanation of its characteristics can be read here.

Backup Site for the CTU-13 dataset

If our main file repository is not working, you can still find the files of the CTU-13 dataset here.

Datasets

The password for all the zip files containing malware is: infected
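
A password-protected archive can be unpacked from Python as sketched below. The helper name `extract_capture` and its arguments are hypothetical. Python's standard `zipfile` module can only decrypt the legacy ZIP encryption scheme, so this assumes the archives use it; if they do not, an external unzip tool is needed instead.

```python
import zipfile
from pathlib import Path


def extract_capture(zip_path, dest_dir, password=b"infected"):
    """Extract every member of zip_path into dest_dir and return their names.

    The password is only used for encrypted members; zipfile expects bytes.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(path=dest, pwd=password)
        return archive.namelist()
```

Because the archives contain live malware, it is sensible to extract them only inside an isolated analysis machine.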

Mixed Captures


Written by Sebastian Garcia in Dataset on Thu 09 April 2015. Tags: dataset, malware, botnet, normal, captures