Three Years of Publishing Malware Traffic Datasets

The Stratosphere IPS is a behavior-based intrusion detection and prevention system. It uses machine learning algorithms to detect malicious behaviors. To do that, we create models based on real malware behaviors, which helps ensure good accuracy and performance of our IPS. For this reason, in 2015 we started our sister project, the 'Malware Capture Facility Project'.

Malware Capture Facility Project (MCFP)

The MCFP is a project created in 2015 at the Czech Technical University AIC Group and is still ongoing. The goal of this project is simple: to capture real long-term malware traffic and make the captured data public for everyone to use.

It has already been three years since the MCFP started! During this time, we have captured and published 369 malware traffic captures.

Challenges along the way

There are some inherent challenges to consider when creating datasets of this particular nature. Here are some we have encountered over the last few years:

  • The relevance of the datasets fades over time: no matter how good a malware traffic dataset is, its usefulness decreases over time. We put a lot of effort into communicating the existence of new datasets so people can use them while they have the most value. Other than that, we have no influence on this issue.
  • Diversity of the malware: it is extremely challenging to choose what to execute. We want variety, but we also want to keep capturing new versions of certain malware families. How much malware of the same family do we execute? How often do we execute malware of the same family? How do we pick what is new and worth executing? We deal with these questions on a daily basis.
  • Analysis accompanying the malware captures: executing malware is (relatively) fast, but we have always wanted to also provide some analysis of the captures. With the number of captures being generated, this proves quite difficult.

Impact

The malware datasets generated by MCFP and StratosphereIPS have been used in many academic works, which gives us a lot of pleasure! Here are some works that have cited the datasets, especially the CTU-13 dataset: