The Stratosphere IPS is a behavior-based intrusion detection and prevention system. It uses machine learning algorithms to detect malicious behaviors. To do that, we create models based on real malware behaviors, which helps ensure good accuracy and performance of our IPS. For this reason, in 2015 we started our sister project, the 'Malware Capture Facility Project'.
Malware Capture Facility Project (MCFP)
The MCFP is a project created in 2015 at the Czech Technical University AIC Group and is still ongoing. The goal of this project is simple: to capture real long-term malware traffic and make the captured data public for everyone to use.
It has already been three years since the MCFP project started! During this time, we have captured and published 369 malware traffic captures.
Challenges along the way
There are some inherent challenges to consider when creating datasets of this particular nature. Here are some we have encountered over these last years:
- The relevance of the datasets fades over time: no matter how good a malware traffic dataset is, its usefulness decreases over time. We put a lot of effort into communicating the existence of new datasets so people can use them while they have the most value. Other than that, we have no influence on this issue.
- Diversity of the malware: it is extremely challenging to choose what to execute. We want variety, but we also want to keep capturing new versions of certain malware families. How much malware of the same family do we execute? How often do we execute malware of the same family? How do we pick what is new and worth executing? We deal with these questions on a daily basis.
- Analysis accompanying the malware captures: executing malware is (relatively) fast, but we have always wanted to also provide some analysis of the captures. With the number of captures being generated, this proves quite difficult.
Impact
The malware datasets generated by MCFP and StratosphereIPS have been used in many academic works, which gives us a lot of pleasure! Here are some works that have cited the datasets, especially the CTU-13 dataset:
- F. Haddadi, D. T. Phan and A. N. Zincir-Heywood, "How to choose from different botnet detection systems?," NOMS 2016 - 2016 IEEE/IFIP Network Operations and Management Symposium, Istanbul, 2016, pp. 1079-1084. doi: 10.1109/NOMS.2016.7502964
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7502964&isnumber=7502779
- Guerra, J., & Catania, C. (2017). Improving the Generation of Labeled Network Traffic Datasets Through Machine Learning Techniques. In XXIII Congreso Argentino de Ciencias de la Computación (La Plata, 2017).
URL: http://sedici.unlp.edu.ar/handle/10915/63933
- Kalaivani, P., & Vijaya, M. S. Mining Based Detection of botnet traffic in Network Flow. IRACST - International Journal of Computer Science and Information Technology & Security (IJCSITS), ISSN: 2249-9555 Vol.6, No1, Jan-Feb 2016
URL: https://ijcsits.org/papers/vol6no12016/16vol6no1.pdf
- Khanchi, S., Vahdat, A., Heywood, M. I., & Zincir-Heywood, A. N. (2017). On botnet detection with genetic programming under streaming data label budgets and class imbalance. Swarm and Evolutionary Computation.
URL: https://www.sciencedirect.com/science/article/abs/pii/S2210650216304473