With the increasing use of technology and the growing number of cyber-attacks, robust and representative security datasets are crucial for learning how to build better tools to detect attacks. While security datasets have been valuable in advancing cybersecurity research, most existing datasets are limited in scope and do not capture the full range of threats and vulnerabilities. Improved datasets that address these limitations would enable faster progress in cybersecurity research. Our approach is to design a new network security dataset informed by interviews with the community, built on real-world network traffic, and extended with known security attacks that we executed to make the dataset diverse and representative. The CTU-SME-11 dataset includes seven days of network traffic from eleven devices connected in an internal network. The devices differ in operating system, hardware, and intended use, which makes the dataset very heterogeneous. Apart from human-generated benign traffic, the dataset includes malware captures, attacks from inside the network and from the internet, and attacks with data exfiltration. The greatest value of this dataset is its ground-truth labels, which allow users to evaluate the performance of their models and algorithms accurately. This thesis describes the whole process of creating a network dataset of normal, malware, attack, and background traffic on a real network. In total, the CTU-SME-11 dataset contains around 160 GB of PCAP files and around 99,000,000 expert-labeled network flows. We hope that this dataset will serve as a foundation for future research on network security datasets and will become a new benchmark for the cybersecurity community.
URL: https://dspace.cvut.cz/handle/10467/109254?show=full