This dataset was created as part of the Avast-AIC laboratory with the funding of Avast Software.
We are pleased to finally release the IoT-23 Dataset, a new dataset of malicious and benign network traffic from real IoT devices. In this blog post we describe the dataset, how it was generated and where you can download it. We hope this dataset will draw researchers toward Internet of Things security and encourage the creation of new and better ways to protect the millions of IoT devices already deployed in industry and, more importantly, at home.
The IoT-23 Dataset
IoT-23 is a new dataset of network traffic from Internet of Things (IoT) devices. Its goal is to offer a large dataset of real and labeled IoT malware infections and IoT benign traffic for researchers to develop machine learning algorithms.
The IoT-23 Dataset contains 20 captures of malware executed on IoT devices, and 3 captures of benign IoT device traffic. The dataset contains more than 760 million packets and 325 million labeled flows from more than 500 hours of traffic. The captures were taken during 2018 and 2019 at the Stratosphere Laboratory, AIC group, FEL, CTU University, Czech Republic. This dataset and the associated research are funded by Avast Software, Prague.
The Structure of the Dataset
The IoT-23 dataset consists of twenty-three captures, called scenarios, of different IoT network traffic. There are twenty malicious and three benign scenarios.
Every scenario contains the following basic information, among other content that will be described later:
README.md: this file has the capture and malware information, such as the probable malware name; the md5, sha1 and sha256 hashes of the malware binary; the duration of the capture in seconds; the link to the malware file on VirusTotal; and a short description of the files inside the folder.
.pcap: this is the original packet capture file from the network traffic capture.
conn.log.labeled: this file contains the netflows generated by the Zeek/Bro IDS, with labels added.
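As a rough illustration of how a conn.log.labeled file might be read, the sketch below parses Zeek's tab-separated log format, taking the column names from the #fields header line. The sample rows, the IP address and the `label` column name are assumptions made up for this example; check the header of the actual files before relying on them, since the columns and separators there may differ.

```python
def parse_zeek_log(lines):
    """Yield one dict per flow from the lines of a Zeek tab-separated log."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Column names follow the '#fields' token, separated by tabs.
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#"):
            # Every other non-comment line is one flow record.
            yield dict(zip(fields, line.split("\t")))


# Made-up sample with only a few columns; real files have many more.
SAMPLE = [
    "#separator \\x09",
    "#fields\tts\tid.orig_h\tid.resp_p\tproto\tlabel",
    "1545403816.96\t192.168.1.195\t23\ttcp\tPartOfAHorizontalPortScan",
    "1545403817.10\t192.168.1.195\t123\tudp\tBenign",
    "#close\t2019-01-01-00-00-00",
]

flows = list(parse_zeek_log(SAMPLE))
for flow in flows:
    print(flow["id.resp_p"], flow["proto"], flow["label"])
```

Because the function is a generator, it can also be fed an open file object, which keeps memory use low for logs with millions of flows.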
Labels
Both benign and malicious traffic flows have two new columns with labels that describe the network behavior. These labels are assigned using the following process:
The original .pcap file is analyzed manually by a human analyst.
The suspicious flows are detected and labels are assigned in an analysis dashboard.
A labels.csv file is generated by the analyst. It contains a set of rules that are later used to label each netflow.
We created a Python script that reads the data of each netflow (conn.log) and compares it with the rules found in the labels.csv file. If the netflow fits the labeling criteria, the corresponding label is added.
At the end of this process, a new conn.log.labeled file exists, which contains the original netflows plus the new label based on the analyst's manual analysis.
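The rule-matching step described above can be sketched roughly as follows. This is not the script used to build the dataset: the layout of labels.csv is not described in this post, so the rule columns, the "empty cell matches anything" convention, the first-match-wins order and the default `Benign` label are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical rule file: the real labels.csv layout may be different.
RULES_CSV = """id.resp_h,id.resp_p,proto,label
,23,tcp,PartOfAHorizontalPortScan
10.0.0.5,,,DDoS
"""


def load_rules(text):
    """Parse rules; an empty cell means 'match any value' for that field."""
    rules = []
    for row in csv.DictReader(io.StringIO(text)):
        label = row.pop("label")
        conditions = {field: value for field, value in row.items() if value}
        rules.append((conditions, label))
    return rules


def label_flow(flow, rules, default="Benign"):
    """Return the label of the first rule whose conditions all match the flow."""
    for conditions, label in rules:
        if all(flow.get(field) == value for field, value in conditions.items()):
            return label
    return default


rules = load_rules(RULES_CSV)
# Made-up flows, with values kept as strings as they appear in a text log.
flows = [
    {"id.resp_h": "8.8.8.8", "id.resp_p": "23", "proto": "tcp"},
    {"id.resp_h": "10.0.0.5", "id.resp_p": "80", "proto": "tcp"},
    {"id.resp_h": "8.8.8.8", "id.resp_p": "53", "proto": "udp"},
]
labels = [label_flow(flow, rules) for flow in flows]
print(labels)
```

Keeping the rules in a separate file, as the process above does, means the analyst's decisions can be reviewed and reapplied without touching the original netflows.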
The Creation of the Dataset
All the IoT traffic captures that are included in the IoT-23 Dataset were captured from real IoT devices in the Aposemat Project, Avast-AIC Laboratory in Prague during 2018 and 2019.
The malicious scenarios were created by executing a specific malware sample on a Raspberry Pi. In this dataset we include traffic from Mirai, Torii, Hide and Seek, Hajime and others. The physical setup of our laboratory grew over time, but you can see one of our early setups in the image below.
Malware captures are executed for long periods of time. Due to the large size of the traffic generated by each infection, we rotate the pcaps every 24 hours. However, in some cases the capture files grew so fast that the capture was stopped before the twenty-four hours were completed. For that reason, some of the captures differ in the number of hours, as you may see later when we describe the characteristics of the scenarios.
The network traffic for the benign scenarios was obtained by capturing the traffic of three different IoT devices: a Philips HUE smart LED lamp, an Amazon Echo home intelligent personal assistant and a Somfy Smart Door Lock. It is important to mention that these three IoT devices are real hardware and not simulated (see Images 2, 3 and 4). Having real IoT devices allows us to capture and analyze real network behavior without the bias or issues that typically come from simulated traffic.
Both malicious and benign scenarios were run in a controlled network environment with an unrestricted internet connection, like any other real IoT device.
Characteristics of the IoT-23 Dataset
IoT-23 Malicious Scenarios
In Table 1 below we highlight some characteristics of each scenario, such as the scenario number (ID), the name of the dataset, the duration in hours, the number of packets, the number of Zeek flows in the conn.log file, the size of the original pcap file and the possible name of the malware used to infect the device.
We aimed at having a diverse set of malware, but at the same time we tried to capture possible evolutions or changes within the same malware family.
To obtain some extra data about the network traffic generated by each infected device, we used Zeek's application layer protocol prediction to filter and summarize this information. The result is summarized in Table 2. Some protocols, however, were not recognized by Zeek, so we added a column that quantifies all of these flows.
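A per-protocol summary of this kind could be produced along the following lines. This is a minimal sketch, not the procedure behind Table 2: it assumes each flow is a dict with Zeek's `service` column, treats Zeek's unset marker `-` as "unrecognized", and uses made-up flows.

```python
from collections import Counter


def summarize_services(flows):
    """Count flows per Zeek 'service' value; unset values become 'unrecognized'."""
    counts = Counter()
    for flow in flows:
        service = flow.get("service", "-")
        key = "unrecognized" if service in ("-", "") else service
        counts[key] += 1
    return counts


# Made-up flows for illustration only.
flows = [
    {"service": "dns"},
    {"service": "-"},
    {"service": "http"},
    {"service": "dns"},
    {"service": "-"},
]
summary = summarize_services(flows)
for name, count in summary.most_common():
    print(f"{name}: {count}")
```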
IoT-23 Benign Scenarios
In Table 3 we show some of the characteristics of the benign scenarios, including information regarding the duration, number of packets, number of Zeek flows, pcap file and the name of the device.
In the next table, Table 4, we show the application layer protocols detected for each of the benign scenarios.
In Table 5 below, we show a breakdown of the number of flows per label across the complete IoT-23 Dataset. The numbers are shown in log scale. The three most common malicious labels are PartOfAHorizontalPortScan (213,852,924 flows), Okiru (47,381,241 flows) and DDoS (19,538,713 flows), while the three least common malicious labels are C&C-Mirai (2 flows), PartOfAHorizontalPortScan-Attack (5 flows) and C&C-HeartBeat-FileDownload (11 flows). It's important to clarify that this table only shows the labels of the twenty malicious scenarios. The three benign scenarios are not included because they would only increase the total of the benign label.
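To see why a log scale is needed, the short sketch below takes the six flow counts quoted above and computes their base-10 logarithms. The counts come from the text; the choice of base 10 is an assumption, since the post does not say which base the table uses.

```python
import math

# Flow counts per label, copied from the paragraph above.
label_counts = {
    "PartOfAHorizontalPortScan": 213_852_924,
    "Okiru": 47_381_241,
    "DDoS": 19_538_713,
    "C&C-HeartBeat-FileDownload": 11,
    "PartOfAHorizontalPortScan-Attack": 5,
    "C&C-Mirai": 2,
}

# Base-10 logarithm of each count (assumed base for the log-scale table).
log_counts = {label: math.log10(count) for label, count in label_counts.items()}

for label, count in label_counts.items():
    print(f"{label:35s} {count:>12,d}  log10 = {log_counts[label]:.2f}")
```

The most common label comes out at roughly 8.3 and the least common at roughly 0.3, a spread of eight orders of magnitude that would make the rare labels invisible on a linear axis.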
Download the IoT-23 Dataset
The IoT-23 Dataset is fully available at https://www.stratosphereips.org/datasets-iot23
To make a dataset of this size easier to download, we provide two additional options: a full compressed file containing the whole dataset, and a smaller compressed file with only the netflows of each scenario.
Full download link (20 GB): The Full Download includes the original .pcap, README.md and conn.log.labeled files. The size for the full version is 20 GB.
https://mcfp.felk.cvut.cz/publicDatasets/IoT-23-Dataset/iot_23_datasets_full.tar.gz
Small download link (8.7 GB): The Small Download is a light version that only contains the README.md and the conn.log file. The size for the small version is 8.7 GB.
https://mcfp.felk.cvut.cz/publicDatasets/IoT-23-Dataset/iot_23_datasets_small.tar.gz
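Once one of the archives has been downloaded, the flow logs can be read straight from the .tar.gz file without unpacking everything to disk. The sketch below counts the non-comment lines of every member whose name ends in a given suffix. The internal folder layout of the archives is not described here, so the member name `scenario-1/conn.log.labeled`, the default suffix and the demo archive built in a temporary directory are all assumptions for illustration.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def count_flows_in_archive(archive_path, suffix="conn.log.labeled"):
    """Return {member name: number of flow lines} for matching log files."""
    counts = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                stream = tar.extractfile(member)
                # Skip blank lines and Zeek header lines, which start with '#'.
                counts[member.name] = sum(
                    1 for raw in stream
                    if raw.strip() and not raw.startswith(b"#")
                )
    return counts


# Build a tiny stand-in archive so the sketch runs without the real download.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "demo.tar.gz"
    data = b"#fields\tts\tlabel\n1.0\tBenign\n2.0\tDDoS\n"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo("scenario-1/conn.log.labeled")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    demo_counts = count_flows_in_archive(archive)
    print(demo_counts)
```

For the real files, pass the path of the downloaded archive and adjust `suffix` to match the log file names it actually contains.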
Additional Links
For a complete explanation of the labels and how they are assigned, please visit the IoT-23 Dataset webpage, where we describe the labelling process in depth: https://www.stratosphereips.org/datasets-iot23.
Citation
If you are using this dataset for your research, please reference it as “Stratosphere Laboratory. A labeled dataset with malicious and benign IoT network traffic. January 22nd. Agustin Parmisano, Sebastian Garcia, Maria Jose Erquiaga.
https://www.stratosphereips.org/datasets-iot23”
Contact
If you have further questions, don’t hesitate to contact us!
aposemat@aic.fel.cvut.cz
Acknowledgement
This research was done as part of our ongoing collaboration with Avast Software in the Aposemat project. The Aposemat project is funded by Avast Software.