Zeek Package: IRC Feature Extractor
The IRC Feature Extractor Zeek package extends the functionality of the Zeek [1] network analysis framework. We created it to automatically recognize IRC communication in a packet capture (pcap) file and to extract features from it. The goal of the feature extraction is to describe each individual IRC communication in the pcap file as accurately as possible. The package was created during our research in the Aposemat project [3], a joint project between Avast and CVUT, where we proposed a technique for detecting malicious IRC communication in network traffic.
Installation
To install the package with the Zeek Package Manager [2], run the following command in a terminal:
$ zkg install IRC-Zeek-package
Run
To extract the IRC features from a pcap file that contains IRC traffic, run the following command:
$ zeek IRC-Zeek-package -r file.pcap
The output will be stored in the irc_features.log file in Zeek log format. The log will look like this:
#separator \x09
#set_separator	,
#empty_field	(empty)
#unset_field	-
#path	irc_features
#open	2020-01-27-21-54-41
#fields	src	src_ip	src_ports_count	dst	dst_ip	dst_port	start_time	end_time	duration	msg_count	size_total	periodicity	spec_chars_username_mean	spec_chars_msg_mean	msg_word_entropy
#types	string	addr	count	string	addr	port	time	time	double	count	int	double	double	double	double
T!T@null	192.168.100.103	4	#a925d765	111.230.241.23	2407	1532322898.819018	1534860900.996931	2538002.177913	23	48705	0.050294	0.25	0.908075	2.001506
T!T@null	192.168.100.103	33	#a925d765	185.61.149.22	2407	1530166710.153128	1535620500.535362	5453790.382234	231	562890	1.0	0.25	0.908256	2.276218
#close	2020-01-27-21-54-41
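Every header line consists of a line descriptor followed by the content that the descriptor describes. The first five lines (#separator to #path) define the predefined values that determine the structure of the log. The #open line records the time when the package started the evaluation, and the #close line, which is the last line of the log, records the time when it finished. The #fields line lists the extracted feature names and the #types line lists the data type of each feature. Every line between #types and #close holds the feature values of one IRC connection; the sample above contains two such lines.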
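The log structure described above can be read with a few lines of Python. This is a minimal sketch, not part of the package: the function name parse_zeek_log is hypothetical, the sample is an abbreviated excerpt of the log above with only six of the fifteen fields, and it assumes that fields are tab-separated and that no data row starts with the # character.

```python
def parse_zeek_log(text):
    """Return one dict per data row, keyed by the names on the #fields line."""
    fields, rows = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]  # drop the "#fields" descriptor
        elif line.startswith("#"):
            continue  # other header/footer lines (#separator, #open, #close, ...)
        else:
            rows.append(dict(zip(fields, line.split("\t"))))
    return rows


# Abbreviated excerpt of the sample log (assumption: tab-separated columns).
SAMPLE = "\n".join([
    "#separator \\x09",
    "#path\tirc_features",
    "#fields\tsrc\tsrc_ip\tsrc_ports_count\tdst\tdst_port\tperiodicity",
    "#types\tstring\taddr\tcount\tstring\tport\tdouble",
    "T!T@null\t192.168.100.103\t4\t#a925d765\t2407\t0.050294",
    "#close\t2020-01-27-21-54-41",
])

rows = parse_zeek_log(SAMPLE)
print(rows[0]["src_ip"], rows[0]["periodicity"])
```

All values are returned as strings; converting them according to the #types line is left out to keep the sketch short.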
Package Description
After the network traffic capture was obtained, we extracted the features as follows. We split the whole pcap into IRC connections, one for each individual user. In our research, an IRC connection is the flow identified by the source IP, destination IP, and destination port. The source port is ignored in this separation so that a single IRC connection can include multiple TCP connections: when a new TCP connection is established between two IP addresses, the source port is chosen randomly from the unregistered port range, which is why the source port differs across TCP connections. This is shown in Figure 1, where there are two connections from the source IP address (192.168.0.1) to the same destination IP address (192.168.0.2) using different source ports.
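The grouping rule can be illustrated with a short Python sketch. It is not the package's Zeek code: the function name, the dictionary layout, and the port numbers (6667, 49152, 49153) are hypothetical and chosen only to mirror the two-connection situation described for Figure 1.

```python
from collections import defaultdict


def group_irc_connections(tcp_flows):
    """Group TCP flows by (src_ip, dst_ip, dst_port), ignoring the source port."""
    groups = defaultdict(list)
    for flow in tcp_flows:
        key = (flow["src_ip"], flow["dst_ip"], flow["dst_port"])
        groups[key].append(flow)
    return dict(groups)


# Two TCP connections between the same hosts that differ only in source port.
flows = [
    {"src_ip": "192.168.0.1", "src_port": 49152,
     "dst_ip": "192.168.0.2", "dst_port": 6667},
    {"src_ip": "192.168.0.1", "src_port": 49153,
     "dst_ip": "192.168.0.2", "dst_port": 6667},
]

connections = group_irc_connections(flows)
print(len(connections))  # both flows fall into a single IRC connection
```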
Extracted Features
Here we describe the complete list of features that the package extracts for each IRC connection obtained from a pcap file. The features were chosen manually to provide a meaningful representation of the IRC connection, biased towards the malware detection problem we were trying to solve.
Total Packet Size
Total size, in bytes, of all packets that were sent in the IRC connection. It reflects how many messages were sent and how long they were.
Session Duration
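Duration of the IRC connection, i.e., the difference between the time of the last message and the time of the first message in the IRC connection. In the sample log above, the duration value equals end_time minus start_time, so it is expressed in seconds.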
Number of Messages
The total number of messages in a given IRC connection.
Number of Source Ports
As mentioned before, the source port is ignored when unifying communication into IRC connections because it is chosen randomly when a TCP connection is established. We suppose that artificial users use a higher number of source ports than real users, since the number of connections of the artificial users was higher than the number of connections of the real users.
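The four simple features above can be sketched as plain aggregations over the messages of one IRC connection. The example below is an illustration under assumptions, not the package's implementation: the function name and the message layout (ts, size, src_port) are hypothetical, the size is taken per message rather than per packet, and the output keys reuse the field names from the log.

```python
def basic_features(messages):
    """Aggregate size, duration, message count, and source-port count."""
    if not messages:
        return {"size_total": 0, "duration": 0.0,
                "msg_count": 0, "src_ports_count": 0}
    times = [m["ts"] for m in messages]
    return {
        "size_total": sum(m["size"] for m in messages),
        # Same unit as the timestamps (seconds in this example).
        "duration": max(times) - min(times),
        "msg_count": len(messages),
        # Distinct source ports seen within the IRC connection.
        "src_ports_count": len({m["src_port"] for m in messages}),
    }


# Hypothetical messages from one IRC connection.
messages = [
    {"ts": 100.0, "size": 60, "src_port": 49152},
    {"ts": 130.0, "size": 80, "src_port": 49152},
    {"ts": 190.0, "size": 40, "src_port": 49153},
]

features = basic_features(messages)
print(features)
```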
Message Periodicity
We suppose that artificial users (e.g., bots that are controlled by a botnet master) use IRC to send commands periodically, so we wanted to capture that behaviour. To do that, we created a method that returns a number between 0 and 1: one if the message sequence is perfectly periodic, zero if the message sequence is not periodic at all.
To compute message periodicity, we first compute the time differences between consecutive messages. We then apply a fast Fourier transform (FFT) to this sequence of numbers. The output of the FFT is again a sequence of numbers. The higher the number at a given position of the output, the bigger the amplitude at that position, and thus the more significant its influence on the periodicity of the data. The position of the largest element in the FFT output determines the length of the most significant period.
To compute the quality of the most significant period, we split the data into parts of that period's length. Then we compute the normalised mean squared error (NMSE), which gives us a number in the interval between 0 and 1, where 1 represents perfectly periodic messages and 0 represents messages that are not periodic at all.
The described process of extracting message periodicity feature is illustrated in Figure 2.
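The steps above can be sketched in Python as follows. This is one possible reading of the description, not the package's implementation, and several details are assumptions: a naive discrete Fourier transform stands in for the FFT to keep the example dependency-free, the components k=0 and k=1 are skipped so that at least two full periods are available, each part is compared against the element-wise mean of all parts, and the error is normalised by the total energy of the time differences. The function names are hypothetical.

```python
import cmath
import math


def dominant_period(diffs):
    """Length (in samples) of the strongest DFT component of `diffs`."""
    n = len(diffs)
    mean = sum(diffs) / n
    centered = [d - mean for d in diffs]
    best_k, best_mag = 0, 0.0
    # Skip k=0 (the mean) and k=1 (one period spanning the whole sequence).
    for k in range(2, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    scale = sum(abs(x) for x in centered)
    if best_k == 0 or best_mag <= 1e-9 * scale:
        return 1  # (almost) constant gaps: every sample repeats
    return n // best_k


def periodicity(timestamps):
    """Score in [0, 1]; 1 means the gaps repeat exactly with the found period."""
    diffs = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(diffs) < 4:
        return 0.0  # too few messages to judge (assumption)
    period = dominant_period(diffs)
    n_parts = len(diffs) // period
    parts = [diffs[i * period:(i + 1) * period] for i in range(n_parts)]
    energy = sum(x * x for part in parts for x in part)
    if energy == 0:
        return 1.0  # all gaps are zero, i.e. constant
    # Element-wise mean of the parts serves as the reference period.
    template = [sum(p[j] for p in parts) / n_parts for j in range(period)]
    error = sum((p[j] - template[j]) ** 2
                for p in parts for j in range(period))
    return min(1.0, max(0.0, 1.0 - error / energy))


regular = periodicity([0, 10, 20, 30, 40, 50])        # constant 10 s gaps
alternating = periodicity([0, 1, 4, 5, 8, 9, 12, 13, 16])  # gaps 1, 3, 1, 3, ...
irregular = periodicity([0, 1, 5, 6, 14, 15, 16, 30, 31])
print(regular, alternating, irregular)
```

Other normalisations of the squared error are possible, and they change how strongly jitter in the message timing lowers the score.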
Message Word Entropy
To capture whether the user sends the same message multiple times in a row, or whether the messages contain only a limited number of words, we compute the word entropy across all of the messages in the IRC connection. By word entropy we mean a measure of the uncertainty of the words in the messages. For the computation of the word entropy, we use the formula below:
$$H = -\sum_{i=1}^{n} p_i \log p_i$$
where $n$ represents the number of words, and $p_i$ represents the probability that the word $i$ will be used among all other words.
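A minimal Python sketch of the word entropy computation is shown below. It assumes that words are separated by whitespace, that the probability of a word is its relative frequency across all messages of the IRC connection, and that the logarithm is base 2; none of these details are stated in the text, and the function name is hypothetical.

```python
import math
from collections import Counter


def word_entropy(messages):
    """Shannon entropy of the word distribution across all messages."""
    words = [w for msg in messages for w in msg.split()]
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    # p = relative frequency of each distinct word (assumption).
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


repeated = word_entropy(["ping ping ping"])       # a single distinct word
two_words = word_entropy(["join now", "join now"])  # two equally likely words
print(repeated, two_words)
```

A connection that keeps repeating one word has zero entropy, while a richer vocabulary gives a higher value.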
Username Special Characters Mean
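We want to determine whether the username of the user in the IRC communication is randomly generated or not. Therefore, in this feature, we compute the average usage of non-alphabetic characters in the username. For example, the username T!T@null in the sample log above contains two non-alphabetic characters out of eight, which gives 0.25.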
Message Special Characters Mean
If the artificial user sends many commands, its messages will most likely contain many more special characters than the messages of an ordinary user. With this feature, we obtain the average usage of non-alphabetic characters across all messages in the IRC connection. We apply the same procedure of matching special characters to each message as in the previous case: we match non-alphabetic characters with a regex, and then we divide the number of matched characters by the total number of characters in the message. Finally, we compute the average of the values obtained for all messages.
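Both special-character features can be sketched in Python as follows. The regex [^A-Za-z] is an assumption about what counts as a non-alphabetic character (under it, digits and spaces are counted as well), and the function names are hypothetical. With this assumption, the username from the sample log reproduces the logged spec_chars_username_mean value of 0.25.

```python
import re

# Assumption: anything outside A-Z and a-z counts as a special character.
NON_ALPHA = re.compile(r"[^A-Za-z]")


def special_char_ratio(text):
    """Share of non-alphabetic characters in a single string."""
    if not text:
        return 0.0
    return len(NON_ALPHA.findall(text)) / len(text)


def special_char_mean(messages):
    """Average of the per-message ratios across an IRC connection."""
    if not messages:
        return 0.0
    return sum(special_char_ratio(m) for m in messages) / len(messages)


username_ratio = special_char_ratio("T!T@null")
message_mean = special_char_mean(["ab12", "abcd"])
print(username_ratio, message_mean)
```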
Download
Link to the package: https://github.com/stratosphereips/IRC-Zeek-package/
References
[1] Zeek Network Security Monitor: https://www.zeek.org
[2] Zeek Package Manager: https://packages.zeek.org
[3] Aposemat Project: https://www.stratosphereips.org/aposemat