Honeypots are essential tools in cyber security. However, most of them lack the realism required to engage and fool human attackers long-term. This limitation makes them easily discernible, hindering their effectiveness. This can happen because they are too deterministic, lack adaptability, or lack depth. Researchers have tried to make honeypots more realistic by implementing additional functionality or by applying various machine learning and natural language processing techniques. However, the majority of high-interaction software honeypots available today require substantial setup time and are still detectable by attackers because they cannot simulate many commands or services. This thesis proposes a novel method for creating realistic and dynamic software honeypots using Large Language Models (LLMs). Our tool, called VelLMes, can simulate various services such as SSH, MySQL, POP3, and HTTP. Each of these services can be used as a honeypot, accessed via a shell, and its content and responses are dynamically generated based on the conversation history. All the services are implemented using cloud-based models. Preliminary results show that LLMs can help solve issues of standard honeypots, such as deterministic responses and a lack of adaptability. The most complex of the implemented services is the SSH Linux honeypot, which we call shelLM. It can mimic a Linux shell, allowing users complete freedom of interaction. The final part of our proposal is an LLM attacker tool: LLM-based software that can interact with a Linux shell and with shelLM, learn about the system, find vulnerabilities, and report its findings. We evaluated the behavior of VelLMes and shelLM in three experiments. The first experiment evaluated whether shelLM can generate the output expected from a Linux shell.
The evaluation was done by asking cyber security professionals to use shelLM and indicate whether each answer from the honeypot was the one expected from a Linux shell. The results indicate that shelLM can create credible and dynamic answers, addressing the limitations of current honeypots. shelLM reached a True Negative Rate (TNR) of 0.9, convincing humans that it was consistent with a real Linux shell. The second experiment evaluated whether VelLMes can generate realistic output for all the simulated services. This experiment included automated testing: we designed and implemented unit tests for LLMs. For each service in VelLMes, we ran the tests comparing the behavior of various cloud models. In the case of shelLM, we also compared the performance of cloud and local models. The experiment results indicate that LLMs are good at behaving like various shell-accessible services. The results also show that the cloud models outperform the local models we used for comparison. The final experiment evaluated shelLM's deception abilities and compared it to the popular honeypot Cowrie. The evaluation was done by asking 34 human participants to interact with a randomly assigned system and later decide whether it was a honeypot. The participants interacted with either a regular Ubuntu system, Cowrie, or shelLM. The results indicate that shelLM is better at deception than Cowrie, managing to fool almost 45% of the participants who interacted with it.
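The abstract notes that responses are "dynamically generated based on the conversation history." A minimal sketch of that pattern is shown below; this is an illustration of the general technique, not the thesis's actual implementation. The system prompt text is a hypothetical placeholder, and `query_llm` is a stub standing in for a cloud chat-model API call (a real deployment would send the same `messages` list to a hosted model).

```python
# Sketch of a history-conditioned honeypot loop (an illustration,
# not the actual VelLMes code). query_llm() stands in for a cloud
# chat-model call; here it is stubbed with canned answers.

SYSTEM_PROMPT = (
    "You are a Linux terminal on a server named 'web01'. "
    "Reply only with the terminal output for each command."
)

def query_llm(messages):
    """Stub for a chat-completion call; a real client would receive
    the same `messages` list (system/user/assistant turns)."""
    canned = {"whoami": "root", "pwd": "/root"}
    last_command = messages[-1]["content"]
    return canned.get(last_command, "")

def shell_session(commands):
    # The full conversation history is resent on every turn, so the
    # model's next answer can stay consistent with earlier output
    # (files it "created", the working directory, and so on).
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    outputs = []
    for cmd in commands:
        messages.append({"role": "user", "content": cmd})
        reply = query_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        outputs.append(reply)
    return outputs

print(shell_session(["whoami", "pwd"]))  # ['root', '/root']
```

Keeping the assistant's own replies in the history is what lets an LLM-backed honeypot avoid the deterministic, stateless answers the abstract criticizes in standard honeypots.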
URL: https://dspace.cvut.cz/handle/10467/115799?locale-attribute=en
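The "unit tests for LLMs" mentioned in the abstract might look, in minimal form, like the sketch below. This is an illustration of the idea, not the thesis's actual test suite: because LLM output is non-deterministic, such a test checks the *shape* of a response (plausible structure, plausible values) rather than an exact string, and the `fake_mysql_honeypot` stub stands in for a live LLM-simulated service.

```python
import re

# Illustrative unit test for an LLM-simulated MySQL service
# (an assumption about the approach, not the thesis's actual tests).

def fake_mysql_honeypot(command):
    """Stub standing in for an LLM-generated MySQL response."""
    if command == "SELECT VERSION();":
        return (
            "+-----------+\n"
            "| VERSION() |\n"
            "+-----------+\n"
            "| 8.0.33    |\n"
            "+-----------+"
        )
    return ""

def test_version_reply_looks_like_mysql():
    reply = fake_mysql_honeypot("SELECT VERSION();")
    # A credible reply should contain an ASCII table border and a
    # dotted version number, whatever the exact version may be.
    assert "+--" in reply
    assert re.search(r"\d+\.\d+(\.\d+)?", reply)

test_version_reply_looks_like_mysql()
print("ok")
```

Running such structural checks against several backend models is one way to compare, as the second experiment does, how well different cloud and local LLMs imitate each simulated service.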