A domain generation algorithm (DGA) is an automation technique that cyber attackers use in a variety of attacks, such as data exfiltration, command and control, and DNS tunnelling, to make those attacks harder for a company’s defenses to detect.
Domain Generation Algorithm (DGA)
Threat actors are always seeking ways to evade a company’s defenses. The more advanced their methods, the more successful they are at evading security controls that rely on static methods. It is critical for a business to detect such attacks in the early stages of the attack life cycle to reduce the impact and the cost of recovery.
This post explains what a DGA is, how it works, what makes it difficult to detect, and an interesting way to detect it using machine learning.
A brief description of DGA and how it works
A domain generation algorithm (DGA) is a program that provides new domains to malware on demand. DGAs produce a list of domains that malware clients use to communicate with a sequence of command and control (C&C) sites. If one of the dynamically generated domains is detected and blocked by IT security, the malware client and the C&C server switch to the next one on the list to evade defenses. Before DGAs, most malicious programs used hardcoded lists of IP addresses or domains.
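The behaviour described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the seed string, the use of the current date, the SHA-256 hash, and the function names generate_domains and first_reachable are all hypothetical and are not taken from any specific malware family. The point is that both sides can derive the same list independently and fall back to the next entry when one is blocked.

```python
import hashlib
from datetime import date
from typing import List, Optional, Set


def generate_domains(seed: str, day: date, count: int = 5, tld: str = ".com") -> List[str]:
    """Derive `count` pseudo-random domains from a shared seed and a date."""
    domains = []
    for index in range(count):
        # Client and server hash the same inputs, so they derive the same names.
        material = f"{seed}-{day.isoformat()}-{index}".encode("utf-8")
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex characters onto the letters a-p to form a label.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:12])
        domains.append(label + tld)
    return domains


def first_reachable(candidates: List[str], blocked: Set[str]) -> Optional[str]:
    """Return the first candidate that is not on the block list, if any."""
    for domain in candidates:
        if domain not in blocked:
            return domain
    return None


if __name__ == "__main__":
    todays_list = generate_domains("demo-seed", date.today())
    # Pretend the defenders have already blocked the first generated domain.
    print(first_reachable(todays_list, blocked={todays_list[0]}))
```

Because the list changes whenever the date (or any other shared input) changes, a static block list built from yesterday's domains does not help against today's.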
What makes it difficult to detect
A DGA continuously changes the domains used for communication in order to evade detection by security vendors that rely on traditional domain blacklisting.
Normally, DGA domains look like random combinations of characters, which can be easily identified. But a DGA can also use dictionaries to generate domains: it takes words from a dictionary and concatenates them in different combinations to produce multiple domains. Dictionary-generated domains are extremely difficult to detect because they look very much like legitimate ones. Some example patterns are given below.
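To see why dictionary-based names are harder to spot, here is a minimal sketch that builds such names and computes the Shannon entropy of a string, a simple statistic often used to flag random-looking labels. The word list, the function names, and the choice of entropy as the feature are assumptions made for illustration only.

```python
import math
import random
from collections import Counter
from typing import Sequence

# Hypothetical word list; a real dictionary DGA would ship a much larger one.
WORDS = ["cloud", "market", "silver", "garden", "travel", "update", "secure", "river"]


def dictionary_domain(rng: random.Random, words: Sequence[str] = WORDS,
                      parts: int = 2, tld: str = ".net") -> str:
    """Concatenate `parts` distinct dictionary words into a domain name."""
    return "".join(rng.sample(list(words), parts)) + tld


def shannon_entropy(text: str) -> float:
    """Entropy in bits per character; higher means more uniform character use."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    rng = random.Random(42)
    random_label = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(12))
    word_label = dictionary_domain(rng)[:-4]  # strip the ".net" suffix
    print(random_label, round(shannon_entropy(random_label), 2))
    print(word_label, round(shannon_entropy(word_label), 2))
```

Concatenated words follow the letter patterns of natural language, so a simple per-character statistic like this tends to separate them from legitimate names far less cleanly than it separates random strings.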
DGA domain Detection
How can domains generated by a DGA be detected? There are two approaches: the conventional reverse engineering-based approach and the machine learning approach. The first requires a lot of statistical data and dedicated man-hours, and at times it is very difficult to reach the goal.
In the machine learning approach, by contrast, the domain name itself is used to identify the pattern. The machine learning algorithm is trained on a set of benign and malicious data to detect the underlying pattern.
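Since the domain name itself is the input, a common first step is to turn each name into a fixed-length sequence of integers that a model can consume. The sketch below shows one way to do this. The alphabet, the maximum length of 32, the left-padding choice, and the two sample rows are assumptions for illustration and do not describe the actual preprocessing used here.

```python
import string
from typing import List

# Characters allowed in the sketch; id 0 is reserved for padding.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNKNOWN_ID = len(ALPHABET) + 1  # any character outside ALPHABET
MAX_LEN = 32  # hypothetical fixed sequence length


def encode_domain(domain: str, max_len: int = MAX_LEN) -> List[int]:
    """Map a domain to a left-padded list of character ids of length max_len."""
    ids = [CHAR_TO_ID.get(ch, UNKNOWN_ID) for ch in domain.lower()[:max_len]]
    return [0] * (max_len - len(ids)) + ids


if __name__ == "__main__":
    # Made-up rows for illustration only: (domain, label), 1 = DGA, 0 = benign.
    samples = [("example.com", 0), ("xkqzjvbwpltr.com", 1)]
    for domain, label in samples:
        print(label, encode_domain(domain))
```

A classifier trained on pairs of encoded sequences and labels can then learn character patterns directly, without hand-written rules per malware family.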
The deep learning approach to detect DGA domains
Our team of machine learning engineers and data scientists at HAWKEYE has worked on the daunting challenge of detecting DGA-based attacks and developed an algorithm based on a deep learning approach. Deep learning is a subset of machine learning built on neural network-based algorithms. The algorithm used here is a recurrent neural network called LSTM (Long Short-Term Memory).
The LSTM-based algorithm is trained on close to 2 million DGA and benign domain names collected from various sources, covering a wide variety of common DGA patterns used by threat actors.
In the above sample dataset, 1 is a malicious (DGA) domain and 0 is a legitimate domain.
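For readers unfamiliar with LSTMs, the following is a minimal pure-Python sketch of how an LSTM reads a domain one character at a time and produces a score between 0 and 1. It is not the HAWKEYE model: the class name TinyLSTM, the layer sizes, and the random, untrained weights are all assumptions, so the scores it produces carry no meaning. It only shows the gate mechanics.

```python
import math
import random
from typing import Callable, List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Single-layer LSTM with randomly initialised (untrained) weights."""

    def __init__(self, vocab_size: int, hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.vocab_size = vocab_size
        self.hidden = hidden
        # Each gate row holds: [one-hot input weights | hidden weights | bias].
        width = vocab_size + hidden + 1
        self.gates = {
            name: [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(hidden)]
            for name in ("input", "forget", "output", "candidate")
        }
        # Output layer: one weight per hidden unit, followed by a bias.
        self.out = [rng.uniform(-0.1, 0.1) for _ in range(hidden + 1)]

    def _gate(self, name: str, char_id: int, h: Sequence[float],
              activation: Callable[[float], float]) -> List[float]:
        values = []
        for row in self.gates[name]:
            # With a one-hot input, only the weight at index char_id contributes.
            total = row[char_id] + row[-1]
            total += sum(w * v for w, v in zip(row[self.vocab_size:-1], h))
            values.append(activation(total))
        return values

    def score(self, char_ids: Sequence[int]) -> float:
        """Return a value in (0, 1) after reading the whole sequence."""
        h = [0.0] * self.hidden  # hidden state
        c = [0.0] * self.hidden  # cell state (the long-term memory)
        for char_id in char_ids:
            i = self._gate("input", char_id, h, sigmoid)
            f = self._gate("forget", char_id, h, sigmoid)
            o = self._gate("output", char_id, h, sigmoid)
            g = self._gate("candidate", char_id, h, math.tanh)
            # Forget part of the old memory, then add the gated new candidate.
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        logit = self.out[-1] + sum(w * v for w, v in zip(self.out, h))
        return sigmoid(logit)


if __name__ == "__main__":
    model = TinyLSTM(vocab_size=40, hidden=8)
    print(model.score([24, 11, 17, 26, 10]))  # arbitrary character ids
```

In practice the weights would be learned from labelled domain names with a deep learning framework rather than written by hand; the cell state is what lets the network carry information about earlier characters across the whole name.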
DGA Detection Algorithm Architecture
This was a quick overview of DGAs and the machine learning approach to detecting DGA domains through pattern recognition. Various techniques have been proposed in the past for capturing and detecting malware-generated domain names based on a variety of static and dynamic features. As part of the machine learning approach proposed here, the framework is designed to detect and predict all types of algorithmically generated domain names. We developed a recurrent neural network-based LSTM algorithm that detects DGA domains with a 98% accuracy rate. Organizations can use this method to detect DGA-based C&C attacks at a very early stage to minimize the impact.
The algorithm deployed on the HAWKEYE XDR stack has given an unprecedented boost to our SOC team’s detection and response capabilities for a variety of attack techniques where DGA is used.