Core Infrastructure and Security Blog

Sentinel Notebook: Guided Hunting - Domain Generation Algorithm (DGA) Detection

jaredgraff
Microsoft
May 12, 2025

Overview

This notebook, titled “Guided Hunting - Domain Generation Algorithm (DGA) Detection”, provides a framework for investigating anomalous network activity by identifying algorithmically generated domains, which malware often uses to evade detection. It pulls data from the DeviceNetworkEvents table in Log Analytics and uses Python libraries such as “msticpy”, “pandas”, and “scikit-learn” for analysis and visualization. DGA detection helps identify and mitigate threats such as botnets and other malware that rely on dynamically generated domains for command-and-control communication, which makes it a key part of proactive threat hunting and network defense.

Link:

https://github.com/GonePhishing402/SentinelNotebooks/blob/main/DGA_Detection_ManagedIdentity.ipynb

What Is a Domain Generation Algorithm and How Do You Detect It?

A Domain Generation Algorithm (DGA) is a technique used by malware to create numerous domain names for communicating with Command and Control (C2) servers, ensuring continued operation even if some domains are blocked. DGAs evade static signature detection by dynamically generating unpredictable domain names, making it hard for traditional security methods to identify and blacklist them. Machine learning models can effectively detect DGAs by analyzing patterns and features in domain names, leveraging techniques like deep learning to adapt to new variants and identify anomalies that static methods may miss.
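
To make the idea concrete, here is a toy sketch of how a date-seeded DGA could derive its domains. It is purely illustrative: the function name, seed, hash choice, label length, and TLD are all assumptions and do not reproduce the scheme of any real malware family.

```python
import hashlib
from datetime import date


def toy_dga(seed: str, day: date, count: int = 5, length: int = 12, tld: str = ".com") -> list[str]:
    """Derive `count` pseudo-random domain names from a shared seed and a date.

    Illustrative only: seed, hash, label length, and TLD are assumptions.
    """
    domains = []
    for index in range(count):
        material = f"{seed}-{day.isoformat()}-{index}".encode()
        digest = hashlib.sha256(material).digest()
        # Map each digest byte onto a lowercase letter to build the label.
        label = "".join(chr(ord("a") + byte % 26) for byte in digest[:length])
        domains.append(label + tld)
    return domains


if __name__ == "__main__":
    for name in toy_dga("example-seed", date(2025, 5, 12)):
        print(name)
```

Because the malware and its operator can both run the same function with the same seed and date, they agree on the day's domains without ever exchanging them. That is why blocking individual domains does not keep up, and why the resulting labels tend to look random compared with human-chosen names.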

How to Run the Notebook

Log in with Managed Identity

This notebook requires you to authenticate with a managed identity. You can create the managed identity in the Azure portal; it must be assigned the following RBAC roles:

- Sentinel Contributor
- Log Analytics Contributor
- AzureML Data Scientist
- AzureML Compute Operator

Replace [CLIENT_ID] with the client ID of your managed identity. You can find it in the Azure portal under Managed Identities -> select the identity -> Overview. Note: The notebook also works if you authenticate as an Azure user with the CLI method instead.

Import Libraries

This code block imports the necessary libraries and assigns a ManagedIdentityCredential() instance to the “credential” variable.
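
As a minimal sketch of the credential step (not the notebook's exact code), the helper below returns a managed identity credential when a client ID is supplied and falls back to the Azure CLI login otherwise. The helper names `is_valid_client_id` and `build_credential` are hypothetical; `ManagedIdentityCredential` and `AzureCliCredential` come from the `azure-identity` package, which is assumed to be installed.

```python
import re
from typing import Optional

# Managed identity client IDs are GUIDs; this catches an unreplaced placeholder.
_GUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def is_valid_client_id(value: str) -> bool:
    """Return True when the value looks like a GUID rather than '[CLIENT_ID]'."""
    return bool(_GUID.match(value))


def build_credential(client_id: Optional[str] = None):
    """Return a managed identity credential, or an Azure CLI credential if no ID is given."""
    if client_id is not None and not is_valid_client_id(client_id):
        raise ValueError("Replace [CLIENT_ID] with your managed identity's client ID.")

    # Imported here so the validation above works even without azure-identity.
    from azure.identity import AzureCliCredential, ManagedIdentityCredential

    if client_id is None:
        # Assumes you have already signed in with `az login`.
        return AzureCliCredential()
    return ManagedIdentityCredential(client_id=client_id)
```

Calling `build_credential("<your-client-id>")` on the compute instance would give you the managed identity credential, while `build_credential()` covers the CLI route mentioned in the note above.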

Set Up msticpyconfig.yaml

This section loads the msticpyconfig.yaml file that is used later in the notebook. Make sure the file is set up and located in your current working directory before you run the notebook.

Set Up QueryProvider

The query provider is set up for Azure Sentinel. You do not need to change this unless you want to use a different msticpy query provider.

Connect to Sentinel

This code block uses the managed identity to connect to the Sentinel workspace specified in your msticpyconfig.yaml. You should see “connected” after running it.

DGA Model Creation

This code block uses CountVectorizer() and MultinomialNB() to create a model and saves it as dga_model.joblib at the path specified in the “model_filename” variable. Change this path to match your environment. The algorithm needs labeled data to learn from in order to be effective. Download domain.csv from the following location and upload it to your current working directory in the Azure Machine Learning workspace:
DGA_Detection/data/domain.csv at master · hmaccelerate/DGA_Detection

You must also change line 10 of this code block so that “labeled_domains_df” points to the domain.csv in your environment. Once you run the code block, you should see that the model was saved, along with the model accuracy. The accuracy will vary depending on the training data you provide.
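
The following is a rough sketch of how a CountVectorizer() plus MultinomialNB() model can be trained and saved with joblib. It is not the notebook's code: the function name `train_dga_model`, the character n-gram settings, the 80/20 split, and the 0/1 label encoding are assumptions. Loading domain.csv is left out because the post does not describe the file's column layout, so the function takes plain lists of domains and labels.

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def train_dga_model(domains, labels, model_filename="dga_model.joblib"):
    """Fit a character n-gram Naive Bayes classifier, save it, return (model, accuracy)."""
    # Hold out 20% of the labeled domains to estimate accuracy (assumed split).
    x_train, x_test, y_train, y_test = train_test_split(
        domains, labels, test_size=0.2, random_state=42, stratify=labels
    )
    model = make_pipeline(
        # Count character 2-grams and 3-grams; random-looking labels
        # produce different n-gram statistics than human-chosen names.
        CountVectorizer(analyzer="char", ngram_range=(2, 3)),
        MultinomialNB(),
    )
    model.fit(x_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(x_test))
    # Persist vectorizer and classifier together so scoring needs one file.
    joblib.dump(model, model_filename)
    return model, accuracy
```

Wrapping both steps in a single pipeline means the saved file carries the fitted vocabulary along with the classifier, so a later `joblib.load(model_filename)` can call `predict` directly on raw domain strings.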

Apply dga_model.joblib to Sentinel Data

This code block takes the model generated in the previous block and runs it against the data specified in the “query” variable. The query uses domain names from the “DeviceNetworkEvents” table, which holds Microsoft Defender for Endpoint (MDE) events. “parse_json” is used in the KQL to extract the sub-field needed for this search. When the model is run against the data, it tries to determine whether any domains in your environment are associated with domain generation algorithms (DGA). If the “IsDGA” column contains “True”, the model has determined that the characteristics of that domain match a DGA. Here is what the output will look like:
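
As a rough sketch of the scoring step this section describes, the function below adds an “IsDGA” column to a DataFrame of query results. The function name `flag_dga`, the `Domain` column name, and the assumption that the model labels DGA domains as 1 are all illustrative; adjust them to whatever your query and training data actually produce.

```python
import pandas as pd


def flag_dga(events: pd.DataFrame, model, domain_col: str = "Domain",
             positive_label=1) -> pd.DataFrame:
    """Return de-duplicated domains with a boolean IsDGA column from model.predict."""
    scored = events.dropna(subset=[domain_col]).copy()
    # Normalize case and whitespace so the same domain is only scored once.
    scored[domain_col] = scored[domain_col].astype(str).str.strip().str.lower()
    scored = scored.drop_duplicates(subset=[domain_col])
    if scored.empty:
        scored["IsDGA"] = pd.Series(dtype=bool)
        return scored
    predictions = model.predict(scored[domain_col].tolist())
    # Assumes the model emits `positive_label` for DGA-like domains.
    scored["IsDGA"] = pd.Series(predictions, index=scored.index) == positive_label
    return scored
```

Any object with a `predict` method works here, for example a pipeline loaded with `joblib.load("dga_model.joblib")`.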

 

Output All Results to CSV

This code block will output all the results above to a CSV called “dgaresults.csv”. Change the “output_path” variable to match your environment.

Filter DGA Results to CSV

This code block will output just the DGA results above to a CSV called “dgaresults2.csv”. Change the “output_path” variable to match your environment.
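
Both export steps can be sketched in a few lines of pandas. The file names match the ones above; the function name `export_results`, the `output_dir` parameter, and the assumption that the results live in a DataFrame with a boolean “IsDGA” column are illustrative rather than taken from the notebook.

```python
from pathlib import Path

import pandas as pd


def export_results(scored: pd.DataFrame, output_dir: str) -> tuple[Path, Path]:
    """Write all results and the DGA-only subset to CSV; return both paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    all_path = out / "dgaresults.csv"
    dga_path = out / "dgaresults2.csv"
    # index=False keeps the pandas row index out of the files.
    scored.to_csv(all_path, index=False)
    # Boolean mask keeps only rows the model flagged as DGA.
    scored[scored["IsDGA"]].to_csv(dga_path, index=False)
    return all_path, dga_path
```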

How to Investigate these Results Further

You can take the domains flagged as DGA and look up the corresponding IP addresses to see whether they match any threat intelligence. Correlate the findings with other security logs, such as firewall or endpoint data, to uncover patterns of malicious behavior. This approach helps pinpoint potential threats and enables proactive mitigation. You can also create Logic Apps to automate follow-on analysis of these notebook results. This will be covered in a later blog.
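
One way to start that correlation is sketched below, assuming your results include a `RemoteIP` column next to the “IsDGA” flag and that you have a set of known-bad IP addresses from a threat intelligence source. The function name `match_threat_intel` and the `InThreatIntel` column are hypothetical.

```python
import pandas as pd


def match_threat_intel(scored: pd.DataFrame, ti_ips: set[str],
                       ip_col: str = "RemoteIP") -> pd.DataFrame:
    """Return DGA-flagged rows, marking those whose IP appears in the TI set."""
    hits = scored[scored["IsDGA"]].copy()
    hits["InThreatIntel"] = hits[ip_col].isin(list(ti_ips))
    # Put confirmed threat intelligence matches at the top for triage.
    return hits.sort_values("InThreatIntel", ascending=False)
```

Rows that are both flagged by the model and present in threat intelligence are the strongest candidates for follow-up, while flagged rows without a match still deserve a look because the model can surface domains no feed has listed yet.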

Updated May 12, 2025
Version 1.0