How Internet Telemetry Data Becomes Threat Intelligence

Microsoft

Oct 25, 2022

Overview

It is often difficult to determine whether a security alert identified truly malicious activity without the ability to conduct additional research into the entities associated with the alert. Entities can include IP addresses, domain names, hostnames, URLs, file names or hashes, and more. Analysts often have to turn to outside sources to gather the context on these entities that they need to appropriately triage the activity that has been identified.

Microsoft Defender Threat Intelligence (Defender TI) is built on top of over a decade's worth of data collection against Internet datasets. The technologies in place enable data collection, processing, and storage at a scale unmatched by most in the industry. The ability to search across and pivot through datasets is improved on an ongoing basis, as is the ability for analysts to collaborate across research and investigations. This module provides an overview of the primary methods by which Internet data is collected.

Figure 1 – Defender TI is all about defense and protection. It graphs the entire Internet to empower those functions.

Figure 2 – How does Defender TI work to graph the Internet?

Figure 3 – How does Defender TI collect, process, and provide raw and finished threat intelligence?

Figure 4 – More on graphing the Internet

Figure 5 – Observing the Internet through the eyes of an attacker

Passive DNS Sensor Network

Our worldwide passive DNS (PDNS) sensor network ingests and stores hundreds of millions of unique records daily. This data gives analysts insight into how a particular domain name, host, or IP address changes over time, enabling the identification of other related domains, hosts, and IP addresses. When researching a suspicious or malicious event, PDNS data can provide context on threat actor infrastructure and potentially surface additional domains and IP addresses for threat detection and blocking. A deeper look at this dataset is saved for the next module.
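
The kind of pivot described above can be illustrated with a small sketch. It is not Defender TI's implementation or data model: the Resolution record, its field names, the sample records, and the related_hosts helper are all assumptions made for illustration, and the hostnames and addresses come from ranges reserved for documentation. Starting from one hostname, it finds other hostnames that resolved to the same IP address during an overlapping time window.

```python
from collections import defaultdict
from datetime import date
from typing import NamedTuple


class Resolution(NamedTuple):
    """One hypothetical passive DNS observation: a hostname resolved to an IP."""
    hostname: str
    ip: str
    first_seen: date
    last_seen: date


# Made-up sample records (reserved documentation names and addresses).
RECORDS = [
    Resolution("login.example.com", "203.0.113.10", date(2022, 1, 5), date(2022, 3, 1)),
    Resolution("login.example.com", "198.51.100.7", date(2022, 3, 2), date(2022, 6, 30)),
    Resolution("cdn.example.net", "203.0.113.10", date(2022, 2, 1), date(2022, 2, 20)),
    Resolution("mail.example.org", "198.51.100.7", date(2021, 1, 1), date(2021, 2, 1)),
]


def related_hosts(records, hostname):
    """Return other hostnames that shared an IP with `hostname` in overlapping windows."""
    # Index every observation by the IP address it resolved to.
    by_ip = defaultdict(list)
    for rec in records:
        by_ip[rec.ip].append(rec)

    related = set()
    for seed in (r for r in records if r.hostname == hostname):
        for other in by_ip[seed.ip]:
            # Two date ranges overlap when each starts before the other ends.
            overlaps = (other.first_seen <= seed.last_seen
                        and seed.first_seen <= other.last_seen)
            if other.hostname != hostname and overlaps:
                related.add(other.hostname)
    return sorted(related)


print(related_hosts(RECORDS, "login.example.com"))
```

In this sample, cdn.example.net is reported because it shared 203.0.113.10 with the seed hostname in February 2022, while mail.example.org is not, because it used 198.51.100.7 a year before the seed did. Requiring overlapping windows is a design choice that reduces false links on reused hosting addresses.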

Web Crawling with Virtual Users

Our proprietary crawlers act as "virtual users" that simulate human-web interactions and capture the full composition of internet assets, with no agent required. Simulating human-web interaction is a systematic way to gather internet intelligence because it exposes cause and effect: what an asset does in response to a visitor's actions. By interacting with digital and Internet assets, our virtual users can extract the attributes that make up the asset's behavior, including its edge (relational) behaviors. Microsoft performs 2 billion HTTP requests per day to crawl web and mobile pages across the Internet.

A web crawl is reasonably straightforward to understand. Our web crawlers digest pages much as you do when you browse online, only faster, automatically, and while storing the entire chain of events. When web crawlers process web pages, they observe each page's document object model (DOM) and take note of links, images, dependent content, components, trackers, and other details to construct a sequence of events and relationships. Web crawls are driven by an extensive set of configuration parameters that can dictate an exact URL starting point or something more complex, like a search engine query.

Once most web crawls have a starting point, they perform the initial crawl, take note of all the links within the page, and then crawl those follow-on pages, repeating the same process. Most web crawls have a depth limit that stops them after 25 or so links beyond the initial starting point, to avoid crawling forever. Our configuration allows a number of parameters to be set that dictate how the web crawl operates.
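
The crawl loop described above (start page, extract links, follow them, stop at a depth limit) can be sketched as a breadth-first traversal. This is a minimal illustration, not the Defender TI crawler: the crawl function, the max_depth parameter, and the in-memory PAGES dictionary that stands in for real HTTP fetches are all assumptions. A real virtual user would also render the DOM and record images, trackers, and dependent content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl; returns (source, target) link pairs.

    `fetch` is any callable mapping a URL to HTML text (or None).
    Pages more than `max_depth` links from the start are never fetched.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    edges = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            edges.append((url, target))
            if target not in seen and depth < max_depth:
                seen.add(target)
                queue.append((target, depth + 1))
    return edges


# Hypothetical pages standing in for the live web.
PAGES = {
    "https://example.com/": '<a href="/a">A</a><a href="https://example.net/b">B</a>',
    "https://example.com/a": '<a href="/c">C</a>',
    "https://example.net/b": '<a href="https://example.com/">home</a>',
    "https://example.com/c": '<a href="/d">D</a>',
}

for source, target in crawl("https://example.com/", PAGES.get, max_depth=1):
    print(source, "->", target)
```

With max_depth=1 the crawl records the link from /a to /c but never fetches /c, so the link from /c to /d is not observed. Raising the limit trades coverage for crawl time, which is why the depth is a configuration parameter rather than a constant.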

Figure 6 – How to blend in to find and map threat infrastructure

Figure 7 – Interacting like a user from a browser perspective

Global Proxy Network

Virtual users deploy from hundreds of rotating proxies worldwide to avoid detection, emanating from a combination of residential, commercial, and mobile egress points. Each virtual user is highly configurable and can emulate a wide range of specific human-like behaviors, such as scrolling and clicking. They also imitate popular browsers, devices, applications, and operating systems. For example, simulating a mobile phone browser from within the region that a campaign targets gives the crawlers a higher likelihood of observing the full exploitation chain.

A good example of why this is important can be found in a 2020 Akamai report describing a phishing scam targeting users in Brazil utilizing mobile devices. From the report, "When it comes to devices, the majority of the victims were mobile users running Android. Part of the reason for this is the websites, which were developed to only accept victims using mobile devices. A JavaScript agent, seen in Figure 6, checked the victim's User-Agent headers and redirected those who were not on a mobile device to Google News." A complete evaluation of the URLs associated with this campaign is possible only because our crawler configuration parameters can be customized.
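
To make the effect of that kind of gate concrete, the sketch below models it in Python. Everything in it is an assumption for illustration: the gate in the report was JavaScript running on the phishing site, whereas gated_landing_page is a simplified stand-in, and the CrawlProfile fields, the abbreviated User-Agent strings, and the result labels are hypothetical. It shows that two crawl profiles visiting the same page can observe entirely different content.

```python
import re
from dataclasses import dataclass

# Simplified mobile check; real gates may inspect many more signals.
MOBILE_UA = re.compile(r"Android|iPhone|Mobile", re.IGNORECASE)


@dataclass(frozen=True)
class CrawlProfile:
    """Hypothetical crawler configuration: who the virtual user pretends to be."""
    name: str
    user_agent: str
    region: str


def gated_landing_page(user_agent):
    """Stand-in for a phishing page that only serves mobile visitors."""
    if MOBILE_UA.search(user_agent):
        return "phishing-form"
    return "redirect:benign-news-site"


DESKTOP = CrawlProfile(
    name="desktop-us",
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    region="US",
)
MOBILE = CrawlProfile(
    name="android-br",
    user_agent="Mozilla/5.0 (Linux; Android 12; Pixel 6) AppleWebKit/537.36 Mobile Safari/537.36",
    region="BR",
)


def observe(profiles):
    """Visit the gated page once per profile and record what each one sees."""
    return {p.name: gated_landing_page(p.user_agent) for p in profiles}


print(observe([DESKTOP, MOBILE]))
```

Only the Android profile reaches the phishing form; the desktop profile is redirected and would record nothing malicious. Crawling the same URL with several profiles and comparing the results is one way to detect this kind of cloaking.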

Internet Scanning

Our systems conduct daily scans of select, well-known, registered, and dynamic ports across the entire IPv4 space to determine port state and collect data for detecting services. Scan data is relevant to a variety of use cases. From a network defense perspective, system administrators typically do port scanning to identify and shut down services that are not in use on hosts within an enterprise network. Network scans may also be conducted to create an inventory of machines and services available for things like asset tracking, network design, or policy compliance checks. From an attacker's perspective – or perhaps a Red Team's perspective – scanning is often the first step taken to identify systems available to compromise.

Keep in mind that Defender TI data only includes scans conducted from the Internet and therefore gives you only one piece of the puzzle when you are trying to determine what your organization's footprint looks like from the outside. Since no agents are deployed internally to a network, you should not expect to find internal network scan data within Defender TI. The easiest way to see what this data looks like is to search for an IP address in Defender TI, open the Data tab, and then look for the Services dataset. This will be covered in greater depth in a later module.
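
Determining port state, as described in this section, can be sketched with a plain TCP connect check. This is one common technique and an assumption here, since the source does not say which scanning method Defender TI uses. The port_state and scan helpers are hypothetical names, and the check cannot tell a closed port from one filtered by a firewall. Run it only against hosts you are authorized to scan.

```python
import socket


def port_state(host, port, timeout=1.0):
    """Attempt a TCP connection and report the port state.

    connect_ex returns 0 on success and an error code otherwise,
    so a refused or timed-out connection does not raise.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        if sock.connect_ex((host, port)) == 0:
            return "open"
        return "closed_or_filtered"


def scan(host, ports, timeout=1.0):
    """Check each port in turn and return a port -> state mapping."""
    return {port: port_state(host, port, timeout) for port in ports}


if __name__ == "__main__":
    # Example against the local machine only.
    for port, state in scan("127.0.0.1", [22, 80, 443]).items():
        print(port, state)
```

Service detection would be a further step on top of this, for example reading a banner or sending a protocol-specific probe once a port is found open.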

Third-Party Collection

Third-party collection covers public information about threat groups, networks, businesses, corporations, and other sources of relevant data. This data is freely available from Internet sources, except for the items marked with an asterisk (*) below. Sources include websites, blog posts, social media platforms, and other public-facing digital assets. Some of the data we collect:

PDNS A records
- Defender TI collects PDNS A records internally through its global PDNS sensor network. Various third parties also contribute A record resolutions.
Certificates
- Certificates are observed from Defender TI's port scans and virtual user crawls. In addition, Defender TI captures certificate transparency logs from Certstream.
*Reputation, *malware, and *phish feeds
- The rules behind Defender TI's Reputation feature consider data collected internally and externally, such as blocklist information captured through third-party threat sharing.
- Defender TI's malware and phish feed indicators are collected internally and through third-party sharing.
OSINT collection
- Defender TI crawls 60+ sources daily to vet which new intelligence articles are viable to republish to the Defender TI portal.
- Defender TI crawls include scraping the DOM and images (OCR) of those OSINT webpages to capture indicators of compromise and intelligence article content.
CVE-IDs
- CVE-IDs are indexed daily from MITRE's list and NIST's National Vulnerability Database.
Whois data
- Whois records are indexed from various sources on a recurring basis.
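
As an illustration of the OSINT collection item above, the sketch below pulls two kinds of indicators out of article text: CVE-IDs and IPv4 addresses. It is a simplified assumption of how such extraction could work, not Defender TI's pipeline. The extract_indicators and refang helpers, the regular expressions, and the sample sentence are all made up for this example. Security articles often "defang" indicators (for example, writing 203.0.113[.]5) so that they are not clickable, so the text is restored before matching.

```python
import ipaddress
import re

CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def refang(text):
    """Undo two common defanging conventions: [.] for dots and hxxp for http."""
    return text.replace("[.]", ".").replace("hxxp", "http")


def extract_indicators(text):
    """Return the unique CVE-IDs and valid IPv4 addresses found in `text`."""
    clean = refang(text)
    cve_ids = sorted({match.upper() for match in CVE_RE.findall(clean)})

    ipv4 = []
    for candidate in IPV4_RE.findall(clean):
        try:
            # Rejects look-alikes such as 999.1.1.1.
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        if candidate not in ipv4:
            ipv4.append(candidate)
    return {"cve_ids": cve_ids, "ipv4": ipv4}


# Made-up sample sentence, not taken from any real article.
SAMPLE = (
    "The actor exploited CVE-2021-44228 and cve-2022-30190 from "
    "203.0.113[.]5; build 999.1.1.1 is not an address."
)
print(extract_indicators(SAMPLE))
```

Validating candidates with the ipaddress module rather than trusting the regular expression alone keeps version strings and similar dotted numbers out of the results. A fuller extractor would also cover domains, URLs, and file hashes.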