content delivery network
Prohibiting Domain Fronting with Azure Front Door and Azure CDN Standard from Microsoft (classic)
Azure Front Door and Azure CDN Standard from Microsoft (classic) are postponing the domain fronting blocking enforcement to January 22, 2024, and will add two log fields by December 25, 2023, to help you check whether your resources exhibit domain fronting behavior.

Azure Front Door: Implementing lessons learned following October outages
Abhishek Tiwari, Vice President of Engineering, Azure Networking
Amit Srivastava, Principal PM Manager, Azure Networking
Varun Chawla, Partner Director of Engineering

Introduction

Azure Front Door is Microsoft's advanced edge delivery platform, combining Content Delivery Network (CDN), global security, and traffic distribution into a single unified offering. By using Microsoft's extensive global edge network, Azure Front Door ensures efficient content delivery and advanced security through 210+ global and local points of presence (PoPs) strategically positioned close to both end users and applications. As the central global entry point from the internet onto customer applications, we power mission-critical customer applications as well as many of Microsoft's internal services.

We have a highly distributed, resilient architecture, which protects against failures at the server, rack, site, and even regional level. This resiliency is achieved through our intelligent traffic management layer, which monitors failures and load balances traffic at the server, rack, or edge-site level within the primary ring, supplemented by a secondary fallback ring which accepts traffic in case of primary traffic overflow or broad regional failures. We also deploy a traffic shield as a terminal safety net to ensure that in the event of a managed or unmanaged edge site going offline, end user traffic continues to flow to the next available edge site.

Like any large-scale CDN, we deploy each customer configuration across a globally distributed edge fleet, densely shared with thousands of other tenants. While this architecture enables global scale, it carries the risk that certain incompatible configurations, if not contained, can propagate broadly and quickly, resulting in a large blast radius of impact. Here we describe how the two recent service incidents impacting Azure Front Door have reinforced the need to accelerate ongoing investments in hardening our resiliency and tenant isolation strategy, to reduce both the likelihood and the scale of impact from this class of risk.

October incidents: recap and key learnings

Azure Front Door experienced two service incidents, on October 9th and October 29th, both with customer-impacting service degradation.

On October 9th: A manual cleanup of stuck tenant metadata bypassed our configuration protection layer, allowing incompatible metadata to propagate beyond our canary edge sites. This metadata was created on October 7th by a control-plane defect triggered by a customer configuration change. While the protection system initially blocked the propagation, the manual override operation bypassed our safeguards. The incompatible configuration reached the next stage and activated a latent data-plane defect in a subset of edge sites, causing availability impact primarily across Europe (~6%) and Africa (~16%). You can learn more about this issue in detail at https://aka.ms/AIR/QNBQ-5W8

On October 29th: A different sequence of configuration changes across two control-plane versions produced incompatible metadata. Because the failure mode in the data plane was asynchronous, all of the health-check validations embedded in our protection systems passed during the rollout. The incompatible customer configuration metadata propagated globally through a staged rollout and also updated the "last known good" (LKG) snapshot. Following this global rollout, the asynchronous process in the data plane exposed another defect, which caused crashes.
This impacted connectivity and DNS resolution for all applications onboarded to our platform. Extended recovery time amplified the impact on customer applications and Microsoft services. You can learn more about this issue in detail at https://aka.ms/AIR/YKYN-BWZ

We took away a number of clear and actionable lessons from these incidents, applicable not just to our service but to any multi-tenant, high-density, globally distributed system.

Configuration resiliency – Valid configuration updates should propagate safely, consistently, and predictably across our global edge, while incompatible or erroneous configurations never propagate beyond canary environments.

Data plane resiliency – Configuration processing in the data plane must not cause availability impact to any customer.

Tenant isolation – Traditional isolation techniques such as hardware partitioning and virtualization are impractical at edge sites. This requires innovative sharding techniques to ensure single-tenant-level isolation – a must-have to reduce potential blast radius.

Accelerated and automated recovery time objective (RTO) – The system should be able to automatically revert to the last known good configuration within an acceptable RTO. For a service like Azure Front Door, we deem ~10 minutes to be a practical RTO for our hundreds of thousands of customers at every edge site.

Post-outage, given the severity of an impact in which an incompatible configuration propagated globally, we made the difficult decision to temporarily block configuration changes in order to expedite the rollout of additional safeguards. Between October 29th and November 5th, we prioritized and deployed immediate hardening steps before re-enabling configuration changes. We are confident that the system is stable, and we are continuing to invest in additional safeguards to further strengthen the platform's resiliency.

The following summarizes each learning category, its goal, the repairs, and their status:

Learning category: Safe customer configuration deployment
Goal: Incompatible configuration never propagates beyond canary
Repairs: Control plane and data plane defect fixes; forced synchronous configuration processing; additional stages with extended bake time; early detection of crash state
Status: Completed

Learning category: Data plane resiliency
Goal: Configuration processing cannot impact data plane availability
Repairs: Manage data-plane lifecycle to prevent outages caused by configuration-processing defects (Completed); isolated work process in every data plane server to process and load the configuration (January 2026)

Learning category: 100% Azure Front Door resiliency posture for Microsoft internal services
Goal: Microsoft operates an isolated, independent Active/Active fleet with automatic failover for critical Azure services
Repairs: Phase 1 – Onboarded the batch of critical services impacted in the October 29th outage, running on a day-old configuration (Completed); Phase 2 – Automation and hardening of operations, auto-failover, and self-management of Azure Front Door onboarding for additional services (March 2026)

Learning category: Recovery improvements
Goal: Data plane crash recovery in under 10 minutes
Repairs: Data plane boot-up time optimized via local cache (~1 hour) (Completed); accelerate recovery time to under 10 minutes (March 2026)

Learning category: Tenant isolation
Goal: No configuration or traffic regression can impact other tenants
Repairs: Micro-cellular Azure Front Door with ingress-layered shards (June 2026)

This blog is the first in a multi-part series on Azure Front Door resiliency. In this blog, we will focus on configuration resiliency—how we are making the configuration pipeline safer and more robust.
Subsequent blogs will cover tenant isolation and recovery improvements.

How our configuration propagation works

Azure Front Door configuration changes can be broadly classified into three distinct categories.

Service code & data – These include all aspects of the Azure Front Door service: the management plane, control plane, data plane, and configuration propagation system. Azure Front Door follows a safe deployment practice (SDP) process to roll out newer versions of the management, control, or data plane over a period of approximately 2-3 weeks. This ensures that a regression in software does not have a global impact. However, latent bugs that escape pre-validation and the SDP rollout can remain undetected until a specific combination of customer traffic patterns or configuration changes triggers the issue.

Web Application Firewall (WAF) & L7 DDoS platform data – These datasets are used by Azure Front Door to deliver security and load-balancing capabilities. Examples include GeoIP data, malicious attack signatures, and IP reputation signatures. Updates to these datasets occur daily through multiple SDP stages with an extended bake time of over 12 hours to minimize the risk of global impact during rollout. This dataset is shared across all customers and the platform, and it is validated immediately since it does not depend on variations in customer traffic or configuration steps.

Customer configuration data – This covers any customer configuration change—whether a routing rule update, backend pool modification, WAF rule change, or security policy change. Given the nature of these changes, the edge delivery / CDN industry expectation is to propagate them globally in 5-10 minutes. Both outages stemmed from issues within this category.

All configuration changes, including customer configuration data, are processed through a multi-stage pipeline designed to ensure correctness before global rollout across Azure Front Door's 200+ edge locations. At a high level, Azure Front Door's configuration propagation system has two distinct components:

Control plane – Accepts customer API/portal changes (create/update/delete for profiles, routes, WAF policies, origins, etc.) and translates them into internal configuration metadata which the data plane can understand.

Data plane – Globally distributed edge servers that terminate client traffic, apply routing/WAF logic, and proxy to origins using the configuration produced by the control plane.

Between these two halves sits a multi-stage configuration rollout pipeline with a dedicated protection system (known as ConfigShield):

Changes flow through multiple stages (pre-canary, canary, expanding waves to production) rather than going global at once.

Each stage is health-gated: the data plane must remain within strict error and latency thresholds before proceeding. Each stage's health check also rechecks previous stages' health for any regressions.

A successfully completed rollout updates a last known good (LKG) snapshot used for automated rollback.

Historically, rollout targeted global completion in roughly 5–10 minutes, in line with industry standards.
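To make the staged, health-gated model above concrete, here is a minimal sketch of such a rollout pipeline. This is illustrative only, not Azure Front Door's actual implementation; the stage names, thresholds, and helper functions are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stage:
    name: str              # e.g. "pre-canary", "canary", "wave-1", "global"
    bake_time_s: int       # how long to observe the stage before advancing
    max_error_rate: float  # health gate: maximum tolerated error rate


def apply_config(stage_name: str, config_version: str) -> None:
    """Placeholder for pushing a configuration version to one stage of the fleet."""
    print(f"applying {config_version} to {stage_name}")


class RolloutPipeline:
    """Advance a configuration stage by stage, gating on health at every step."""

    def __init__(self, stages: List[Stage], error_rate: Callable[[str], float]):
        self.stages = stages
        self.error_rate = error_rate               # observed error rate per stage
        self.last_known_good: Optional[str] = None

    def deploy(self, config_version: str) -> bool:
        completed: List[Stage] = []
        for stage in self.stages:
            apply_config(stage.name, config_version)
            time.sleep(min(stage.bake_time_s, 0))  # stand-in for the real bake time
            # Gate on this stage AND re-check earlier stages for regressions.
            for s in completed + [stage]:
                if self.error_rate(s.name) > s.max_error_rate:
                    self.roll_back(completed + [stage])
                    return False
            completed.append(stage)
        # Only a fully successful rollout becomes the new LKG snapshot.
        self.last_known_good = config_version
        return True

    def roll_back(self, stages: List[Stage]) -> None:
        if self.last_known_good is not None:
            for stage in stages:
                apply_config(stage.name, self.last_known_good)


if __name__ == "__main__":
    stages = [Stage("pre-canary", 300, 0.001),
              Stage("canary", 900, 0.001),
              Stage("global", 0, 0.001)]
    pipeline = RolloutPipeline(stages, error_rate=lambda name: 0.0)
    print("rollout succeeded:", pipeline.deploy("config-v42"))
```

The property this sketch tries to capture is the one the post calls out: a bad configuration should fail a gate at canary and be rolled back to the LKG snapshot before it ever reaches the global stage.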
Customer configuration processing in the Azure Front Door data plane stack

Customer configuration changes in Azure Front Door traverse multiple layers—from the control plane through the deployment system—before being converted into FlatBuffers at each Azure Front Door node. These FlatBuffers are then loaded by the Azure Front Door data plane stack, which runs as Kubernetes pods on every node.

FlatBuffer composition: Each FlatBuffer references several sub-resources such as WAF and Rules Engine schematic files, SSL certificate objects, and URL signing secrets.

Data plane architecture:

Master process: Accepts configuration changes (memory-mapped files with references) and manages the lifecycle of worker processes.

Workers: L7 proxy processes that serve customer traffic using the applied configuration.

Processing flow for each configuration update:

1. Load and apply in master: The transformed configuration is loaded and applied in the master process. Cleanup of unused references occurs synchronously except for certain categories → the October 9 outage occurred during this step, due to a crash triggered by incompatible metadata.

2. Apply to workers: The configuration is applied to all worker processes without memory overhead (FlatBuffers are memory-mapped).

3. Serve traffic: Workers start consuming the new FlatBuffers for new requests; in-flight requests continue using the old buffers. Old buffers are queued for cleanup after completion.

4. Feedback to deployment service: Positive feedback signals readiness for further rollout.

5. Cleanup: FlatBuffers are freed asynchronously by the master process after all workers load the updates → the October 29 outage occurred during this step, due to a latent bug in the reference-counting logic.

The October incidents showed we needed to strengthen key aspects of configuration validation, propagation safeguards, and runtime behavior. During the October 9th incident, the protection system worked as intended but was later bypassed by our engineering team during a manual cleanup operation. During the October 29th incident, the incompatible customer configuration metadata progressed through the protection system before the delayed asynchronous processing task resulted in the crash.
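The node-level flow above can be sketched in a few lines. This is a simplified, hypothetical model (the real data plane is a native L7 proxy stack whose memory mapping and reference counting are far more involved); it only illustrates how new requests pick up the new buffer while in-flight requests pin the old one, and why a defect in the cleanup path is so dangerous.

```python
import threading
from typing import List, Optional


class ConfigBuffer:
    """Stand-in for a memory-mapped FlatBuffer shared by the master and workers."""

    def __init__(self, version: str):
        self.version = version
        self._refs = 0
        self._lock = threading.Lock()

    def acquire(self) -> "ConfigBuffer":
        with self._lock:
            self._refs += 1
        return self

    def release(self) -> None:
        with self._lock:
            self._refs -= 1
            if self._refs == 0:
                # In the real stack the buffer is unmapped/freed asynchronously;
                # a latent bug in this path is what the October 29 incident exercised.
                print(f"freed {self.version}")


class Worker:
    """L7 proxy process: serves traffic using whichever buffer is currently active."""

    def __init__(self) -> None:
        self.active: Optional[ConfigBuffer] = None

    def apply(self, new_buffer: ConfigBuffer) -> None:
        old, self.active = self.active, new_buffer.acquire()  # new requests use new config
        if old is not None:
            old.release()                                      # old buffer freed once unused

    def handle_request(self) -> None:
        assert self.active is not None
        cfg = self.active.acquire()   # an in-flight request pins its buffer
        try:
            pass                      # route, apply WAF rules, proxy to origin using cfg
        finally:
            cfg.release()


class Master:
    """Accepts a new configuration and fans it out to every worker."""

    def __init__(self, workers: List[Worker]):
        self.workers = workers

    def on_new_config(self, version: str) -> bool:
        buffer = ConfigBuffer(version)
        for worker in self.workers:
            worker.apply(buffer)      # memory-mapped, so no per-worker copy is made
        return True                   # positive feedback to the deployment service
```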
Configuration propagation safeguards

Based on learnings from the incidents, we are implementing a comprehensive set of configuration resiliency improvements. These changes aim to guarantee that no sequence of configuration changes can trigger instability in the data plane, and to ensure quicker recovery in the event of anomalies.

Strengthening configuration generation safety

This improvement pivots on a "shift-left" strategy: we want to catch regressions early, before they propagate to production. It also includes fixing the latent defects that were the proximate cause of the outages.

Fixing outage-specific defects – We have fixed the control-plane defects that could generate incompatible tenant metadata under specific operation sequences. We have also remediated the associated data-plane defects.

Stronger cross-version validation – We are expanding our test and validation suite to account for changes across multiple control-plane build versions. This is expected to be fully completed by February 2026.

Fuzz testing – Automated fuzzing and testing of the metadata generation contract between the control plane and the data plane. This allows us to generate an expanded set of invalid/unexpected configuration combinations that might not be achievable with traditional test cases alone. This is expected to be fully completed by February 2026.

Preventing incompatible configurations from being propagated

This segment of the resiliency strategy strives to ensure that a potentially dangerous configuration change never propagates beyond the canary stage.

Protection system is "always-on" – Enhancements to operational procedures and tooling prevent bypass in all scenarios (including internal cleanup/maintenance), and any cleanup must flow through the same guarded stages and health checks as standard configuration changes. This is completed.

Making rollout behavior more predictable and conservative – Configuration processing in the data plane is now fully synchronous. Any data plane issue caused by incompatible metadata can be detected within 10 seconds at every stage. This is completed.

Enhancements to the deployment pipeline – Additional stages during rollout and extended bake time between stages serve as an additional safeguard during configuration propagation. This is completed. Recovery tool improvements now make it easier to revert to any previous version of the LKG with a single click. This is completed.

These changes significantly improve system safety. Post-outage, we have increased the configuration propagation time to approximately 45 minutes. We are working towards reducing configuration propagation time closer to pre-incident levels once the additional safeguards covered in the Data plane resiliency section below are completed by mid-January 2026.

Data plane resiliency

Data plane recovery was the toughest part of the recovery efforts during the October incidents. We must ensure fast recovery as well as resilience to configuration-processing issues for the data plane. To address this, we implemented changes that decouple the data plane from incompatible configuration changes. With these enhancements, the data plane continues operating on the last known good configuration—even if the configuration pipeline safeguards fail to protect as intended.

Decoupling the data plane from configuration changes

Each server's data plane consists of a master process, which accepts configuration changes and manages the lifecycle of multiple worker processes that serve customer traffic. One of the critical reasons for the prolonged outage in October was that, due to latent defects in the data plane, the master process crashed when presented with a bad configuration. The master is a critical command-and-control process, and when it crashes it takes down the entire data plane on that node. Recovery of the master process involves reloading hundreds of thousands of configurations from scratch and took approximately 4.5 hours.

We have since made changes to the system to ensure that even if the master process crashes for any reason, including incompatible configuration data being presented, the workers remain healthy and able to serve traffic. During such an event, the workers would not be able to accept new configuration changes but would continue to serve customer traffic using the last known good configuration. This work is completed.

Introducing Food Taster: strengthening config propagation resiliency

In our efforts to further strengthen Azure Front Door's configuration propagation system, we are introducing an additional configuration safeguard, known internally as Food Taster, which protects the master and worker processes from any configuration-change-related incidents, thereby ensuring data plane resiliency. The principle is simple: every data-plane server will have a redundant and isolated process – the Food Taster – whose only job is to ingest and process new configuration metadata first and then pass validated configuration changes to the active data plane. This redundant worker does not accept any customer traffic.

All configuration processing in the Food Taster is fully synchronous. That means we do all parsing, validation, and any expensive or risky work up front, and we do not move on until the Food Taster has either proven the configuration is safe or rejected it. Only when the Food Taster successfully loads the configuration and returns "Config OK" does the master process proceed to load the same config and then instruct the worker processes to do the same. If anything goes wrong in the Food Taster, the failure is contained to that isolated process; the master and traffic-serving workers never see the invalid configuration.
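As a rough sketch of the pattern (hypothetical names and a toy validation rule; the real Food Taster validates Azure Front Door's FlatBuffer metadata inside the data plane), the idea is simply "validate in a sacrificial, isolated process first; promote only on an explicit OK":

```python
import multiprocessing as mp
import queue


def parse_and_validate(blob: bytes) -> bytes:
    """Toy validation rule standing in for full configuration parsing and loading."""
    if not blob.startswith(b"{"):
        raise ValueError("not a valid configuration payload")
    return blob


def taste(blob: bytes, verdict) -> None:
    """Runs in an isolated process; does all risky work up front and serves no traffic."""
    try:
        parse_and_validate(blob)
        verdict.put("ok")
    except Exception as exc:               # a rejection (or crash) is contained here
        verdict.put(f"rejected: {exc}")


def promote_if_safe(blob: bytes, workers) -> bool:
    verdict = mp.Queue()
    taster = mp.Process(target=taste, args=(blob, verdict))
    taster.start()
    taster.join(timeout=10)                # hangs and hard crashes surface within seconds
    if taster.is_alive():
        taster.terminate()                 # hung taster: treat the configuration as unsafe
        return False
    try:
        result = verdict.get(timeout=1)
    except queue.Empty:
        return False                       # taster crashed before reporting: unsafe
    if taster.exitcode != 0 or result != "ok":
        return False                       # rejected: workers keep the last known good
    for worker in workers:
        worker.load(blob)                  # only now does the active data plane load it
    return True
```

Here `workers` stands in for whatever interface the traffic-serving processes expose for loading an already-validated configuration; the point of the design is that they are only ever handed configurations the sacrificial process has survived.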
We expect this safeguard to reach production globally in the January 2026 timeframe. Introducing this component will also allow us to return closer to pre-incident configuration propagation times while ensuring data plane safety.

Closing

This is the first in a series of planned blogs on Azure Front Door resiliency enhancements. We are continuously improving platform safety and reliability and will transparently share updates through this series. Upcoming posts will cover advancements in tenant isolation and improvements to recovery time objectives (RTO).

We deeply value our customers' trust in Azure Front Door. The October incidents reinforced how critical configuration resiliency is, and we are committed to exceeding industry expectations for safety, reliability, and transparency. By hardening our configuration pipeline, strengthening safety gates, and reinforcing isolation boundaries, we're making Azure Front Door even more resilient so your applications can be too.

Revolutionizing hyperscale application delivery and security: The New Azure Front Door edge platform
In this introductory blog on the new Azure Front Door next-generation platform, we go over the motivations, design choices, and learnings from this undertaking, which helped us achieve massive gains in scalability, security, and resiliency.

Azure Networking Portfolio Consolidation
Overview

Over the past decade, Azure Networking has expanded rapidly, bringing incredible tools and capabilities to help customers build, connect, and secure their cloud infrastructure. But we've also heard strong feedback: with over 40 different products, it hasn't always been easy to navigate and find the right solution. The complexity often led to confusion, slower onboarding, and missed capabilities. That's why we're excited to introduce a more focused, streamlined, and intuitive experience across Azure.com, the Azure portal, and our documentation, pivoting around four core networking scenarios:

Network foundations: Network foundations provide the core connectivity for your resources, using Virtual Network, Private Link, and DNS to build the foundation for your Azure network. Try it with this link: Network foundations

Hybrid connectivity: Hybrid connectivity securely connects on-premises, private, and public cloud environments, enabling seamless integration, global availability, and end-to-end visibility, presenting major opportunities as organizations advance their cloud transformation. Try it with this link: Hybrid connectivity

Load balancing and content delivery: Load balancing and content delivery helps you choose the right option to ensure your applications are fast, reliable, and tailored to your business needs. Try it with this link: Load balancing and content delivery

Network security: Securing your environment is just as essential as building and connecting it. The Network Security hub brings together Azure Firewall, DDoS Protection, and Web Application Firewall (WAF) to provide a centralized, unified approach to cloud protection. With unified controls, it helps you manage security more efficiently and strengthen your security posture. Try it with this link: Network security

This new structure makes it easier to discover the right networking services and get started with just a few clicks, so you can focus more on building and less on searching.

What you'll notice:

Clearer starting points: Azure Networking is now organized around four core scenarios and twelve essential services, reflecting the most common customer needs. Additional services are presented within the context of these scenarios, helping you stay focused and find the right solution without feeling overwhelmed.

Simplified choices: We've merged overlapping or closely related services to reduce redundancy. That means fewer, more meaningful options that are easier to evaluate and act on.

Sunsetting outdated services: To reduce clutter and improve clarity, we're sunsetting underused offerings such as white-label CDN services and China CDN. These capabilities have been rolled into newer, more robust services, so you can focus on what's current and supported.

What this means for you

Faster decision-making: With clearer guidance and fewer overlapping products, it's easier to discover what you need and move forward confidently.

More productive sales conversations: With this simplified approach, you'll get more focused recommendations and less confusion among sellers.

Better product experience: This update makes the Azure Networking portfolio more cohesive and consistent, helping you get started quickly, stay aligned with best practices, and unlock more value from day one.

The portfolio consolidation initiative is a strategic effort to simplify and enhance the Azure Networking portfolio, ensuring better alignment with customer needs and industry best practices.
By focusing on top-line services, combining related products, and retiring outdated offerings, Azure Networking aims to provide a more cohesive and efficient product experience.

Azure.com

Before: Our original solution page on Azure.com was disorganized and static, displaying a small portion of services in no discernible order.

After: The revised solution page is now dynamic, allowing customers to click deeper into each networking and network security category and displaying the top-line services, simplifying the customer experience.

Azure Portal

Before: With over 40 networking services available, we know it can feel overwhelming to figure out what's right for you and where to get started.

After: To make it easier, we've introduced four streamlined networking hubs, each built around a specific scenario, to help you quickly identify the services that match your needs. Each offers an overview to set the stage, key services to help you get started, guidance to support decision-making, and a streamlined left-hand navigation for easy access to all services and features.

Documentation

For documentation, we reviewed our current assets and created new assets aligned with the changes in the portal experience. Like Azure.com, we found the old experiences were disorganized and not well aligned. We updated our assets to focus on our top-line networking services and to call out the pillars. Our belief is that these changes will allow our customers to more easily find the relevant and important information they need for their Azure infrastructure.

Azure Network Hub

Before the updates, we had a hub page organized around different categories and not well laid out. In the updated hub page, we provide relevant links for top-line services within all of the Azure networking scenarios, as well as a section linking to each scenario's hub page.

Scenario Hub pages

We added scenario hub pages for each of the scenarios. These provide our customers with a central hub for information about the top-line services for each scenario and how to get started. We also included common scenarios and use cases, along with references for deeper learning across the Azure Architecture Center, Well-Architected Framework, and Cloud Adoption Framework libraries.

Scenario Overview articles

We created new overview articles for each scenario. These articles were designed to provide customers with an introduction to the services included in each scenario, guidance on choosing the right solutions, and an introduction to the new portal experience. Here's the Load balancing and content delivery overview:

Documentation links

Azure Networking hub page: Azure networking documentation | Microsoft Learn

Scenario Hub pages:
Azure load balancing and content delivery | Microsoft Learn
Azure network foundation documentation | Microsoft Learn
Azure hybrid connectivity documentation | Microsoft Learn
Azure network security documentation | Microsoft Learn

Scenario Overview pages:
What is load balancing and content delivery? | Microsoft Learn
Azure Network Foundation Services Overview | Microsoft Learn
What is hybrid connectivity? | Microsoft Learn
What is Azure network security? | Microsoft Learn

Improving user experience is a journey, and in the coming months we plan to do more on this. Watch for more blogs over the next few months covering further improvements.

Issue with Azure VM Conditional Access for Office 365 and Dynamic Public IP Detection
Hi all, I have a VM in Azure where I need to allow an account with MFA to bypass the requirement on this specific server when using Office 365. I've tried to achieve this using Conditional Access by excluding locations, specifically the IP range of my Azure environment. Although I've disconnected any public IPs from this server, the Conditional Access policy still isn't working as intended. The issue seems to be that it continues to detect a public IP, which changes frequently, making it impossible to exclude. What am I doing wrong?
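Not an authoritative answer, but one detail worth checking first: even with no public IP attached, an Azure VM typically still egresses through a dynamic, Azure-assigned outbound address, which is likely the changing IP the policy keeps seeing. A quick, illustrative way to confirm which public IP sign-in traffic appears to come from (run from the VM itself; it uses the public ipify service, but any "what is my IP" endpoint works):

```python
# Run from the VM: prints the public IP that external services (including
# Microsoft Entra ID sign-in) observe for this machine's outbound traffic.
import urllib.request


def current_egress_ip() -> str:
    with urllib.request.urlopen("https://api.ipify.org") as response:
        return response.read().decode().strip()


if __name__ == "__main__":
    print("Current outbound public IP:", current_egress_ip())
```

If that address changes between sign-ins, a fixed IP-range exclusion in Conditional Access cannot match it; giving the VM a stable egress IP (for example via a NAT gateway or a standard static public IP) is usually a prerequisite for location-based exclusions to behave predictably.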
Accelerate designing, troubleshooting & securing your network with Gen-AI powered tools, now GA.

We are thrilled to announce the general availability of Azure Networking skills in Copilot, an extension of Copilot in Azure and Security Copilot designed to enhance the cloud networking experience. Azure Networking Copilot is set to transform how organizations design, operate, and optimize their Azure network by providing contextualized responses tailored to networking-specific scenarios and using your network topology.

Unmasking DDoS Attacks (Part 1/3)
In today's always-online world, we take uninterrupted access to websites, apps, and digital services for granted. But lurking in the background is a cyber threat that can grind everything to a halt in an instant: DDoS attacks. These attacks don't sneak in to steal data or plant malware—they're all about chaos and disruption, flooding servers with so much traffic that they crash, slow down, or completely shut off.

Over the years, DDoS attacks have evolved from annoying nuisances to full-blown cyber weapons, capable of hitting massive scales—some even reaching terabit-level traffic. Companies have lost millions of dollars due to downtime, and even governments and critical infrastructure have been targeted. Whether you're a CTO, a business owner, a security pro, or just someone who loves tech, understanding these attacks is key to stopping them before they cause real damage.

That's where this blog series comes in. We'll be breaking down everything you need to know about DDoS attacks—how they work, real-world examples, the latest prevention strategies, and even how you can leverage Azure services to detect and defend against them. This will be a three-part series, covering:

🔹 Unmasking DDoS Attacks (Part 1): Understanding the Fundamentals and the Attacker's Playbook
What exactly is a DDoS attack, and how does an attacker plan and execute one? In this post, we'll cover the fundamentals of DDoS attacks, explore the attacker's perspective, and break down how an attack is crafted and launched. We'll also discuss the different categories of DDoS attacks and how attackers choose which strategy to use.

🔹 Unmasking DDoS Attacks (Part 2): Analyzing Known Attack Patterns & Lessons from History
DDoS attacks come in many forms, but what are the most common and dangerous attack patterns? In this deep dive, we'll explore real-world DDoS attack patterns, categorize them based on their impact, and analyze some of the largest and most disruptive DDoS attacks in history. By learning from past attacks, we can better understand how DDoS threats evolve and what security teams can do to prepare.

🔹 Unmasking DDoS Attacks (Part 3): Detection, Mitigation, and the Future of DDoS Defense
How do you detect a DDoS attack before it causes damage, and what are the best strategies to mitigate one? In this final post, we'll explore detection techniques, proactive defense strategies, and real-time mitigation approaches. We'll also discuss future trends in DDoS attacks and evolving defense mechanisms, ensuring that businesses stay ahead of the ever-changing threat landscape.

So, without further ado, let's jump right into Part 1 and start unraveling the world of DDoS attacks.

What is a DDoS Attack?

A Denial-of-Service (DoS) attack is like an internet traffic jam, but on purpose. It's when attackers flood a website or online service with so much junk traffic that it slows down, crashes, or becomes completely unreachable for real users.

Back in the early days of the internet, pulling off a DoS attack was relatively simple. Servers were smaller, and a single computer (or maybe a handful) could send enough malicious requests to take down a website. But as technology advanced and cloud computing took over, that approach stopped being effective. Today's online services run on massive, distributed cloud networks, making them way more resilient.

So, what did attackers do? They leveled up. Instead of relying on just one machine, they started using hundreds, thousands, or even millions—all spread out across the internet.
These attacks became "distributed", with waves of traffic coming from multiple sources at once. And that's how DDoS (Distributed Denial-of-Service) attacks were born. Instead of a single attacker, imagine a botnet—an army of compromised devices (anything from hacked computers to unsecured IoT gadgets)—all working together to flood a target with traffic. The result? Even the most powerful servers can struggle to stay online.

In short, a DDoS attack is just a bigger, badder version of a DoS attack, built for the modern internet. And with cloud computing making things harder to take down, attackers have only gotten more creative in their methods.

An Evolving Threat Landscape

As recently reported by Microsoft: "DDoS attacks are happening more frequently and on a larger scale than ever before. In fact, the world has seen almost a 300 percent increase in these types of attacks year over year, and it's only expected to get worse [link]". Orchestrating large-scale DDoS botnet attacks is inexpensive for attackers, and these attacks are often powered by compromised devices (e.g., security cameras, home routers, cable modems, and other IoT devices). Within the past year alone, providers across the industry have reported the following:

June 2023: Waves of L7 attacks on various Microsoft properties
March 2023: Akamai – 900 Gbps DDoS attack
February 2023: Cloudflare mitigates record-breaking 71 million request-per-second DDoS attack
August 2022: How Google Cloud blocked the largest Layer 7 DDoS attack at 46 million rps

The graphs below are from an F5 Labs report.

Figure 1: Recent trends indicate that the Technology sector is one of the most targeted segments, along with Finance and Government.

Figure 2: Attacks are evolving, and a large percentage of attacks are upgrading to application DDoS or multi-vector attacks.

As DDoS attacks get bigger and more sophisticated, we need to take a defense-in-depth approach to protect our customers every step of the way. Azure services like Azure Front Door, Azure WAF, and Azure DDoS Protection are all working on various strategies to counter these emerging DDoS attack patterns. We will cover how to effectively use these services to protect your services hosted on Azure in Part 3.

Understanding DDoS Attacks: The Attacker's Perspective

There can be many motivations behind a DDoS attack, ranging from simple mischief to financial gain, political activism, or even cyber warfare. But launching a successful DDoS attack isn't just about flooding a website with traffic—it requires careful planning, multiple test runs, and a deep understanding of how the target's infrastructure operates.

So, what does it actually mean to bring down a service? It means pushing one or more critical resources past their breaking point—until the system grinds to a halt, becomes unresponsive, or outright collapses under the pressure. Whether it's choking the network, exhausting compute power, or overloading application processes, the goal is simple: make the service so overwhelmed that legitimate users can't access it at all.

Resources Targeted During an Attack

Network capacity (bandwidth and infrastructure): The most common resource targeted in a DDoS attack. The goal is to consume all available network capacity, thereby preventing legitimate requests from getting through. This includes overwhelming routers, switches, and firewalls with excessive traffic, causing them to fail.

Processing power: By inundating a server with more requests than it can process, an attacker can cause it to slow down or even crash, denying service to legitimate users.
Memory: Attackers might attempt to exhaust the server's memory capacity, causing degradation in service or outright failure.

Disk space and I/O operations: An attacker could aim to consume the server's storage capacity or overwhelm its disk I/O operations, resulting in slowed system performance or denial of service.

Connection-based resources: In this type of attack, the resources that manage connections, such as sockets, ports, file descriptors, and connection tables in networking devices, are targeted. Overwhelming these resources can cause a disruption of service for legitimate users.

Application functionality: Specific functions of a web application can be targeted to cause a denial of service. For instance, if a web application has a particularly resource-intensive operation, an attacker may repeatedly request this operation to exhaust the server's resources.

DNS servers: A DNS server can be targeted to disrupt the resolution of domain names to IP addresses, effectively making the web services inaccessible to users.

Zero-day vulnerabilities: Attackers often exploit unknown or zero-day vulnerabilities in applications or the network infrastructure as part of their attack strategy. Since these vulnerabilities are not yet known to the vendor, no patch is available, making them an attractive target for attackers.

CDN cache bypass: HTTP flood attacks that bypass the web application caching layer that helps manage server load.

Crafting The Attack Plan

Most modern services no longer run on a single machine in someone's basement—they are hosted on cloud providers with auto-scaling capabilities and vast network capacity. While this makes them more resilient, it does not make them invulnerable. Auto-scaling has its limits, and cloud networks are shared among millions of customers, meaning attackers can still find ways to overwhelm them.

When planning a DDoS attack, attackers first analyze the target's infrastructure to identify potential weaknesses. They then select an attack strategy designed to exploit those weak points as efficiently as possible. Different DDoS attack types target different resources and have unique characteristics. Broadly, these attack strategies can be categorized into four main types:

Volumetric Attacks

For volumetric attacks, the attacker's goal is to saturate the target's system resources by generating a high volume of traffic. To weaponize this attack, attackers usually employ botnets or compromised systems, or even use other cloud providers (paid for legitimately or fraudulently), to generate a large volume of traffic. The traffic is directed towards the target's network, making it difficult for legitimate traffic to reach the services. Examples: SYN flood, UDP flood, ICMP flood, DNS flood, HTTP flood.

Amplification Attacks

Amplification attacks are a cunning tactic where attackers seek to maximize the impact of their actions without expending significant resources. Through crafty exploitation of vulnerabilities or features in systems, such as using reflection-based methods or taking advantage of application-level weaknesses, they make small queries or requests that produce disproportionately large responses or resource consumption on the target's side. Examples: DNS amplification, NTP amplification, memcached reflection.
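As a rough, illustrative calculation (the request and response sizes below are typical published figures, not measurements from any specific incident), the appeal of reflection becomes obvious once you compute the bandwidth amplification factor:

```latex
\[
\text{Amplification factor} = \frac{\text{bytes the reflector sends to the victim}}{\text{bytes the attacker sends to the reflector}}
\]
\[
\text{Example (DNS reflection):}\qquad \frac{\approx 3{,}000\ \text{byte response}}{\approx 60\ \text{byte query}} \approx 50\times
\]
```

Protocols with even larger response-to-request ratios, such as NTP monlist or exposed memcached endpoints, push this factor into the hundreds or even tens of thousands, which is why a relatively small amount of attacker bandwidth can still produce terabit-scale floods.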
Low and Slow Attacks

Non-volumetric exhaustion attacks focus on depleting specific resources within a system or network rather than inundating it with a sheer volume of traffic. By exploiting inherent limitations or design aspects, these attacks selectively target elements such as connection tables, CPU, or memory, leading to resource exhaustion without the need for a high volume of traffic, making this a very attractive strategy for attackers. Attacks such as Slowloris and RUDY subtly deplete server resources like connections or CPU by mimicking legitimate traffic, making them difficult to detect. Examples: Slowloris, R-U-Dead-Yet? (RUDY).

Vulnerability-Based Attacks

Instead of relying on sheer traffic volume, these attacks exploit known vulnerabilities in software or services. The goal isn't just to overwhelm resources but to crash, freeze, or destabilize a system by taking advantage of flaws in how it processes certain inputs.

This type of attack is arguably the hardest to craft because it requires deep knowledge of the technology stack a service is running on. Attackers must painstakingly research software versions, configurations, and known vulnerabilities, then carefully craft malicious "poison pill" requests designed to trigger a failure. It's a game of trial and error, often requiring multiple test runs before finding a request that successfully brings down the system.

It's also one of the most difficult attacks to defend against. Unlike volumetric attacks, which flood a service with traffic that security tools can detect, a vulnerability-based attack can cause a software crash so severe that it prevents the system from even generating logs or attack traffic metrics. Without visibility into what happened, detection and mitigation become incredibly challenging. Examples: Apache Killer, Log4Shell.

Executing The Attack

Now that an attacker has finalized their attack strategy and identified which resource(s) to exhaust, they still need a way to execute the attack. They need the right tools and infrastructure to generate the overwhelming force required to bring a target down. Attackers have multiple options depending on their technical skills, resources, and objectives:

Booters & stressers – Renting attack power from popular botnets.
Amplification attacks – Leveraging publicly available services (like DNS or NTP servers) to amplify attack traffic.
Cloud abuse – Hijacking cloud VMs or misusing free-tier compute resources to generate attacks.

But when it comes to executing large-scale, persistent, and devastating DDoS attacks, one method stands above the rest: botnets.

Botnets: The Powerhouse Behind Modern DDoS Attacks

A botnet is a network of compromised devices—computers, IoT gadgets, cloud servers, and even smartphones—all controlled by an attacker. These infected devices (known as bots or zombies) remain unnoticed by their owners while quietly waiting for attack commands. Botnets revolutionized DDoS attacks, making them:

Massive in scale – Some botnets include millions of infected devices, generating terabits of attack traffic.
Hard to block – Since the traffic comes from real, infected machines, it's difficult to filter out malicious requests.
Resilient – Even if some bots are shut down, the remaining network continues the attack.

But how do attackers build, control, and launch a botnet-driven DDoS attack? The secret lies in Command and Control (C2) systems.

How a Botnet Works: Inside the Attacker's Playbook

Infecting Devices: Building the Army

Attackers spread malware through phishing emails, malicious downloads, unsecured APIs, or IoT vulnerabilities. Once infected, a device becomes a bot, silently connecting to the botnet's network.
IoT devices (smart cameras, routers, smart TVs) are especially vulnerable due to poor security.

Command & Control (C2) – The Brain of the Botnet

A botnet needs a Command & Control (C2) server, which acts as its central command center. The attacker sends instructions through the C2 server, telling bots when, where, and how to attack. Types of C2 models:

Centralized C2 – A single server controls all bots (simpler to manage, but a single point of failure that is easier to take down).
Peer-to-Peer (P2P) C2 – Bots communicate among themselves, making takedowns much harder.
Fast Flux C2 – C2 infrastructure constantly changes IP addresses to avoid detection.

Launching the Attack: Overwhelming the Target

When the attacker gives the signal, the botnet unleashes the attack. Bots flood the target with traffic, connection requests, or amplification exploits. Since the traffic comes from thousands of real, infected devices, distinguishing attackers from normal users is extremely difficult. Botnets use encryption, proxy networks, and C2 obfuscation to stay online, and some botnets use hijacked cloud servers to further hide their origins.

Famous Botnets & Their Impact

Mirai (2016) – One of the most infamous botnets, Mirai infected IoT devices to launch a 1.2 Tbps DDoS attack, taking down Dyn DNS and causing major outages across Twitter, Netflix, and Reddit.
Mozi (2020-present) – A peer-to-peer botnet with millions of IoT bots worldwide.
Meris (2021) – Hit 2.5 million RPS (requests per second), setting records for application-layer attacks.

Botnets have transformed DDoS attacks, making them larger, harder to stop, and widely available on the dark web. With billions of internet-connected devices, botnets are only growing in size and sophistication. Later in this series we will cover the botnet detection and mitigation strategies employed by Azure Front Door and Azure WAF against such large DDoS attacks.

Wrapping Up Part 1

With that, we've come to the end of Part 1 of our Unmasking DDoS Attacks series. To summarize, we've covered:

✅ The fundamentals of DDoS attacks—what they are and why they're dangerous.
✅ The different categories of DDoS attacks—understanding how they overwhelm resources.
✅ The attacker's perspective—how DDoS attacks are planned, strategized, and executed.
✅ The role of botnets—why they are the most powerful tool for large-scale attacks.

This foundational knowledge is critical to understanding the bigger picture of DDoS threats—but there's still more to uncover. Stay tuned for Part 2, where we'll dive deeper into well-known DDoS attack patterns, examine some of the biggest DDoS incidents in history, and explore what lessons we can learn from past attacks to better prepare for the future. See you in Part 2!