high availability
280 Topics
Announcing Azure Infrastructure Resiliency Manager Public Preview
At Microsoft Build 2026, we are thrilled to announce that Azure Infrastructure Resiliency Manager is now available in public preview, open to all Azure customers. Azure Infrastructure Resiliency Manager is not a replacement for individual Azure resiliency features; it is the unifying layer that connects them into a coherent, goal-driven workflow. It leverages and complements Availability Zones, Azure Advisor, Azure Chaos Studio, Azure Monitor, and Azure Copilot, adding purposeful orchestration that turns isolated capabilities into a complete resiliency strategy. The preview already covers a broad range of Azure resource types and zone-redundant configurations, from virtual machines and databases to AKS clusters and networking, with continued expansion planned.
The new platform is built on a foundational belief: achieving application resilience is a continuous journey, not a one-time configuration task. That journey is organized into three actionable phases: Start Resilient, Get Resilient, and Stay Resilient. Each phase delivers measurable customer value such as reduced downtime risk, faster recovery, and greater operational confidence.
Start resilient: Embedding resiliency from day one
Starting resilient means treating resiliency as a fundamental architectural requirement, not an afterthought. Azure Infrastructure Resiliency Manager makes it straightforward to design zone-resilient applications from the outset, eliminating costly retrofits and reducing risk before your first deployment.
Resiliency Agent: Your AI-powered architecture advisor
The standout capability in this preview is the Resiliency Agent, a conversational, AI-powered assistant embedded directly in the Azure Portal. Designed for architects and developers, the Resiliency Agent allows teams to validate and refine resiliency strategies using plain language.
For example, you might enter a prompt such as "I'm designing a three-tier web app with VMs, a Flexible PostgreSQL database, and a Standard Load Balancer" and ask the agent what zone-resiliency requirements apply. The Resiliency Agent analyzes your plan, identifies single points of failure, and recommends specific changes: enabling zone redundancy for the database, deploying VMs across zones, or upgrading to zone-redundant load balancers. It delivers a structured, per-resource summary that makes the path to resiliency explicit and actionable.
Infrastructure-as-Code generation and validation
Beyond design guidance, Infrastructure Resiliency Manager accelerates implementation. You can ask the Resiliency Agent to generate Infrastructure-as-Code (IaC) templates (ARM, Bicep, or Terraform) with all resiliency configurations pre-built and ready to deploy. A generated Bicep template, for example, automatically includes zone-redundant settings for databases, VMs, and load balancers aligned to your stated goals. The agent also validates existing IaC templates: upload a template and receive a natural language assessment of resiliency gaps, complete with targeted suggestions and code snippets to close them. This eliminates manual review overhead and ensures every new deployment starts with a resilient foundation. By embedding resiliency into the design and deployment lifecycle from day one, organizations avoid expensive redesigns, accelerate time-to-market, and bring new services to production already meeting high-availability standards.
Get resilient: Closing gaps in existing applications
Most Azure customers have workloads built over months or years that may not fully meet today's resiliency requirements. Infrastructure Resiliency Manager delivers a centralized, goal-driven view of your current environment's resilience posture, along with prioritized, actionable recommendations to close every gap.
Goal-driven resiliency posture
Define what constitutes your application by grouping resources across regions, subscriptions, or resource groups, including tag-based grouping, using Service Groups. Once your application boundary is established, assign a resiliency goal: for example, zone-failure tolerance for all components, or specific data replication requirements for critical services. The platform assesses every resource against that goal and presents a clear, single-pane-of-glass resiliency posture showing which resources meet the goal, which are non-resilient, and which remain unevaluated. This goal-driven model ensures that all subsequent guidance is precisely calibrated to your target state, not generic best practices.
Actionable, prioritized recommendations
For every resource that falls short of the defined goal, Infrastructure Resiliency Manager generates targeted remediation recommendations powered by Azure Advisor. If a virtual machine lacks zone redundancy, the platform recommends converting it to an availability zone deployment. If a database is not zone-redundant, the recommendation specifies exactly how to enable it. Critically, every recommendation includes contextual decision-making information: impacted resources, implementation steps, and qualitative cost indicators (High, Medium, Low) that flag whether a fix requires additional service spend, downtime, or redeployment. This allows engineering teams to plan remediation in a business-informed, prioritized manner. Looking ahead, the platform will also integrate application health with infrastructure health, correlating Azure Monitor SLIs and Azure Health Model insights to surface resiliency gaps with even greater precision.
Guided remediation with the Resiliency Agent
Azure Advisor identifies resiliency gaps and surfaces prioritized recommendations. Infrastructure Resiliency Manager builds on this by making those recommendations actionable.
Instead of stopping at insights, the platform provides guided execution. Each recommendation includes step-by-step portal flows, dependencies, and readiness checks required for remediation. The Resiliency Agent acts as the interactive layer on top, helping you interpret and act on these recommendations in context. For example, you can ask whether an App Service can be moved to zone-redundant storage, what downtime to expect, or what prerequisites are required, and receive clear, workload-aware answers tailored to your environment. On request, the agent can generate remediation scripts or IaC snippets to implement specific changes, or validate an existing Terraform template against Azure resiliency best practices. Importantly, the agent never makes changes autonomously: it provides information and code, while you retain full control over execution. This human-in-the-loop model accelerates remediation without sacrificing governance. The result: a curated, goal-oriented to-do list that replaces generic advice with targeted action, weighted by cost and feasibility, giving engineering leaders clear visibility into which investments will yield the greatest resilience gains.
Stay resilient: Continuous validation and recovery readiness
Resilience is not just a configuration milestone; it is an ongoing operational discipline. The "Stay Resilient" phase ensures the resilience you've built performs under pressure and that your teams are prepared to respond when real incidents occur. Azure Infrastructure Resiliency Manager delivers resiliency drills and recovery orchestration to support continuous readiness.
Resiliency drills enabled by Azure Chaos Studio
A highlight of this public preview is the introduction of availability zone failure drills, enabled by Azure Chaos Studio.
These drills simulate zone outages for your application in a controlled, safe environment: shutting down VMs in a target availability zone, forcing failover for zone-redundant databases, or stopping AKS node pools. Every fault action is based on Azure-recommended patterns for each supported resource type, providing a realistic approximation of an actual zone failure. Because Infrastructure Resiliency Manager understands which resources are intended to be zone-resilient, it automatically determines which fault actions to apply, eliminating manual configuration. For scenarios not covered out of the box, custom fault logic via Azure Automation runbooks is supported, providing the flexibility required for complex environments.
Recovery orchestration
Resiliency drills in the platform go beyond fault injection. The platform integrates with recovery plans to orchestrate the complete recovery sequence automatically after injecting faults: fault injection → failover → reprotection → failback. This full-cycle simulation measures the maximum potential downtime your application could experience during a zone outage and surfaces any recovery steps that did not execute as expected.
Real-time health monitoring and drill insights
Throughout each drill, Infrastructure Resiliency Manager provides live health monitoring powered by Azure Monitor. A built-in metrics dashboard tracks each resource's health in real time, revealing whether your application remains available and how performance holds under simulated stress. This immediate feedback surfaces resilience gaps that may not have been visible through static analysis. After each drill, the platform logs the results along with team notes and attestations, building a historical record of all resilience tests. Over time, this record demonstrates measurable improvement and supports compliance with organizational and regulatory resiliency requirements. "Stay Resilient" converts assumptions into evidence.
When an actual zone outage occurs, your teams will not be executing a failover for the first time; they will have rehearsed it. The result is a culture of proactive resilience, and the organizational confidence that your systems will deliver on their availability commitments.
Get started with the public preview
Starting today, the public preview of Azure Infrastructure Resiliency Manager is open to all Azure customers. Access the new platform through the Azure Portal by searching for "Resiliency". We encourage you to evaluate it against a test application or a production workload to gain immediate visibility into your current resiliency posture. To get the most from Infrastructure Resiliency Manager, we recommend these three starting actions:
Define a resiliency goal for a critical application and review the posture insights the platform surfaces; you may uncover gaps that were previously invisible.
Engage the Resiliency Agent to tackle a few recommendations and experience firsthand how AI-guided remediation accelerates your team's workflow.
Run a zone-down drill in a non-production environment to validate your failover and recovery processes under realistic conditions.
We believe this holistic approach will help organizations achieve a new level of operational excellence, making resiliency actionable, measurable, and deeply embedded in cloud practices. As Infrastructure Resiliency Manager moves toward general availability, we will continue incorporating your feedback and expanding capabilities to meet the demands of real-world cloud architectures. Azure Infrastructure Resiliency Manager gives you the tools to reduce downtime risk, gain clarity over your resiliency posture, and build genuine readiness for the unexpected. Join the public preview today and take the next step toward applications that don't just survive disruptions; they thrive through them.
Resources
Azure Infrastructure Resiliency Manager - Overview
Get Started with Service Groups - Microsoft Learn
Introduction to Azure Advisor - Microsoft Learn
What is Azure Chaos Studio? - Microsoft Learn
What's New in Azure Monitor - Microsoft Learn
Modern Azure Resilience with Mark Russinovich - Tech Community
1.2K Views, 5 likes, 0 Comments
Determine Availability Group Synchronization State, Minimize Data Loss When Quorum is Forced
First published on MSDN on Nov 11, 2014
When Windows cluster quorum is lost, whether because of a short-term network issue or a disaster that causes long-term downtime for the server hosting your primary replica, and you must force quorum to bring your availability group resource online quickly, a number of circumstances should be considered to eliminate or reduce data loss.
9.1K Views, 0 likes, 0 Comments
Best practices for safely performing schema changes in Azure Database for MySQL
Azure Database for MySQL - Flexible Server is built on the open-source MySQL database engine, and the service supports MySQL 8.0 and newer versions. This means that users can take advantage of the flexibility and advanced capabilities of MySQL’s latest features while benefitting from a fully managed database service. While newer versions and features can provide a lot of value, the recent issues identified with MySQL versions 8.0+ make it important to be aware of potential risks that can occur during certain operations, particularly if you are making online schema changes.
Issues with data loss and duplicate keys with Online DDL
Online Data Definition Language (DDL) operations are a powerful feature in MySQL, enabling schema changes like ALTER TABLE or OPTIMIZE TABLE with minimal impact on table availability. These operations are designed to reduce downtime by allowing concurrent reads and writes during schema modifications, making them an essential tool for managing active databases efficiently. However, a recent post on the Percona blog, "Who Ate My MySQL Table Rows?", highlights critical risks associated with MySQL 8.0.x versions after 8.0.27 and all versions beyond 8.4.y. Specifically, the open-source INPLACE algorithm, commonly used for online schema changes, can lead to data loss and duplicate key errors under certain conditions. These issues arise from constraints in the INPLACE algorithm, particularly during ALTER TABLE and OPTIMIZE TABLE operations, exposing vulnerabilities that compromise data integrity and system reliability. These risks are called out in the following bug reports:
Bug #115511: Data loss during online ALTER operations with concurrent DML
Bug #115608: Duplicate key errors caused by online ALTER operations
Documented issues related to the INPLACE algorithm (used for online DDL) can cause:
Data Loss: Rows may be accidentally deleted or become inaccessible.
Duplicate Keys: Indexes can end up with duplicate entries, leading to data consistency issues and potential replication errors.
Problems arise when INPLACE operations, such as ALTER TABLE or OPTIMIZE TABLE, run concurrently with:
DML operations (INSERT, UPDATE, DELETE): Modifications to table data during the rebuild.
Purge activity: Background cleanup operations for old row versions in InnoDB.
These scenarios can lead to anomalies resulting from race conditions and incomplete synchronization between concurrent activities.
Impact on Azure Database for MySQL - Flexible Server Customers
For Azure Database for MySQL Flexible Server customers using MySQL 8.0+ and all versions after 8.4.y, this issue is particularly critical as it affects:
Data Integrity: During schema changes such as ALTER TABLE or OPTIMIZE TABLE run using the INPLACE algorithm, data rows may be lost or duplicated if these operations run concurrently with DML activity (e.g., INSERT, UPDATE, or DELETE) or background purge tasks. This can compromise the accuracy and reliability of the database, potentially leading to incorrect query results or the loss of critical business data.
Replication Instability: Duplicate keys or missing rows can interrupt replication processes, which rely on a consistent data stream across the primary and replica servers. These issues can arise when there are concurrent insertions into the table during schema changes, leading to data inconsistencies between the primary and replicas. Such inconsistencies may result in replication lag, errors, or even a complete breakdown of high-availability setups, requiring manual intervention to restore synchronization.
Operational Downtime: Resolving these issues often involves manually syncing data or restoring backups. These recovery efforts can be time-consuming and disruptive, leading to extended downtime for applications and potential business impact.
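The risk conditions above can be turned into a simple pre-flight check. The following is a minimal Python sketch and not part of the original post: the function names classify_ddl and is_risky are hypothetical, and the text matching is deliberately naive (it does not parse SQL). It treats a statement with no explicit ALGORITHM clause as risky, on the assumption that the server may then choose INPLACE.

```python
import re

# Matches an explicit ALGORITHM clause such as "ALGORITHM=INPLACE" or
# "algorithm = copy" (MySQL allows the "=" to be omitted).
# Naive by design: a column literally named "algorithm" would confuse it.
_ALGORITHM_RE = re.compile(r"\bALGORITHM\s*=?\s*(\w+)", re.IGNORECASE)


def classify_ddl(statement: str) -> str:
    """Classify one SQL statement by the DDL algorithm it requests.

    Returns "not-ddl" for anything other than ALTER TABLE / OPTIMIZE TABLE,
    "unspecified" when no algorithm (or ALGORITHM=DEFAULT) is given,
    otherwise the requested algorithm in lower case ("copy", "inplace", ...).
    """
    text = statement.strip().rstrip(";")
    upper = text.upper()
    if upper.startswith("OPTIMIZE TABLE"):
        # OPTIMIZE TABLE takes no ALGORITHM clause; the server decides.
        return "unspecified"
    if not upper.startswith("ALTER TABLE"):
        return "not-ddl"
    match = _ALGORITHM_RE.search(text)
    if match is None:
        return "unspecified"
    algorithm = match.group(1).lower()
    return "unspecified" if algorithm == "default" else algorithm


def is_risky(statement: str) -> bool:
    """True when the statement requests INPLACE or leaves the choice open."""
    return classify_ddl(statement) in {"inplace", "unspecified"}


if __name__ == "__main__":
    for sql in (
        "ALTER TABLE orders ADD INDEX idx_c (customer_id), ALGORITHM=INPLACE",
        "ALTER TABLE orders ADD COLUMN note TEXT, ALGORITHM=COPY",
        "OPTIMIZE TABLE orders",
    ):
        print(f"{classify_ddl(sql):12} risky={is_risky(sql)}  {sql}")
```

A check like this only inspects the statement text. It cannot tell whether concurrent DML or purge activity is actually running, so it is a conservative filter rather than a guarantee.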
Recommendations for safe schema changes on Azure Database for MySQL flexible servers
To minimize the risks of data loss and duplicate keys while making schema changes, follow these best practices:
Set old_alter_table=ON to Default to COPY Algorithm
Enable the old_alter_table server parameter so that ALTER TABLE operations without a specified ALGORITHM default to using the COPY algorithm instead of INPLACE. This reduces the risk for users who do not explicitly specify the ALGORITHM in their commands. Learn more on how to configure server parameters in Azure Database for MySQL.
Avoid using ALGORITHM=INPLACE
Do not explicitly use ALGORITHM=INPLACE for ALTER TABLE commands, as it increases the risk of data loss or duplicate keys.
Back up your data before schema changes
Always perform a full on-demand backup of your server before executing schema changes. This precaution ensures data recoverability in case of unexpected issues. Learn more on how to take full on-demand backups for your server.
Avoid Concurrent DML during schema changes
Schedule schema changes like ALTER TABLE and OPTIMIZE TABLE during application maintenance windows when no concurrent write activity occurs. This minimizes race conditions and synchronization conflicts.
Use External Tools for Safer Online Schema Changes
Consider using external tools like pt-online-schema-change to modify table definitions without blocking concurrent changes. These tools enable you to make schema changes with minimal impact on availability and performance. Learn more about pt-online-schema-change. Disclaimer: The pt-online-schema-change tool is not managed or supported by Microsoft; use it at your discretion.
Mitigation plans
To address these risks, we’re actively working to integrate the necessary fixes to ensure a more robust and reliable experience for our customers.
New Servers Fully Secured by End of February 2025
All new Azure Database for MySQL Flexible Server instances created after March 1, 2025 will include the latest fixes, ensuring that schema changes are safeguarded against data loss and duplicate key risks.
Rollout for Existing Servers
For existing servers, we will roll out patches during upcoming maintenance windows by the end of Q1 of calendar year 2025. We recommend monitoring your Azure portal for scheduled maintenance windows and Release notes for announcements about critical updates and patches.
Priority updates available upon request
If you require an urgent update outside of the scheduled maintenance windows, you can contact Azure Support. Provide the necessary server details and an appropriate maintenance window, and our team will work with you to prioritize the patching process. Note that priority patching will be available by February 2025. We recommend monitoring Release notes for announcements about critical updates and patches.
Conclusion
Safely managing schema changes on MySQL servers requires understanding the risks associated with online DDL operations, such as potential data loss and duplicate keys. To help safeguard data integrity and maintain server stability, implement best practices, for example enabling the COPY algorithm, using offline operations if feasible, or scheduling changes during low activity periods. Fixes are expected by the end of February 2025, and new Azure Database for MySQL flexible servers will be fully protected against these bugs. We will apply updates to existing servers during maintenance windows in Q1 2025. Following the recommendations above will help ensure that you can confidently make schema changes while preserving the reliability and performance of your server.
1.1K Views, 0 likes, 6 Comments
Building a Restaurant Management System with Azure Database for MySQL
In this hands-on tutorial, we'll build a Restaurant Management System using Azure Database for MySQL. This project is perfect for beginners looking to understand cloud databases while creating something practical.
1.3K Views, 5 likes, 5 Comments
Always On - Synchronize SAP login, jobs and objects
SQL Server AlwaysOn is one of the High Availability solutions available for an SAP system. It consists of two or more computers, each hosting a SQL Server with a copy of the SAP database. A listener points to the actual primary copy and is used by the SAP system as the only connection point. For details on how to set up and configure an SAP system together with SQL Server AlwaysOn, see this blog post and its referenced blog posts. During the setup, the SAP system is configured from the current primary node, and all non-database-related objects such as SQL Server Agent jobs, logins, etc. are created only on the current primary database. In the case of an (automatic) failover to one of the secondary nodes of AlwaysOn, these objects are then missing. Jürgen introduced a script (sap_helprevlogin) in his initial blog post about the database load after setting up AlwaysOn. This script transfers only the logins, but falls short on transferring jobs, server-level permissions and other assignments. One of the SAP developers working in our team has built a comprehensive PowerShell script (sap_synchronize_always_on.ps1) to perform all these tasks and to transfer all the SAP objects from the initial installation to all the other nodes of the AlwaysOn system. The script connects to the primary instance, reads the configuration of the secondary nodes and then synchronizes the objects and jobs with these nodes. The script must be executed by a domain administrator who has SQL Server sysadmin privileges on all AlwaysOn instances.
The script uses up to three input variables:
The server name of the SQL Server instance or the listener name of the High-Availability group. The default is (local).
The name of the SAP database, which must be in a High-Availability group on the given server.
Single login (optional): Only one login gets copied, along with the SAP CCMS jobs owned by that login. By default, all logins mapped to the database are copied.
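To make the three inputs concrete, here is a minimal Python sketch of how they might be modeled. It is an illustration only, not the PowerShell script itself: the names SyncInputs and select_logins, and the login names in the usage lines, are hypothetical. It assumes login names are compared exactly, whereas a real SQL Server instance may compare them according to its collation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SyncInputs:
    """The three inputs described above (field names are illustrative)."""
    database: str                        # SAP database in the availability group
    server: str = "(local)"              # instance or listener name; default per the post
    single_login: Optional[str] = None   # when set, only this login is copied


def select_logins(mapped_logins: List[str], inputs: SyncInputs) -> List[str]:
    """Return the logins to copy: all mapped logins, or only the single one."""
    if inputs.single_login is None:
        return list(mapped_logins)
    # Exact string match is an assumption of this sketch.
    return [name for name in mapped_logins if name == inputs.single_login]


if __name__ == "__main__":
    mapped = ["prdadm", "SAPServicePRD"]  # hypothetical login names
    print(select_logins(mapped, SyncInputs(database="PRD")))
    print(select_logins(mapped, SyncInputs(database="PRD", single_login="prdadm")))
```

Keeping the default for the server and treating the single login as optional mirrors the description: omitting the optional input copies every login mapped to the database.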
The script performs the following steps:
Create a procedure CheckAccess in the master database (see this blog for details).
Discover which logins are mapped to the database.
Discover which SAP CCMS jobs belong to those logins.
If a job does not use CheckAccess, change the job step to use CheckAccess and run the job step in master.
Open a connection to each secondary and:
  Create the procedure CheckAccess in the master database.
  Create the logins, if they don't already exist, using the same sid.
  Create the jobs if they don't already exist. If a job exists and does not use CheckAccess, change the job step to use CheckAccess and run it in master.
If new SAP CCMS jobs are added because of remote monitoring from a different SAP system using DBACOCKPIT, the script can be re-executed. It will then copy only new objects which have not been copied before. You can find this useful script attached; it makes the synchronization of SAP systems in an AlwaysOn environment much easier. Please ensure that you test the execution in your test environment first, before you run it in production. Neither SAP nor Microsoft takes any responsibility for the use of this script; you run it at your own risk.
Update January 2017: New script version that copies the <sid>adm and SAPService<SID> logins from the SAP system as well.
Best regards | Bless!
Clas & Guðmundur
Availability Group Database Reports Not Synchronizing / Recovery Pending After Database Log File Inaccessible
First published on MSDN on Nov 29, 2017
You may find that one or more availability group databases are reported as ‘Not Synchronizing / Recovery Pending’ on the primary replica or ‘Not Synchronizing’ on one of the secondary replicas.
125K Views, 1 like, 0 Comments
Ignite 2022: Announcing new features in Azure Database for MySQL – Flexible Server
Today, we're pleased to announce a set of new and exciting features in Azure Database for MySQL - Flexible Server that further improve the service's availability, performance, security, management, and developer experiences!
6.9K Views, 3 likes, 0 Comments
SQL Server manages Preferred and Possible Owner Properties for AlwaysOn Availability Group/Role
First published on MSDN on Feb 28, 2014
As a clustered resource, the availability group clustered resource/role has configurable cluster properties, like possible owners and preferred owners.
20K Views, 0 likes, 0 Comments
AlwaysON - HADRON Learning Series: Worker Pool Usage for HADRON Enabled Databases
First published on MSDN on May 17, 2012
I am on several e-mail aliases related to Always On databases (reference Availability Group, AG, HADRON) and the question of worker thread usage is a hot topic this week.
2.4K Views, 0 likes, 0 Comments