
MCP-Driven Azure SRE for Databricks

varghesejoji
Microsoft
Feb 12, 2026

Automate Databricks compliance and incident response using the Model Context Protocol - an open standard for AI agents to call external tools and APIs.

Azure SRE Agent is an AI-powered operations assistant built for incident response and governance. MCP (Model Context Protocol) is the standard interface it uses to connect to external systems and tools.

Azure SRE Agent integrates with Azure Databricks through MCP to provide:

  1. Proactive Compliance - Automated best practice validation
  2. Reactive Troubleshooting - Root cause analysis and remediation for job failures

This post demonstrates both capabilities with real examples.

Architecture

The Azure SRE Agent orchestrates Ops Skills and Knowledge Base prompts, then calls the Databricks MCP server over HTTPS. The MCP server translates those requests into Databricks REST API calls, returns structured results, and the agent composes findings, evidence, and remediation. End-to-end, this yields a single loop: intent -> MCP tool calls -> Databricks state -> grounded response.
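To ground that loop, here is a minimal sketch of one tool on the server side, assuming a FastMCP-style app; the tool name, environment variable names, and API version are illustrative, not the repository's actual catalog:

```python
import os

import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("databricks-sre")

@mcp.tool()
def list_clusters() -> list[dict]:
    """Return cluster configurations for the agent to audit."""
    host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<id>.azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]  # PAT or Entra ID token (assumed env vars)
    resp = requests.get(
        f"{host}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # The structured JSON becomes the tool result the agent grounds its answer in.
    return resp.json().get("clusters", [])
```

Each tool stays a thin, deterministic wrapper over the REST API, so the agent's reasoning is always anchored to live workspace state rather than model memory.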

Deployment

The MCP server runs as a containerized FastMCP application on Azure Container Apps, fronted by HTTPS and configured with Databricks workspace connection settings. It exposes a tool catalog that the agent invokes through MCP, while the container handles authentication and REST API calls to Databricks.
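A hedged sketch of the container entrypoint, continuing the server above; exact host/port wiring varies by FastMCP version, and Container Apps ingress provides the HTTPS front end:

```python
# Entrypoint for the containerized FastMCP app. Azure Container Apps ingress
# terminates HTTPS and forwards traffic to the server's listening port.
if __name__ == "__main__":
    mcp.run(transport="streamable-http")  # matches the agent's MCP connector transport
```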

👉 For deployment instructions, see the GitHub repository.

Getting Started

Deploy the MCP Server: Follow the quickstart guide to deploy to Azure Container Apps (~30 min)

Configure Azure SRE Agent:

  • Create MCP connector with streamable-http transport
  • Upload Knowledge Base from Builder > Knowledge Base using the Best Practices doc: AZURE_DATABRICKS_BEST_PRACTICES.md
    • Benefit: Gives the agent authoritative compliance criteria and remediation commands.
  • Create Ops Skill from Builder > Subagent Builder > Create skill and drop the Ops Skill doc: DATABRICKS_OPS_RUNBOOK_SKILL.md
    • Benefit: Adds incident timelines, runbooks, and escalation triggers to responses.
  • Deploy the subagent YAML: Databricks_MCP_Agent.yaml
    • Benefit: Wires the MCP connector, Knowledge Base, and Ops Skill into one agent for proactive and reactive workflows.

Integrate with Alerting:

  • Connect PagerDuty/ServiceNow webhooks
  • Enable auto-remediation for common issues

Part 1: Proactive Compliance

Use Case: Best Practice Validation

Prompt:

@Databricks_MCP_Agent: Validate the Databricks workspace for best practices compliance and provide a summary, detailed findings, and concrete remediation steps.

What the Agent Does:

  1. Calls MCP tools to gather current state (see the sketch after this list):
    • list_clusters() - Audit compute configurations
    • list_catalogs() - Check Unity Catalog setup
    • list_jobs() - Review job configurations
    • execute_sql() - Query governance policies
  2. Cross-references findings against the Knowledge Base (best practices document)
  3. Generates a prioritized compliance report
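As a rough illustration of what those tool calls enable, here is a sketch of similar state gathering and rule checks using the official databricks-sdk; the three rules shown are examples, not the agent's actual Knowledge Base criteria:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads DATABRICKS_HOST / DATABRICKS_TOKEN from the environment

findings = []

# Compute: clusters without auto-termination burn cost and violate most baselines.
for cluster in w.clusters.list():
    if not cluster.autotermination_minutes:
        findings.append(f"Cluster '{cluster.cluster_name}': no auto-termination configured")

# Governance: a workspace with no Unity Catalog catalogs is likely ungoverned.
if not list(w.catalogs.list()):
    findings.append("Unity Catalog: no catalogs defined")

# Jobs: failures should notify someone.
for job in w.jobs.list():
    if job.settings and not job.settings.email_notifications:
        findings.append(f"Job '{job.settings.name}': no failure notifications configured")

for finding in findings:
    print("FINDING:", finding)
```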

Expected Output:

Benefits:

  • Time Savings: 5 minutes vs. 2-3 hours manual review
  • Consistency: Same validation criteria across all workspaces
  • Actionable: Specific remediation steps with code examples

Part 2: Reactive Incident Response

Example 1: Job Failure - Non-Zero Exit Code

Scenario: Job job_exceptioning_out fails repeatedly due to notebook code errors.

Prompt:

Agent Investigation - Calls MCP Tools (see the sketch after this list):

  • get_job() - Retrieves job definition
  • list_job_runs() - Gets recent run history (4 failed runs)
  • get_run_output() - Analyzes error logs
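The same investigation, sketched against the databricks-sdk; the job ID is hypothetical, and in practice the agent resolves it from the job name:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
JOB_ID = 123  # hypothetical; resolved from "job_exceptioning_out" in practice

job = w.jobs.get(job_id=JOB_ID)                                      # get_job()
runs = w.jobs.list_runs(job_id=JOB_ID, limit=5, expand_tasks=True)   # list_job_runs()

for run in runs:
    state = run.state.result_state.value if run.state and run.state.result_state else None
    if state == "FAILED" and run.tasks:
        # get_run_output() takes the task-level run ID
        output = w.jobs.get_run_output(run_id=run.tasks[0].run_id)
        print(run.run_id, output.error)  # surfaces e.g. the sys.exit(1) traceback
```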

Root Cause Analysis:

Expected Outcome:

  • Root Cause Identified: sys.exit(1) in notebook code
  • Evidence Provided: Job ID, run history, code excerpt, settings
  • Confidence: HIGH (explicit failing code present)
  • Remediation: Fix code + add retry policy (sketched below)
  • Resolution Time: 3-5 minutes (vs. 30-45 minutes manual investigation)
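A hedged sketch of the retry half of that remediation via the Jobs API (the job ID and values are illustrative; the code fix itself happens in the notebook, since a deliberate sys.exit(1) will fail on every retry until it is removed):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import JobSettings

w = WorkspaceClient()
JOB_ID = 123  # hypothetical

# Add a retry policy to the failing task so transient errors self-heal.
job = w.jobs.get(job_id=JOB_ID)
task = job.settings.tasks[0]
task.max_retries = 3                      # retry up to 3 times
task.min_retry_interval_millis = 60_000   # wait 1 minute between attempts

w.jobs.update(job_id=JOB_ID, new_settings=JobSettings(tasks=[task]))
```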

Example 2: Job Failure - Task Notebook Exception

Scenario: Job hourly-data-sync fails repeatedly due to exception in task notebook.

Prompt:

Agent Investigation - Calls MCP Tools (see the sketch after this list):

  • get_job() - Job definition and task configuration
  • list_job_runs() - Recent runs show "TERMINATED with TIMEOUT"
  • execute_sql() - Queries notebook metadata
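For illustration, an execute_sql() call plausibly maps to the SQL Statement Execution API; here is a sketch with databricks-sdk, where the warehouse ID is a placeholder and the system table and result-state value are assumptions that require system tables to be enabled:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

result = w.statement_execution.execute_statement(
    warehouse_id="<sql-warehouse-id>",  # placeholder
    statement="""
        SELECT job_id, run_id, result_state, period_end_time
        FROM system.lakeflow.job_run_timeline
        WHERE result_state = 'TIMED_OUT'
        ORDER BY period_end_time DESC
        LIMIT 10
    """,
)
rows = result.result.data_array if result.result else []
for row in rows or []:
    print(row)
```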

Root Cause Analysis:

Expected Outcome:

  • Root Cause Identified: Exception at line 7 - null partition detected
  • Evidence: Notebook path, code excerpt (lines 5-7), run history (7 consecutive failures)
  • Confidence: HIGH (explicit failing code + TIMEOUT/queue issues)
  • Remediation: Fix exception handling + add retry policy (sketched below)
  • Resolution Time: 5-8 minutes (vs. 45+ minutes manual log analysis)
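A hypothetical notebook-side fix for the failing check; the table and column names are invented for illustration, and `spark` is the notebook's built-in session:

```python
from pyspark.sql import functions as F

df = spark.table("main.sync.events")  # invented table name

# Previously the notebook raised at line 7 when it saw null partition values:
#     raise Exception("null partition detected")
# Quarantine the bad rows instead, then continue with the valid ones.
null_rows = df.filter(F.col("partition_date").isNull())
if null_rows.limit(1).count() > 0:
    null_rows.write.mode("append").saveAsTable("main.sync.events_quarantine")
    df = df.filter(F.col("partition_date").isNotNull())
```

Paired with the retry policy from Example 1, this turns a hard failure into a logged, recoverable condition.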

Key Benefits

Proactive Governance

  • ✅ Continuous compliance monitoring
  • ✅ Automated best practice validation
  • ✅ 95% reduction in manual review time

Reactive Incident Response

  • 🚨 Automated root cause analysis
  • ⚡ 80-95% reduction in MTTR
  • 🧠 Context-aware remediation recommendations
  • 📊 Evidence-based troubleshooting

Operational Impact

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Compliance review time | 2-3 hours | 5 minutes | 95% |
| Job failure investigation | 30-45 min | 3-8 min | 85% |
| On-call alerts requiring intervention | 4-6 per shift | 1-2 per shift | 70% |

Conclusion

Azure SRE Agent transforms Databricks operations by combining proactive governance with reactive troubleshooting. The MCP integration provides:

  • Comprehensive visibility into workspace health
  • Automated compliance monitoring and validation
  • Intelligent incident response with root cause analysis
  • Self-healing capabilities for common failures

Result: Teams spend less time firefighting and more time building.

Resources

Questions? Open an issue on GitHub or reach out to the Azure SRE team.

Updated Feb 12, 2026
Version 3.0