
MCP-Driven Azure SRE for Databricks

varghesejoji
Microsoft
Feb 12, 2026

Automate Databricks compliance and incident response using the Model Context Protocol - an open standard for AI agents to call external tools and APIs.

Azure SRE Agent is an AI-powered operations assistant built for incident response and governance. MCP (Model Context Protocol) is the standard interface it uses to connect to external systems and tools.

Azure SRE Agent integrates with Azure Databricks through MCP to provide:

  1. Proactive Compliance - Automated best practice validation
  2. Reactive Troubleshooting - Root cause analysis and remediation for job failures

This post demonstrates both capabilities with real examples.

Architecture

The Azure SRE Agent orchestrates Ops Skills and Knowledge Base prompts, then calls the Databricks MCP server over HTTPS. The MCP server translates those requests into Databricks REST API calls, returns structured results, and the agent composes findings, evidence, and remediation. End-to-end, this yields a single loop: intent -> MCP tool calls -> Databricks state -> grounded response.
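To ground that loop, here is a minimal sketch of one tool on the server side, assuming a FastMCP-style app; the tool name, environment variable names, and API version are illustrative, not the repository's actual catalog:

```python
import os

import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("databricks-sre")

@mcp.tool()
def list_clusters() -> list[dict]:
    """Return cluster configurations for the agent to audit."""
    host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<id>.azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]  # PAT or Entra ID token (assumed env vars)
    resp = requests.get(
        f"{host}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # The structured JSON becomes the tool result the agent grounds its answer in.
    return resp.json().get("clusters", [])
```

Each tool stays a thin, deterministic wrapper over the REST API, so the agent's reasoning is always anchored to live workspace state rather than model memory.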

Deployment

The MCP server runs as a containerized FastMCP application on Azure Container Apps, fronted by HTTPS and configured with Databricks workspace connection settings. It exposes a tool catalog that the agent invokes through MCP, while the container handles authentication and REST API calls to Databricks.
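A hedged sketch of the container entrypoint, continuing the server above; exact host/port wiring varies by FastMCP version, and Container Apps ingress provides the HTTPS front end:

```python
# Entrypoint for the containerized FastMCP app. Azure Container Apps ingress
# terminates HTTPS and forwards traffic to the server's listening port.
if __name__ == "__main__":
    mcp.run(transport="streamable-http")  # matches the agent's MCP connector transport
```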

👉 For deployment instructions, see the GitHub repository.

Getting Started

Deploy the MCP Server: Follow the quickstart guide to deploy to Azure Container Apps (~30 min)

Configure Azure SRE Agent:

  • Create MCP connector with streamable-http transport
  • Upload Knowledge Base from Builder > Knowledge Base using the Best Practices doc: AZURE_DATABRICKS_BEST_PRACTICES.md
    • Benefit: Gives the agent authoritative compliance criteria and remediation commands.
  • Create Ops Skill from Builder > Subagent Builder > Create skill and drop the Ops Skill doc: DATABRICKS_OPS_RUNBOOK_SKILL.md
    • Benefit: Adds incident timelines, runbooks, and escalation triggers to responses.
  • Deploy the subagent YAML: Databricks_MCP_Agent.yaml
    • Benefit: Wires the MCP connector, Knowledge Base, and Ops Skill into one agent for proactive and reactive workflows.

Integrate with Alerting:

  • Connect PagerDuty/ServiceNow webhooks
  • Enable auto-remediation for common issues

Part 1: Proactive Compliance

Use Case: Best Practice Validation

Prompt:

@Databricks_MCP_Agent: Validate the Databricks workspace for best practices compliance and provide a summary, detailed findings, and concrete remediation steps.

What the Agent Does:

  1. Calls MCP tools to gather current state (see the sketch after this list):
    • list_clusters() - Audit compute configurations
    • list_catalogs() - Check Unity Catalog setup
    • list_jobs() - Review job configurations
    • execute_sql() - Query governance policies
  2. Cross-references findings against the Knowledge Base (best practices document)
  3. Generates a prioritized compliance report
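As a rough illustration of what those tool calls enable, here is a sketch of similar state gathering and rule checks using the official databricks-sdk; the three rules shown are examples, not the agent's actual Knowledge Base criteria:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # reads DATABRICKS_HOST / DATABRICKS_TOKEN from the environment

findings = []

# Compute: clusters without auto-termination burn cost and violate most baselines.
for cluster in w.clusters.list():
    if not cluster.autotermination_minutes:
        findings.append(f"Cluster '{cluster.cluster_name}': no auto-termination configured")

# Governance: a workspace with no Unity Catalog catalogs is likely ungoverned.
if not list(w.catalogs.list()):
    findings.append("Unity Catalog: no catalogs defined")

# Jobs: failures should notify someone.
for job in w.jobs.list():
    if job.settings and not job.settings.email_notifications:
        findings.append(f"Job '{job.settings.name}': no failure notifications configured")

for finding in findings:
    print("FINDING:", finding)
```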

Expected Output:

Benefits:

  • Time Savings: 5 minutes vs. 2-3 hours manual review
  • Consistency: Same validation criteria across all workspaces
  • Actionable: Specific remediation steps with code examples

Part 2: Reactive Incident Response

Example 1: Job Failure - Non-Zero Exit Code

Scenario: Job job_exceptioning_out fails repeatedly due to notebook code errors.

Prompt:

Agent Investigation - Calls MCP Tools (see the sketch after this list):

  • get_job() - Retrieves job definition
  • list_job_runs() - Gets recent run history (4 failed runs)
  • get_run_output() - Analyzes error logs
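The same investigation, sketched against the databricks-sdk; the job ID is hypothetical, and in practice the agent resolves it from the job name:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
JOB_ID = 123  # hypothetical; resolved from "job_exceptioning_out" in practice

job = w.jobs.get(job_id=JOB_ID)                                      # get_job()
runs = w.jobs.list_runs(job_id=JOB_ID, limit=5, expand_tasks=True)   # list_job_runs()

for run in runs:
    state = run.state.result_state.value if run.state and run.state.result_state else None
    if state == "FAILED" and run.tasks:
        # get_run_output() takes the task-level run ID
        output = w.jobs.get_run_output(run_id=run.tasks[0].run_id)
        print(run.run_id, output.error)  # surfaces e.g. the sys.exit(1) traceback
```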

Root Cause Analysis:

Expected Outcome:

  • Root Cause Identified: sys.exit(1) in notebook code
  • Evidence Provided: Job ID, run history, code excerpt, settings
  • Confidence: HIGH (explicit failing code present)
  • Remediation: Fix code + add retry policy (sketched below)
  • Resolution Time: 3-5 minutes (vs. 30-45 minutes manual investigation)
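A hedged sketch of the retry half of that remediation via the Jobs API (the job ID and values are illustrative; the code fix itself happens in the notebook, since a deliberate sys.exit(1) will fail on every retry until it is removed):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import JobSettings

w = WorkspaceClient()
JOB_ID = 123  # hypothetical

# Add a retry policy to the failing task so transient errors self-heal.
job = w.jobs.get(job_id=JOB_ID)
task = job.settings.tasks[0]
task.max_retries = 3                      # retry up to 3 times
task.min_retry_interval_millis = 60_000   # wait 1 minute between attempts

w.jobs.update(job_id=JOB_ID, new_settings=JobSettings(tasks=[task]))
```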

Example 2: Job Failure - Task Notebook Exception

Scenario: Job hourly-data-sync fails repeatedly due to exception in task notebook.

Prompt:

Agent Investigation - Calls MCP Tools (see the sketch after this list):

  • get_job() - Job definition and task configuration
  • list_job_runs() - Recent runs show "TERMINATED with TIMEOUT"
  • execute_sql() - Queries notebook metadata
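For illustration, an execute_sql() call plausibly maps to the SQL Statement Execution API; here is a sketch with databricks-sdk, where the warehouse ID is a placeholder and the system table and result-state value are assumptions that require system tables to be enabled:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

result = w.statement_execution.execute_statement(
    warehouse_id="<sql-warehouse-id>",  # placeholder
    statement="""
        SELECT job_id, run_id, result_state, period_end_time
        FROM system.lakeflow.job_run_timeline
        WHERE result_state = 'TIMED_OUT'
        ORDER BY period_end_time DESC
        LIMIT 10
    """,
)
rows = result.result.data_array if result.result else []
for row in rows or []:
    print(row)
```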

Root Cause Analysis:

Expected Outcome:

  • Root Cause Identified: Exception at line 7 - null partition detected
  • Evidence: Notebook path, code excerpt (lines 5-7), run history (7 consecutive failures)
  • Confidence: HIGH (explicit failing code + TIMEOUT/queue issues)
  • Remediation: Fix exception handling + add retry policy (sketched below)
  • Resolution Time: 5-8 minutes (vs. 45+ minutes manual log analysis)
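A hypothetical notebook-side fix for the failing check; the table and column names are invented for illustration, and `spark` is the notebook's built-in session:

```python
from pyspark.sql import functions as F

df = spark.table("main.sync.events")  # invented table name

# Previously the notebook raised at line 7 when it saw null partition values:
#     raise Exception("null partition detected")
# Quarantine the bad rows instead, then continue with the valid ones.
null_rows = df.filter(F.col("partition_date").isNull())
if null_rows.limit(1).count() > 0:
    null_rows.write.mode("append").saveAsTable("main.sync.events_quarantine")
    df = df.filter(F.col("partition_date").isNotNull())
```

Paired with the retry policy from Example 1, this turns a hard failure into a logged, recoverable condition.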

Key Benefits

Proactive Governance

  • ✅ Continuous compliance monitoring
  • ✅ Automated best practice validation
  • ✅ 95% reduction in manual review time

Reactive Incident Response

  • 🚨 Automated root cause analysis
  • ⚡ 80-95% reduction in MTTR
  • 🧠 Context-aware remediation recommendations
  • 📊 Evidence-based troubleshooting

Operational Impact

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Compliance review time | 2-3 hours | 5 minutes | 95% |
| Job failure investigation | 30-45 min | 3-8 min | 85% |
| On-call alerts requiring intervention | 4-6 per shift | 1-2 per shift | 70% |

Conclusion

Azure SRE Agent transforms Databricks operations by combining proactive governance with reactive troubleshooting. The MCP integration provides:

  • Comprehensive visibility into workspace health
  • Automated compliance monitoring and validation
  • Intelligent incident response with root cause analysis
  • Self-healing capabilities for common failures

Result: Teams spend less time firefighting and more time building.

Resources

Questions? Open an issue on GitHub or reach out to the Azure SRE team.

Updated Feb 12, 2026
Version 3.0