Automate Databricks compliance and incident response using the Model Context Protocol - an open standard for AI agents to call external tools and APIs.
Azure SRE Agent is an AI-powered operations assistant built for incident response and governance. MCP (Model Context Protocol) is the standard interface it uses to connect to external systems and tools.
Azure SRE Agent integrates with Azure Databricks through the Model Context Protocol (MCP) to provide:
- Proactive Compliance - Automated best practice validation
- Reactive Troubleshooting - Root cause analysis and remediation for job failures
This post demonstrates both capabilities with real examples.
## Architecture
The Azure SRE Agent orchestrates Ops Skills and Knowledge Base prompts, then calls the Databricks MCP server over HTTPS. The MCP server translates those requests into Databricks REST API calls, returns structured results, and the agent composes findings, evidence, and remediation. End-to-end, this yields a single loop: intent -> MCP tool calls -> Databricks state -> grounded response.
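The loop above can be sketched in miniature. Everything here is illustrative: the tool names mirror the Databricks MCP tools discussed later, but the client is a stub, not the actual SRE Agent internals or MCP transport.

```python
# Illustrative sketch of the agent loop: intent -> MCP tool calls -> grounded response.
# call_tool() stands in for an MCP invocation over streamable HTTP; the state is faked.

def call_tool(name: str, **params) -> dict:
    """Stand-in for an MCP tool invocation against the Databricks MCP server."""
    fake_state = {
        "list_clusters": {"clusters": [{"cluster_id": "c1", "autotermination_minutes": 0}]},
        "list_job_runs": {"runs": [{"run_id": 1, "state": "FAILED"}]},
    }
    return fake_state.get(name, {})

def answer(intent: str) -> str:
    """Route an intent to tool calls and compose an evidence-backed response."""
    if "compliance" in intent:
        clusters = call_tool("list_clusters")["clusters"]
        flagged = [c["cluster_id"] for c in clusters if c["autotermination_minutes"] == 0]
        return f"Clusters missing auto-termination: {flagged}"
    runs = call_tool("list_job_runs")["runs"]
    failed = [r["run_id"] for r in runs if r["state"] == "FAILED"]
    return f"Failed runs: {failed}"

print(answer("check workspace compliance"))
```

The point is the shape of the loop: the agent never guesses at workspace state; every claim in its response traces back to a tool result.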
## Deployment
The MCP server runs as a containerized FastMCP application on Azure Container Apps, fronted by HTTPS and configured with Databricks workspace connection settings. It exposes a tool catalog that the agent invokes through MCP, while the container handles authentication and REST API calls to Databricks.
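To make the "translates requests into Databricks REST API calls" step concrete, here is a hedged sketch of what one tool body might build under the hood. The endpoint path follows the public Databricks REST API; the workspace host and token are placeholders, and the request is constructed but deliberately never sent.

```python
# Sketch: the REST request behind a hypothetical list_jobs MCP tool.
# Host and token are placeholders; endpoint path follows the Databricks Jobs API 2.1.
import urllib.request

DATABRICKS_HOST = "https://adb-example.azuredatabricks.net"  # placeholder workspace URL

def build_list_jobs_request(token: str, limit: int = 25) -> urllib.request.Request:
    """Construct (but do not send) the REST call a list_jobs tool would make."""
    url = f"{DATABRICKS_HOST}/api/2.1/jobs/list?limit={limit}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )

req = build_list_jobs_request("dapi-placeholder")
print(req.full_url)
```

In the real container, the response JSON would be reshaped into a structured MCP tool result before being returned to the agent.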
For deployment instructions, see the GitHub repository.
## Getting Started
1. Deploy the MCP server: follow the quickstart guide to deploy to Azure Container Apps (~30 min).
2. Configure Azure SRE Agent:
   - Create an MCP connector with the streamable-http transport.
   - Upload the Knowledge Base from Builder > Knowledge Base using the best practices doc: AZURE_DATABRICKS_BEST_PRACTICES.md
     - Benefit: gives the agent authoritative compliance criteria and remediation commands.
   - Create an Ops Skill from Builder > Subagent Builder > Create skill and add the Ops Skill doc: DATABRICKS_OPS_RUNBOOK_SKILL.md
     - Benefit: adds incident timelines, runbooks, and escalation triggers to responses.
   - Deploy the subagent YAML: Databricks_MCP_Agent.yaml
     - Benefit: wires the MCP connector, Knowledge Base, and Ops Skill into one agent for proactive and reactive workflows.
3. Integrate with alerting:
   - Connect PagerDuty/ServiceNow webhooks.
   - Enable auto-remediation for common issues.
## Part 1: Proactive Compliance
### Use Case: Best Practice Validation
Prompt:

```text
@Databricks_MCP_Agent: Validate the Databricks workspace for best practices compliance and provide a summary, detailed findings, and concrete remediation steps.
```
What the Agent Does:
Calls MCP tools to gather current state:
- list_clusters() - Audit compute configurations
- list_catalogs() - Check Unity Catalog setup
- list_jobs() - Review job configurations
- execute_sql() - Query governance policies
Cross-references findings with Knowledge Base (best practices document)
Generates prioritized compliance report
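The cross-referencing and prioritization steps can be sketched as follows. The specific rules and config fields are illustrative examples of common Databricks checks, not the agent's actual Knowledge Base criteria.

```python
# Sketch: cross-reference gathered cluster state against best-practice rules
# and emit a prioritized findings list. Rules and fields are illustrative.
def check_clusters(clusters: list[dict]) -> list[dict]:
    findings = []
    for c in clusters:
        if c.get("autotermination_minutes", 0) == 0:
            findings.append({"severity": "HIGH", "cluster": c["cluster_id"],
                             "issue": "auto-termination disabled",
                             "fix": "set autotermination_minutes (e.g. 30)"})
        if not c.get("data_security_mode"):
            findings.append({"severity": "MEDIUM", "cluster": c["cluster_id"],
                             "issue": "no Unity Catalog access mode set",
                             "fix": "set data_security_mode"})
    # Prioritized report: HIGH findings surface before MEDIUM and LOW.
    order = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}
    return sorted(findings, key=lambda f: order[f["severity"]])

report = check_clusters([{"cluster_id": "c1", "autotermination_minutes": 0,
                          "data_security_mode": "SINGLE_USER"}])
```

Each finding carries its evidence (the cluster it came from) and a concrete fix, which is what makes the resulting report actionable rather than a bare pass/fail score.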
Expected Output: a prioritized compliance report covering a summary, detailed findings, and concrete remediation steps, as requested in the prompt.
Benefits:
- Time Savings: 5 minutes vs. 2-3 hours manual review
- Consistency: Same validation criteria across all workspaces
- Actionable: Specific remediation steps with code examples
## Part 2: Reactive Incident Response
### Example 1: Job Failure - Non-Zero Exit Code
Scenario: Job job_exceptioning_out fails repeatedly due to notebook code errors.
Prompt:
Agent Investigation - Calls MCP Tools:
- get_job() - Retrieves job definition
- list_job_runs() - Gets recent run history (4 failed runs)
- get_run_output() - Analyzes error logs
Root Cause Analysis:
Expected Outcome:
- Root Cause Identified: sys.exit(1) in notebook code
- Evidence Provided: Job ID, run history, code excerpt, settings
- Confidence: HIGH (explicit failing code present)
- Remediation: Fix code + add retry policy
- Resolution Time: 3-5 minutes (vs. 30-45 minutes manual investigation)
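The triage logic for this example can be sketched as a classifier over the run output. The matched patterns are illustrative; real run output from get_run_output() carries richer metadata than a single error string.

```python
# Sketch: classify a failed run from its error output (Example 1).
# Pattern matching on sys.exit/SystemExit is illustrative triage logic.
def classify_failure(error_text: str) -> dict:
    if "sys.exit" in error_text or "SystemExit" in error_text:
        return {"root_cause": "explicit non-zero exit in notebook code",
                "confidence": "HIGH",
                "remediation": "fix the failing code path and add a retry policy"}
    return {"root_cause": "unknown", "confidence": "LOW",
            "remediation": "escalate for manual review"}

result = classify_failure("Notebook exited: SystemExit: 1")
```

The HIGH/LOW confidence split mirrors the outcome above: confidence is only HIGH when the failing code is explicitly present in the evidence.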
### Example 2: Job Failure - Task Notebook Exception
Scenario: Job hourly-data-sync fails repeatedly due to exception in task notebook.
Prompt:
Agent Investigation - Calls MCP Tools:
- get_job() - Job definition and task configuration
- list_job_runs() - Recent runs show "TERMINATED with TIMEOUT"
- execute_sql() - Queries notebook metadata
Root Cause Analysis:
Expected Outcome:
- Root Cause Identified: Exception at line 7 - null partition detected
- Evidence: Notebook path, code excerpt (lines 5-7), run history (7 consecutive failures)
- Confidence: HIGH (explicit failing code + TIMEOUT/queue issues)
- Remediation: Fix exception handling + add retry policy
- Resolution Time: 5-8 minutes (vs. 45+ minutes manual log analysis)
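Detecting the "7 consecutive failures" signal from run history is simple streak logic, sketched below. The run records are simplified stand-ins for what list_job_runs() returns, with result states reduced to a single field.

```python
# Sketch for Example 2: measure the current consecutive-failure streak in run
# history (newest run first). Run records are simplified stand-ins.
def failure_streak(runs: list[dict]) -> int:
    """Length of the most recent run of consecutive failed/timed-out runs."""
    streak = 0
    for r in runs:  # assumed ordered newest -> oldest
        if r["result_state"] in {"FAILED", "TIMEDOUT"}:
            streak += 1
        else:
            break
    return streak

runs = [{"result_state": "TIMEDOUT"}] * 7 + [{"result_state": "SUCCESS"}]
print(failure_streak(runs))
```

A streak threshold like this is a natural escalation trigger for the Ops Skill: one timeout may be transient, but seven in a row points at the code or the data, not the infrastructure.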
## Key Benefits
### Proactive Governance
- Continuous compliance monitoring
- Automated best practice validation
- 95% reduction in manual review time
### Reactive Incident Response
- Automated root cause analysis
- 80-95% reduction in MTTR
- Context-aware remediation recommendations
- Evidence-based troubleshooting
### Operational Impact
| Metric | Before | After | Improvement |
|---|---|---|---|
| Compliance review time | 2-3 hours | 5 minutes | 95% |
| Job failure investigation | 30-45 min | 3-8 min | 85% |
| On-call alerts requiring intervention | 4-6 per shift | 1-2 per shift | 70% |
## Conclusion
Azure SRE Agent transforms Databricks operations by combining proactive governance with reactive troubleshooting. The MCP integration provides:
- Comprehensive visibility into workspace health
- Automated compliance monitoring and validation
- Intelligent incident response with root cause analysis
- Self-healing capabilities for common failures
Result: Teams spend less time firefighting and more time building.
## Resources
- Deployment Guide
- Subagent Configuration
- Best Practices Document
- Ops Skill Runbook
- Validation Script
- Azure SRE Agent Documentation
- Azure SRE Agent Blogs
- MCP Specification
Questions? Open an issue on GitHub or reach out to the Azure SRE team.