Cloud operations isn't just about keeping things running - it's about running them better.
The Cloud Optimization Challenge
Your cloud environment is always changing:
- New features ship weekly
- Traffic patterns shift seasonally
- Costs creep up quietly
- Security best practices evolve
- Teams spin up resources and forget them
It's Monday morning. You open the Azure portal. Everything looks... fine. But "fine" isn't great. That VM has been at 8% CPU for weeks. A Key Vault secret expires in 12 days.
Nothing's broken. But security is drifting, costs are creeping, and capacity gaps are growing silently.
The question isn't "is something broken?" It's "could this be better?"
Four Pillars of Cloud Optimization
| Pillar | What Teams Want | The Challenge |
|---|---|---|
| Security | Stay compliant, reduce risk | Config drift, legacy settings, expiring creds |
| Cost | Spend efficiently, justify budget | Hard to spot waste across hundreds of resources |
| Performance | Meet SLOs, handle growth | Know when to scale before demand hits |
| Availability | Maximize uptime, build resilience | Hidden dependencies, single points of failure |
Most teams check these sometimes. SRE Agent checks them continuously.
Enter SRE Agent + Scheduled Tasks
SRE Agent can pull data from Azure Monitor, resource configurations, metrics, logs, traces, errors, and cost data, then analyze it on a schedule. If you use tools outside Azure (Datadog, PagerDuty, Splunk), you can connect those via MCP servers so the agent sees your full observability stack.
My setup uses Azure-native sources. Here's how I wired it up.
How I Set It Up: Step by Step
Step 1: Create SRE Agent with Subscription Access
I created an SRE Agent without attaching it to any specific resource group. Instead, I gave it Reader access at the subscription level. This lets the agent scan across all my resource groups for optimization opportunities.
No resource group configuration needed. The agent builds a knowledge graph of everything across the subscription: VMs, storage accounts, Key Vaults, NSGs, and web apps.
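For reference, subscription-wide Reader access of this kind is a role assignment scoped to the subscription itself. The sketch below only builds the Azure CLI arguments and does not run them. The subscription ID and principal ID are placeholders, and it assumes the role is assigned directly to the agent's identity, which the setup above does not spell out.

```python
def reader_assignment_command(subscription_id: str, principal_id: str) -> list[str]:
    """Build an `az role assignment create` call granting Reader on a whole subscription."""
    # A subscription-level scope has no resource group segment.
    scope = f"/subscriptions/{subscription_id}"
    return [
        "az", "role", "assignment", "create",
        "--assignee", principal_id,   # placeholder: the agent identity's object ID
        "--role", "Reader",
        "--scope", scope,
    ]


cmd = reader_assignment_command(
    "00000000-0000-0000-0000-000000000000",  # placeholder subscription ID
    "<agent-principal-id>",                  # placeholder principal ID
)
print(" ".join(cmd))
```

Scoping at the subscription rather than per resource group is what lets a single assignment cover every resource group, including ones created later.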
Step 2: Create and Upload My Organization Practices
I created an org-practices.md file that defines what "good" looks like for my team.
I uploaded this to SRE Agent's knowledge base. Now the agent knows our bar, not just Azure defaults.
👉 See my full org-practices.md
Source repos for this demo:
- security-demoapp - App with intentional security misconfigurations
- costoptimizationapp - App with cost optimization opportunities
Step 3: Connect to Teams Channel
I connected SRE Agent to my team's Teams channel so findings land where we already work. Critical findings get immediate notifications. Warnings go into a daily digest. No more logging into separate dashboards. The insights come to us.
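The routing rule described here (critical findings notify immediately, everything else waits for the daily digest) can be pictured with a small sketch. The finding structure, the severity labels, and the channel names are assumptions made for illustration; they are not the agent's actual data model.

```python
from collections import defaultdict


def route_findings(findings):
    """Split findings into immediate alerts and a daily digest by severity."""
    routed = defaultdict(list)
    for finding in findings:
        # Assumption: only "critical" interrupts the team; the rest is batched.
        channel = "immediate" if finding["severity"] == "critical" else "daily_digest"
        routed[channel].append(finding["title"])
    return dict(routed)


# Hypothetical findings; the severities are illustrative.
sample = [
    {"severity": "critical", "title": "Key Vault secret expires in 12 days"},
    {"severity": "warning", "title": "VM at 8% CPU for weeks"},
]
routed = route_findings(sample)
print(routed)
```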
Step 4: Connect Resource Groups to GitHub Repos
I added the two resource groups to the SRE Agent and linked the apps to their corresponding GitHub repos:
| Resource Group | GitHub Repository |
|---|---|
| rg-security-opt-demo | security-demoapp |
| rg-cost-opt-sreademo | costoptimizationapp |
This enables the agent to create GitHub issues for findings, linking violations directly to the repo responsible for that infrastructure.
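To make the mapping concrete, here is a rough sketch of how a finding in one resource group could be turned into an issue request for the linked repo. It uses the public GitHub REST endpoint for creating issues and only builds the request without sending it. How the agent files issues internally is not described in this post, and the `severity:` label format and the finding text are assumptions.

```python
# Resource group to "owner/repo", taken from the table above and the repo URLs in this post.
REPO_BY_RESOURCE_GROUP = {
    "rg-security-opt-demo": "dm-chelupati/security-demoapp",
    "rg-cost-opt-sreademo": "dm-chelupati/costoptimizationapp",
}


def issue_request(resource_group: str, title: str, body: str, severity: str):
    """Return the GitHub REST URL and JSON payload for filing one finding as an issue."""
    repo = REPO_BY_RESOURCE_GROUP[resource_group]
    url = f"https://api.github.com/repos/{repo}/issues"  # used with POST
    payload = {
        "title": title,
        "body": body,
        "labels": [f"severity:{severity}"],  # assumed label convention
    }
    return url, payload


url, payload = issue_request(
    "rg-security-opt-demo",
    "Key Vault secret expires in 12 days",
    "Hypothetical finding body: root cause, impact, and fix would go here.",
    "critical",
)
print(url)
print(payload["labels"])
```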
Step 5: Test with Prompts
Before setting up automation, I tested the agent with manual prompts to make sure it was finding the right issues. The agent ran the checks, compared against my org-practices.md, and identified the issues.
Security Check:
Scan resource group "rg-security-opt-demo" for any violations of our security practices defined in org-practices.md in your knowledge base. List violations with severity and remediation steps. Make sure to check against all critical requirements, send a message in the Teams channel with your findings, and create an issue in the GitHub repo https://github.com/dm-chelupati/security-demoapp.git
Cost Check:
Scan resource group "rg-cost-opt-sreademo" for any violations of our cost practices defined in org-practices.md in your knowledge base. List violations with severity and remediation steps. Make sure to check against all critical requirements, send a message in the Teams channel with your findings, and create an issue in the GitHub repo https://github.com/dm-chelupati/costoptimizationapp.git
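The two prompts differ only in the resource group, the practice area, and the repo URL, so they can be generated from one template. This is a convenience sketch for keeping the prompts consistent, not something the agent requires. The template wording is adapted from the prompts above, and the `CHECKS` list and function name are made up for illustration.

```python
PROMPT_TEMPLATE = (
    'Scan resource group "{resource_group}" for any violations of our {area} practices '
    "defined in org-practices.md in your knowledge base. List violations with severity "
    "and remediation steps. Make sure to check against all critical requirements, send a "
    "message in the Teams channel with your findings, and create an issue in the GitHub "
    "repo {repo_url}"
)

# (resource group, practice area, repo URL) for each check in this demo.
CHECKS = [
    ("rg-security-opt-demo", "security", "https://github.com/dm-chelupati/security-demoapp.git"),
    ("rg-cost-opt-sreademo", "cost", "https://github.com/dm-chelupati/costoptimizationapp.git"),
]


def build_prompts():
    """Fill the shared template once per configured check."""
    return [
        PROMPT_TEMPLATE.format(resource_group=rg, area=area, repo_url=url)
        for rg, area, url in CHECKS
    ]


prompts = build_prompts()
for prompt in prompts:
    print(prompt, end="\n\n")
```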
Step 6: Check Output via GitHub Issues
After running the prompts, I checked GitHub. The agent had created issues. Each issue has the root cause, impact, and fix, ready for the team to act on or for Coding Agent to pick up and turn into a PR.
👉 See the actual issues created:
Step 7: Set Up Scheduled Triggers
This is where it gets powerful. I configured recurring schedules:
Weekly Security Check (Wednesdays 8 AM):
Create a scheduled trigger that performs security practices checks against the org practices in knowledge base org-practices.md, creates a GitHub issue, and sends a Teams message on a weekly basis, Wednesdays at 8 AM UTC
Weekly Cost Review (Mondays 8 AM):
Create a scheduled trigger that performs cost practices checks against the org practices in knowledge base org-practices.md, creates a GitHub issue, and sends a Teams message on a weekly basis, Mondays at 8 AM UTC
Now optimization runs automatically. Every week, fresh findings land in GitHub Issues and Teams.
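To sanity-check what those two schedules mean in practice, the sketch below computes the next run time for each in UTC. The triggers themselves are created through the natural-language prompts above; this code is only an illustration, and the schedule names are invented. The cron strings in the comments are the standard five-field equivalents, not a format the agent is known to accept.

```python
from datetime import datetime, timedelta, timezone

# (weekday, hour) in UTC; datetime.weekday() uses Monday=0 ... Sunday=6.
SCHEDULES = {
    "cost-review": (0, 8),     # Mondays 08:00 UTC, cron: 0 8 * * 1
    "security-check": (2, 8),  # Wednesdays 08:00 UTC, cron: 0 8 * * 3
}


def next_run(now: datetime, weekday: int, hour: int) -> datetime:
    """Return the next occurrence of `weekday` at `hour`:00 strictly after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # This week's slot has already passed, so move to next week.
        candidate += timedelta(days=7)
    return candidate


now = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)  # a Monday, 09:00 UTC
for name, (weekday, hour) in SCHEDULES.items():
    print(name, next_run(now, weekday, hour).isoformat())
```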
Why Context Makes the SRE Agent Powerful
Think about hiring a new SRE. They're excellent at their craft—they know Kubernetes, networking, Azure inside out. But on day one, they can't solve problems in your environment yet. Why? They don't have context:
- What are your SLOs? What's "acceptable" latency for your app?
- When do you rotate secrets? Monthly? Quarterly? Before each release?
- Which resources are production-critical vs. dev experiments?
- What's your tagging policy? Who owns what?
- How do you deploy? GitOps? Pipelines? Manual approvals?
A great engineer becomes your great engineer once they learn how your team operates.
SRE Agent works the same way.
Out of the box, it knows Azure resource types, networking, and best practices. But it doesn't know your bar. Is 20% CPU utilization acceptable or wasteful? Should secrets expire in 30 days or 90? Are public endpoints ever okay, or never?
The more context you give the agent (your SLOs, your runbooks, your policies), the more it reasons like a team member who understands your environment, not just Azure in general.
That's why Step 2 matters so much. When I uploaded our standards, the agent stopped checking generic Azure best practices and started checking our best practices.
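A small sketch shows why the same resources can pass a generic check and fail a team-specific one. Every threshold below is a hypothetical value chosen for illustration; none is an Azure default or a rule from my org-practices.md. The resource fields are also invented, and the agent's real evaluation is far richer than two comparisons.

```python
# Hypothetical thresholds, for illustration only.
GENERIC_DEFAULTS = {"min_cpu_percent": 5, "secret_warning_days": 7}
ORG_PRACTICES = {"min_cpu_percent": 20, "secret_warning_days": 30}


def evaluate(resource: dict, practices: dict) -> list[str]:
    """Return findings for one resource under a given set of thresholds."""
    findings = []
    cpu = resource.get("avg_cpu_percent")
    if cpu is not None and cpu < practices["min_cpu_percent"]:
        findings.append(
            f"{resource['name']}: average CPU {cpu}% is below {practices['min_cpu_percent']}%"
        )
    days = resource.get("secret_days_to_expiry")
    if days is not None and days <= practices["secret_warning_days"]:
        findings.append(f"{resource['name']}: secret expires in {days} days")
    return findings


# Invented resources mirroring the Monday-morning example.
vm = {"name": "vm-demo", "avg_cpu_percent": 8}
vault = {"name": "kv-demo", "secret_days_to_expiry": 12}

for label, practices in [("generic", GENERIC_DEFAULTS), ("org", ORG_PRACTICES)]:
    print(label, evaluate(vm, practices) + evaluate(vault, practices))
```

Under the generic thresholds both resources look fine; under the stricter team thresholds both are flagged, which is the shift from "fine" to "could be better."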
Bring your existing knowledge: You don't have to start from scratch. If your team's documentation already lives in Atlassian Confluence, SharePoint, or other tools, you can connect those via MCP servers. The agent pulls context from where your team already works, so there's no need to duplicate content.
Why This Matters
Before this setup, optimization was a quarterly thing. Now it happens automatically:
| Before | After |
|---|---|
| Check security when audit requests it | Daily automated posture check |
| Find waste when finance complains | Weekly savings report in Teams |
| Discover capacity issues during incidents | Scheduled headroom analysis |
| Expire credentials and debug at 2 AM | 30-day warning with exact secret names |
Optimization isn't a project anymore. It's a practice.
Try It Yourself
- Create an SRE Agent with access to your subscription
- Upload your team's standards (security policies, cost thresholds, tagging rules)
- Set up a scheduled trigger, starting with a daily security check
- Watch the first report land in Teams
See what you've been missing while everything looked "fine."
Learn More
- Azure SRE Agent documentation
- Azure SRE Agent blogs
- Azure SRE Agent community
- Azure SRE Agent home page
- Azure SRE Agent pricing
Azure SRE Agent is currently in preview.