Apps on Azure Blog

Proactive Cloud Ops with SRE Agent: Scheduled Checks for Cloud Optimization

dchelupati

Microsoft

Jan 19, 2026

Cloud operations isn't just about keeping things running - it's about running them better.

The Cloud Optimization Challenge

Your cloud environment is always changing:

New features ship weekly
Traffic patterns shift seasonally
Costs creep up quietly
Security best practices evolve
Teams spin up resources and forget them

It's Monday morning. You open the Azure portal. Everything looks... fine. But "fine" isn't great. That VM has been at 8% CPU for weeks. A Key Vault secret expires in 12 days.

Nothing's broken. But security is drifting, costs are creeping, and capacity gaps are growing silently.

The question isn't "is something broken?" but "could this be better?"

Four Pillars of Cloud Optimization

Pillar	What Teams Want	The Challenge
Security	Stay compliant, reduce risk	Config drift, legacy settings, expiring creds
Cost	Spend efficiently, justify budget	Hard to spot waste across 100s of resources
Performance	Meet SLOs, handle growth	Know when to scale before demand hits
Availability	Maximize uptime, build resilience	Hidden dependencies, single points of failure

Most teams check these sometimes. SRE Agent checks them continuously.

Enter SRE Agent + Scheduled Tasks

SRE Agent can pull data from Azure Monitor, resource configurations, metrics, logs, traces, errors, and cost data, then analyze it on a schedule. If you use tools outside Azure (Datadog, PagerDuty, Splunk), you can connect those via MCP servers so the agent sees your full observability stack.

My setup uses Azure-native sources. Here's how I wired it up.

How I Set It Up: Step by Step

Step 1: Create SRE Agent with Subscription Access

I created an SRE Agent without attaching it to any specific resource group. Instead, I gave it Reader access at the subscription level. This lets the agent scan across all my resource groups for optimization opportunities.

No resource group configuration needed. The agent builds a knowledge graph of everything across the subscription: VMs, storage accounts, Key Vaults, NSGs, and web apps.
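If you script your setup, subscription-level Reader access corresponds to a role assignment scoped to `/subscriptions/<subscription-id>`. The sketch below only builds the Azure CLI arguments for such an assignment and runs nothing. The subscription ID and principal ID are placeholders, and how the agent's identity is exposed in your tenant is an assumption to verify; the portal may create this assignment for you.

```python
from typing import List


def reader_assignment_args(subscription_id: str, principal_id: str) -> List[str]:
    """Build `az role assignment create` arguments for subscription-wide Reader."""
    # Scoping at the subscription (not a resource group) covers every resource group in it.
    scope = f"/subscriptions/{subscription_id}"
    return [
        "az", "role", "assignment", "create",
        "--assignee", principal_id,
        "--role", "Reader",
        "--scope", scope,
    ]


if __name__ == "__main__":
    # Both values are placeholders, not real identifiers.
    args = reader_assignment_args(
        "00000000-0000-0000-0000-000000000000", "<agent-principal-id>"
    )
    print(" ".join(args))
```

Reader is read-only, so an assignment like this lets the agent scan resources without being able to change them.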

Step 2: Create and Upload My Organization Practices

I created an org-practices.md file that defines what "good" looks like for my team.

I uploaded this to SRE Agent's knowledge base. Now the agent knows our bar, not just Azure defaults.
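To make "our bar" concrete, here is a minimal sketch of the kind of rule a practices file can encode, written as plain Python rather than as anything the agent runs. The 30-day warning window, the `SecretPractice` type, and the secret name are all hypothetical; they are not taken from the actual org-practices.md.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SecretPractice:
    # Hypothetical team threshold: warn when a secret expires within this many days.
    warn_within_days: int = 30


def secret_finding(name: str, expires_on: date, today: date,
                   practice: SecretPractice = SecretPractice()) -> Optional[str]:
    """Return a finding string if the secret violates the practice, else None."""
    days_left = (expires_on - today).days
    if days_left < 0:
        return f"CRITICAL: secret '{name}' expired {-days_left} day(s) ago"
    if days_left <= practice.warn_within_days:
        return f"WARNING: secret '{name}' expires in {days_left} days"
    return None


if __name__ == "__main__":
    # A secret 12 days from expiry falls inside the 30-day warning window.
    print(secret_finding("db-password", date(2026, 1, 31), date(2026, 1, 19)))
    # → WARNING: secret 'db-password' expires in 12 days
```

The same secret would pass silently under a 7-day window, which is why the threshold belongs in the team's standards and not in a generic default.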

👉 See my full org-practices.md

Source repos for this demo:

security-demoapp - App with intentional security misconfigurations
costoptimizationapp - App with cost optimization opportunities

Step 3: Connect to Teams Channel

I connected SRE Agent to my team's Teams channel so findings land where we already work. Critical findings get immediate notifications. Warnings go into a daily digest. No more logging into separate dashboards. The insights come to us.
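The routing rule described above (critical now, warnings in a digest) can be sketched in a few lines. This is an illustration of the rule only, not the agent's implementation; the finding fields (`severity`, `resource`, `summary`) and the sample resource names are hypothetical.

```python
def route_findings(findings):
    """Split findings into (immediate, digest) lists by severity."""
    # Critical findings are sent right away; everything else waits for the digest.
    immediate = [f for f in findings if f["severity"] == "critical"]
    digest = [f for f in findings if f["severity"] != "critical"]
    return immediate, digest


def format_digest(digest):
    """Render the digest as one plain-text message, one line per finding."""
    lines = [f"- [{f['severity']}] {f['resource']}: {f['summary']}" for f in digest]
    return "Daily digest ({} findings)\n".format(len(digest)) + "\n".join(lines)


if __name__ == "__main__":
    findings = [
        {"severity": "critical", "resource": "storage-demo", "summary": "public access enabled"},
        {"severity": "warning", "resource": "vm-demo", "summary": "CPU at 8% for weeks"},
    ]
    immediate, digest = route_findings(findings)
    print(len(immediate), "immediate notification(s)")
    print(format_digest(digest))
```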

Step 4: Connect Resource Groups to GitHub Repos

I added the two resource groups to the SRE Agent and linked each app to its corresponding GitHub repo:

Resource Group	GitHub Repository
rg-security-opt-demo	security-demoapp
rg-cost-opt-sreademo	costoptimizationapp

This enables the agent to create GitHub issues for findings, linking each violation directly to the repo responsible for that infrastructure.
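The mapping itself is simple enough to sketch. The snippet below shows how a finding in a resource group could be turned into a request against GitHub's REST endpoint for creating issues (`POST /repos/{owner}/{repo}/issues`). It only builds the URL and payload and sends nothing; the title format is an assumption, and the agent handles issue creation on its own once the repos are linked.

```python
# Resource group -> "owner/repo", using the repos from this demo.
REPO_BY_RESOURCE_GROUP = {
    "rg-security-opt-demo": "dm-chelupati/security-demoapp",
    "rg-cost-opt-sreademo": "dm-chelupati/costoptimizationapp",
}


def issue_request(resource_group, title, body):
    """Return (url, payload) for creating an issue in the repo that owns the group."""
    repo = REPO_BY_RESOURCE_GROUP[resource_group]  # KeyError if the group is unmapped
    url = f"https://api.github.com/repos/{repo}/issues"
    # Prefixing the title with the resource group is a hypothetical convention.
    payload = {"title": f"[{resource_group}] {title}", "body": body}
    return url, payload


if __name__ == "__main__":
    url, payload = issue_request(
        "rg-security-opt-demo",
        "Key Vault secret close to expiry",
        "Root cause, impact, and fix go here.",
    )
    print(url)
    print(payload["title"])
```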

Step 5: Test with Prompts

Before setting up automation, I tested the agent with manual prompts to make sure it was finding the right issues. The agent ran the checks, compared against my org-practices.md, and identified the issues.

Security Check:

Scan resource group "rg-security-opt-demo" for any violations of our security practices defined in org-practices.md in your knowledge base. list violations with severity and remediation steps. Make sure to check against all critical requirements and send message in teams channel with your findings and create an issue in the github repo https://github.com/dm-chelupati/security-demoapp.git

Cost Check:

Scan resource group "rg-cost-opt-sreademo" for any violations of our cost practices defined in org-practices.md in your knowledge base. list violations with severity and remediation steps. Make sure to check against all critical requirements and send message in teams channel with your findings and create an issue in the github repo https://github.com/dm-chelupati/costoptimizationapp.git
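The two prompts differ only in the resource group, the practice area, and the repo, so they can be generated from a template if you have many resource groups to cover. This is a convenience sketch; the template wording is a light paraphrase of the prompts above, and `build_prompt` is a hypothetical helper, not part of SRE Agent.

```python
PROMPT_TEMPLATE = (
    'Scan resource group "{resource_group}" for any violations of our '
    "{practice_area} practices defined in org-practices.md in your knowledge base. "
    "List violations with severity and remediation steps. "
    "Make sure to check against all critical requirements, "
    "send a message in the Teams channel with your findings, "
    "and create an issue in the GitHub repo {repo_url}"
)


def build_prompt(resource_group, practice_area, repo_url):
    """Fill the template for one resource group and practice area."""
    return PROMPT_TEMPLATE.format(
        resource_group=resource_group,
        practice_area=practice_area,
        repo_url=repo_url,
    )


if __name__ == "__main__":
    checks = [
        ("rg-security-opt-demo", "security",
         "https://github.com/dm-chelupati/security-demoapp.git"),
        ("rg-cost-opt-sreademo", "cost",
         "https://github.com/dm-chelupati/costoptimizationapp.git"),
    ]
    for resource_group, area, repo in checks:
        print(build_prompt(resource_group, area, repo))
        print()
```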

Step 6: Check Output via GitHub Issues

After running prompts, I checked GitHub. The agent had created issues. Each issue has the root cause, impact, and fix ready for the team to action or for Coding Agent to pick up and create a PR.

👉 See the actual issues created:

Step 7: Set Up Scheduled Triggers

This is where it gets powerful. I configured recurring schedules:

Weekly Security Check (Wednesdays 8 AM):

Create a scheduled trigger that performs security practices checks against the org practices in knowledge base org-practices.md, creates github issue and send teams message on a weekly basis Wednesdays at 8 am UTC

Weekly Cost Review (Mondays 8 AM):

Create a scheduled trigger that performs cost practices checks against the org practices in knowledge base org-practices.md, creates github issue and send teams message on a weekly basis on Mondays at 8 am UTC

Now optimization runs automatically. Every week, fresh findings land in GitHub Issues and Teams.
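To sanity-check when a weekly trigger will next fire, the schedule arithmetic can be sketched with the standard library. This only models "weekly on a given weekday at a given hour, UTC"; it says nothing about how SRE Agent stores or evaluates its triggers, and `next_weekly_run` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

MONDAY, WEDNESDAY = 0, 2  # datetime.weekday() numbering: Monday is 0


def next_weekly_run(now: datetime, weekday: int, hour: int) -> datetime:
    """Next occurrence of `weekday` at `hour`:00 UTC strictly after `now`.

    `now` must be timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    # Move forward to the requested weekday (0 to 6 days ahead).
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # This week's slot has already passed, so use next week's.
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    now = datetime(2026, 1, 19, 9, 0, tzinfo=timezone.utc)  # a Monday, 09:00 UTC
    print("security check:", next_weekly_run(now, WEDNESDAY, 8).isoformat())
    print("cost review:   ", next_weekly_run(now, MONDAY, 8).isoformat())
```

Because the Monday 8 AM slot has already passed at 09:00, the cost review lands a week later, while the Wednesday security check lands two days out.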

Why Context Makes the SRE Agent Powerful

Think about hiring a new SRE. They're excellent at their craft—they know Kubernetes, networking, Azure inside out. But on day one, they can't solve problems in your environment yet. Why? They don't have context:

What are your SLOs? What's "acceptable" latency for your app?
When do you rotate secrets? Monthly? Quarterly? Before each release?
Which resources are production-critical vs. dev experiments?
What's your tagging policy? Who owns what?
How do you deploy? GitOps? Pipelines? Manual approvals?

A great engineer becomes your great engineer once they learn how your team operates.

SRE Agent works the same way.

Out of the box, it knows Azure resource types, networking, best practices. But it doesn't know your bar. Is 20% CPU utilization acceptable or wasteful? Should secrets expire in 30 days or 90? Are public endpoints ever okay, or never?
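A small sketch shows why the bar matters: the same measurement produces different verdicts depending on the threshold a team sets. The thresholds and sample values below are hypothetical, and `cpu_verdict` is an illustration, not how the agent evaluates utilization.

```python
from statistics import mean


def cpu_verdict(samples_pct, min_avg_pct):
    """Label a VM by comparing its average CPU to a team-defined floor."""
    avg = mean(samples_pct)
    return "underutilized" if avg < min_avg_pct else "ok"


if __name__ == "__main__":
    week_of_samples = [18.0, 22.0, 20.0, 19.0, 21.0]  # averages 20% CPU
    # A team that tolerates low utilization accepts this VM.
    print(cpu_verdict(week_of_samples, min_avg_pct=10.0))  # → ok
    # A cost-focused team with a higher floor flags the same VM.
    print(cpu_verdict(week_of_samples, min_avg_pct=40.0))  # → underutilized
```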

The more context you give the agent (your SLOs, your runbooks, your policies), the more it reasons like a team member who understands your environment, not just Azure in general.

That's why Step 2 matters so much. When I uploaded our standards, the agent stopped checking generic Azure best practices and started checking our best practices.

Bring your existing knowledge: You don't have to start from scratch. If your team's documentation already lives in Atlassian Confluence, SharePoint, or other tools, you can connect those via MCP servers. The agent pulls context from where your team already works, no need to duplicate content.

Why This Matters

Before this setup, optimization was a quarterly thing. Now it happens automatically:

Before	After
Check security when audit requests it	Daily automated posture check
Find waste when finance complains	Weekly savings report in Teams
Discover capacity issues during incidents	Scheduled headroom analysis
Expire credentials and debug at 2 AM	30-day warning with exact secret names

Optimization isn't a project anymore. It's a practice.

Try It Yourself

Create an SRE Agent with access to your subscription
Upload your team's standards (security policies, cost thresholds, tagging rules)
Set up a scheduled trigger; start with a daily security check
Watch the first report land in Teams

See what you've been missing while everything looked "fine."