Applying Site Reliability Engineering to Autonomous AI Agents
If you practice SRE, you already have a mental model for running reliable production systems. You define SLOs. You track error budgets. You use circuit breakers to stop cascading failures. You run chaos experiments to find weaknesses before customers do. You treat every operational decision as a tradeoff between reliability and velocity. That mental model transfers directly to AI agents. It just needs four new ideas.

In "Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents," we covered Agent SRE briefly as one of AGT's nine packages: SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery, adapted from the patterns your SRE team already applies to microservices. Several teams asked for the full story. This is it.

Agent SRE is one of the more novel parts of the toolkit. The policy engine, zero-trust identity, and execution sandboxing have clear analogs in existing security practice. Agent SRE explores newer ground. Established patterns for defining SLOs for AI agent behavior, building chaos experiments for LLM provider failures, or applying error budgets to agent autonomy are still emerging across the industry. We built these capabilities because running agents in production without them is the equivalent of running a fleet of microservices without circuit breakers, health checks, or an on-call runbook.

This post is for SRE teams, platform engineers, and anyone responsible for running AI agents in production. You do not need to be an AI specialist. If you know what a burn rate is, you are ready for this.

The Problem: Agents Fail in Ways Your Existing SRE Tooling Cannot See

When a service fails, your observability stack tells you: latency went up, error rate crossed the SLO threshold, the circuit breaker opened. You page the on-call engineer. They look at traces and find the slow database query.

When an AI agent fails, your observability stack is silent. The agent returned HTTP 200.
Latency was normal. Error rate was zero. But the agent quietly approved a transaction it was not authorized to approve, hallucinated a database path and wrote to the wrong table, or got stuck in a reasoning loop that consumed $800 of LLM API budget before anyone noticed.

These are not infrastructure failures. They are behavioral failures. And they are invisible to monitoring tools built for stateless, deterministic services, because those tools only watch for crashes and timeouts. They do not watch for wrong behavior.

This gap is the problem Agent SRE was designed to solve. The solution borrows everything from the SRE playbook and adds one concept that extends it: the Safety SLI.

The Safety SLI: A New Reliability Dimension

Traditional SLIs measure system behavior from the user's perspective: latency, availability, error rate, throughput. They answer: did the service respond correctly?

For AI agents, correctness is not enough. An agent that responds correctly but acts outside its authorized scope has not succeeded. It has failed in a way that none of your existing SLIs can detect.

The Safety SLI answers a different question: did the agent act within policy?

    from agent_sre import SLO, ErrorBudget
    from agent_sre.slo.indicators import PolicyCompliance

    # Define a safety SLO: 99% of agent actions must comply with policy
    safety_slo = SLO(
        name="safety-compliance",
        indicators=[
            PolicyCompliance(
                target=0.99,
                window="7d",
            ),
        ],
        error_budget=ErrorBudget(
            total=0.01,               # 1% budget (1 - 0.99 target)
            window_seconds=2592000,   # 30-day window
            burn_rate_alert=2.0,      # warn at 2x sustainable rate
            burn_rate_critical=5.0,   # page at 5x sustainable rate
        ),
    )

When an agent's policy compliance rate drops below 99%, the error budget starts burning. The ErrorBudget tracks consumption automatically and exposes burn rate alerts through its firing_alerts() method.
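To make the burn-rate numbers concrete, here is a minimal sketch of the arithmetic, independent of the agent_sre package. The helper names (burn_rate, hours_to_exhaustion, classify) and the sample counts are hypothetical; only the 1% budget and the 2x and 5x thresholds come from the configuration above. A burn rate of 1.0 means the budget would be spent exactly at the end of the window.

```python
def burn_rate(bad_events: int, total_events: int, budget_fraction: float) -> float:
    """Observed violation fraction divided by the fraction the SLO allows."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / budget_fraction


def hours_to_exhaustion(rate: float, window_hours: float) -> float:
    """Hours until a full budget is spent if the current burn rate holds."""
    return float("inf") if rate <= 0 else window_hours / rate


def classify(rate: float, alert: float = 2.0, critical: float = 5.0) -> str:
    """Map a burn rate onto the warn/page thresholds from the SLO config."""
    if rate >= critical:
        return "critical"
    if rate >= alert:
        return "alert"
    return "ok"


# Hypothetical sample: 30 policy violations in 1,000 actions against a 1% budget.
rate = burn_rate(bad_events=30, total_events=1000, budget_fraction=0.01)
print(f"burn rate: {rate:.1f}x ({classify(rate)})")
print(f"hours to exhaustion over 30 days: {hours_to_exhaustion(rate, 30 * 24):.0f}")
```

At roughly 3x, a 30-day budget lasts about ten days, which is why a warning threshold well below the paging threshold is useful.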
When the budget is exhausted, the configured exhaustion_action determines the system response:

    from agent_sre.slo.objectives import ExhaustionAction

    # Configure what happens when error budget is exhausted
    safety_slo = SLO(
        name="safety-compliance",
        indicators=[PolicyCompliance(target=0.99, window="7d")],
        error_budget=ErrorBudget(
            total=0.01,
            window_seconds=2592000,
            burn_rate_alert=2.0,      # fires at 2x sustainable burn rate
            burn_rate_critical=5.0,   # fires at 5x sustainable burn rate
            exhaustion_action=ExhaustionAction.CIRCUIT_BREAK,  # suspend agent when budget is gone
        ),
    )

    # In your monitoring loop, check for firing alerts
    alerts = safety_slo.error_budget.firing_alerts()
    for alert in alerts:
        print(f"Alert firing: {alert.name} (severity: {alert.severity})")

    # Check budget status
    print(f"Budget remaining: {safety_slo.error_budget.remaining_percent:.1f}%")
    print(f"Current burn rate: {safety_slo.error_budget.burn_rate():.2f}x")
    print(f"Exhausted: {safety_slo.error_budget.is_exhausted}")

This is the governance dial from the other direction. The error budget is not just a metric: it is the mechanism that drives agent autonomy decisions. An agent with a clean 30-day safety record earns autonomy. An agent whose budget is burning at 5x the sustainable rate triggers a critical alert, and when the budget is exhausted, the exhaustion_action fires: ALERT, THROTTLE, FREEZE_DEPLOYMENTS, or CIRCUIT_BREAK. The graduated response mirrors what SRE teams already do with service SLOs, applied to agent behavior.

There are multiple SLI dimensions built into Agent SRE.
Safety SLIs and Performance SLIs track different aspects of the same agent:

| SLI Type | What It Measures | Target Pattern | When Budget Burns |
|---|---|---|---|
| Safety SLI | PolicyCompliance -- fraction of actions within authorized scope | >= 99% | Restrict capabilities, increase human oversight |
| Performance SLI | TaskSuccessRate, ResponseLatency, CostPerTask | Configurable per workload | Alert, throttle, or circuit-break LLM provider |

Additional built-in indicators include ToolCallAccuracy, DelegationChainDepth, HallucinationRate, and CalibrationDeltaSLI.

Both SLOs feed into the same error budget dashboard. An agent can have excellent performance but a degrading safety record, or perfect safety compliance and terrible cost efficiency. You need both dimensions to understand whether an agent is production-ready.

Circuit Breakers: Governing Agent Failure Modes That Don't Exist in Microservices

Circuit breakers for services protect against one failure mode: a backend that is slow or unreachable. The pattern is CLOSED -> OPEN -> HALF_OPEN. You know it well.
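As a refresher, the classic three-state pattern can be sketched in a few lines. This is a generic illustration, not the agent_sre implementation: the MiniBreaker class, its method names, and the injectable clock are all hypothetical, and the half-open handling is simplified so that a single successful probe closes the breaker.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class MiniBreaker:
    """Illustrative three-state breaker; not the agent_sre implementation."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call may proceed; OPEN becomes HALF_OPEN after the timeout."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.recovery_timeout:
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def record_success(self) -> None:
        """Any success closes the breaker and clears the failure count (simplified)."""
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        """A failed probe, or too many failures while CLOSED, opens the breaker."""
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()


# Drive the breaker with a fake clock so the example runs instantly.
now = [0.0]
breaker = MiniBreaker(failure_threshold=2, recovery_timeout=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.state, breaker.allow())  # breaker is open, calls are rejected
now[0] = 10.0
print(breaker.allow(), breaker.state)  # timeout elapsed, one probe is allowed
breaker.record_success()
print(breaker.state)                   # probe succeeded, breaker closes
```

Passing the clock in as a parameter keeps the state machine testable without sleeping, which matters once recovery timeouts are measured in minutes.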
Agent SRE implements the same state machine for failure modes that are specific to autonomous reasoning systems and do not exist in traditional microservice architectures:

    from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker
    from agent_sre.chaos.engine import FaultType

    config = CircuitBreakerConfig(
        failure_threshold=5,            # Open after 5 failures in the window
        recovery_timeout_seconds=60,    # Stay OPEN for 60s before HALF_OPEN
        half_open_max_calls=3,          # Allow 3 probes in HALF_OPEN
    )

    breaker = CircuitBreaker(agent_id="analyst-agent-001", config=config)

    # Failure modes tracked by the circuit breaker:
    tracked_faults = [
        FaultType.POLICY_BYPASS,        # Agent exceeds authorized scope
        FaultType.ERROR_INJECTION,      # Upstream model API fails
        FaultType.TIMEOUT_INJECTION,    # Tool calls exceed time budget
        FaultType.TRUST_PERTURBATION,   # Agent trust score falls below threshold
        FaultType.DEADLOCK_INJECTION,   # Agent stuck in iterative reasoning
    ]

Each failure mode has different circuit-breaking semantics:

| Failure Mode | What Triggers It | Circuit-Break Behavior |
|---|---|---|
| Policy bypass | Action denied by policy engine | Count toward threshold; log with full context |
| LLM provider error | HTTP 5xx from model API | Immediately open; route to fallback model if configured |
| Tool timeout | Tool call exceeds timeout_ms | Count toward threshold; cancel in-flight call |
| Trust score degradation | Agent trust score drops below configured floor | Open; escalate to Ring 3 (untrusted) until score recovers |
| Reasoning loop / deadlock | Token or iteration count exceeds budget | Open; trigger human review before resuming |

The reasoning loop breaker deserves attention. A microservice cannot get stuck reasoning. An AI agent absolutely can, and when it does, the failure is not an error code: it is an agent that keeps calling tools, consuming tokens, and generating audit events indefinitely.
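A runaway loop of this kind can be caught with simple counters. The sketch below is a hypothetical detector, not Agent SRE's implementation: the function names, the 10-action window, and the default limits are illustrative assumptions. It checks an iteration cap, a token cap, and the fraction of recent actions that repeat earlier ones.

```python
from typing import Optional, Sequence


def repetition_ratio(actions: Sequence[str], window: int = 10) -> float:
    """Fraction of the last `window` actions that already occurred earlier in the session."""
    start = max(0, len(actions) - window)
    recent = actions[start:]
    if not recent:
        return 0.0
    seen = set(actions[:start])
    repeats = 0
    for action in recent:
        if action in seen:
            repeats += 1
        seen.add(action)
    return repeats / len(recent)


def loop_verdict(
    actions: Sequence[str],
    tokens_used: int,
    max_iterations: int = 15,           # illustrative limit
    max_tokens: int = 50_000,           # illustrative limit
    repetition_threshold: float = 0.85, # illustrative limit
) -> Optional[str]:
    """Return the reason to break the loop, or None if the session looks healthy."""
    if len(actions) >= max_iterations:
        return "max_iterations"
    if tokens_used >= max_tokens:
        return "max_tokens"
    if repetition_ratio(actions) > repetition_threshold:
        return "repetition"
    return None


# Hypothetical sessions: one agent retrying the same query, one making progress.
stuck = ["db.query(orders)"] * 12
healthy = ["search", "db.query(orders)", "summarize"]
print(loop_verdict(stuck, tokens_used=9_000))
print(loop_verdict(healthy, tokens_used=2_000))
```

The repetition check is what catches a loop early: the stuck session above trips it after 12 steps, before either hard cap is reached.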
The circuit breaker detects this pattern from the iteration count and token budget and terminates the loop:

    # Reasoning loop detection configuration
    loop_detection_config = {
        "max_iterations": 15,               # Hard stop after 15 reasoning steps
        "max_tokens_per_session": 50000,    # Hard stop on token consumption
        "repetition_threshold": 0.85,       # Stop if >85% of recent actions repeat prior ones
        "on_detection": "circuit_break_and_escalate",
    }

The state machine behaves identically to what you know from Hystrix or Resilience4j. What changes is the definition of "failure."

    CLOSED (serving)
       |
       | failure_threshold crossed for any tracked fault
       v
    OPEN (rejecting -- agent action denied, fallback or human-in-loop fires)
       |
       | recovery_timeout expires
       v
    HALF_OPEN (probe -- limited requests allowed through)
       |
       |-- success_threshold met --> CLOSED
       |-- any failure --> OPEN (reset timeout)

Chaos Engineering for Agents: Fault Injection for Autonomous Systems

The only way to know if your agent system is resilient is to break it intentionally. Traditional chaos engineering targets infrastructure: kill a pod, inject network latency, saturate a disk. Agent chaos engineering targets the failure modes specific to autonomous reasoning systems.
Agent SRE ships fault injection templates that cover the failure modes teams consistently underestimate until they hit production:

    from agent_sre.chaos.engine import ChaosExperiment, Fault, FaultType

    # Experiment 1: LLM provider degrades -- model returns valid responses but with
    # increased latency and occasional malformed outputs
    experiment = ChaosExperiment(
        name="llm-degradation-resilience",
        target_agent="analyst-agent-001",
        description="Test agent behavior under degraded LLM provider",
        faults=[
            Fault.latency_injection(target="llm-provider", delay_ms=8000),
            Fault.error_injection(target="llm-provider", rate=0.05),
        ],
        duration_seconds=300,
    )

    # Experiment 2: Trust score manipulation -- simulates an agent receiving
    # messages from a peer with a spoofed trust score
    trust_experiment = ChaosExperiment(
        name="trust-manipulation-resilience",
        target_agent="orchestrator-001",
        faults=[
            Fault(
                fault_type=FaultType.TRUST_PERTURBATION,
                target="did:mesh:orchestrator-001",
                params={"spoofed_score": 950},
            ),
        ],
        duration_seconds=120,
    )

    # Experiment 3: Tool timeout cascade -- multiple tools time out simultaneously,
    # testing whether the agent abandons gracefully or enters a reasoning loop
    cascade_experiment = ChaosExperiment(
        name="tool-timeout-cascade",
        target_agent="analyst-agent-001",
        faults=[
            Fault.timeout_injection(target="database.read", delay_ms=30000),
            Fault.timeout_injection(target="api.call", delay_ms=30000),
        ],
        duration_seconds=180,
    )

    # Run the experiment
    experiment.start()
    # ... inject faults during agent execution ...
    resilience = experiment.calculate_resilience(
        baseline_success_rate=0.95,
        experiment_success_rate=0.87,
        recovery_time_ms=48000,
    )
    experiment.complete(resilience=resilience)
    print(f"Resilience score: {resilience.overall}/100 -- {'PASSED' if resilience.passed else 'FAILED'}")

Additional fault types built into the chaos engine cover: prompt injection attempts, privilege escalation, data exfiltration attempts, identity spoofing, deadlock injection, and contradictory instruction scenarios. Each maps to a FaultType enum value and can be composed into multi-fault experiments.

Important: The chaos engine records that a fault was injected and triggers the governance response pipeline. Actual infrastructure-level fault injection (network partition, process kill) should be implemented using your existing chaos tooling (Chaos Mesh, Gremlin, Azure Chaos Studio, or similar). Agent SRE governs the agent's behavioral response to faults; it does not own infrastructure manipulation. These two layers are designed to compose.

Each chaos experiment produces a structured resilience score via calculate_resilience(), which compares baseline and experiment success rates. A score of 90+ with passed=True means the agent maintained at least 90% of its baseline performance under fault conditions. Teams use this to set minimum resilience thresholds for production readiness.

Replay Debugging: Reproduce Behavioral Failures Exactly

Infrastructure incidents are reproducible because infrastructure is deterministic. AI agent incidents are hard to reproduce because agent behavior depends on model state, context window content, and the sequence of tool call results, none of which are preserved by default after a session ends.

Agent SRE's replay engine records every agent session as a replayable artifact: the full trace at each step, every tool call with its inputs and outputs, every policy evaluation with its decision, and every trust score at the time of each inter-agent message.
    from agent_sre.replay.capture import TraceStore
    from agent_sre.replay.engine import ReplayEngine, ReplayMode

    # Traces are captured automatically when SRE tracing is active
    store = TraceStore(
        backend="azure_blob",
        retention_days=30,
    )

    # When an incident occurs, replay the session exactly
    engine = ReplayEngine(store=store)

    # Full replay: re-run the session against the same recorded inputs
    # Uses recorded tool outputs -- no live tool calls -- so replay is deterministic
    result = await engine.replay(
        trace_id="trace_2026_05_a7f3b2",
        mode=ReplayMode.FULL,
    )
    for step in result.steps:
        print(f"Step {step.index}: {step.action} -> {step.decision}")

    # Divergence analysis: replay with a policy change applied
    # Shows exactly which actions would have been blocked under the new policy
    diff_result = await engine.diff(
        trace_id="trace_2026_05_a7f3b2",
        policy_override="policies/stricter-v2.yaml",
    )
    for diff in diff_result.diffs:
        if diff.description:
            print(f"Step {diff.span_name}: was {diff.original}, "
                  f"would be {diff.replayed} under new policy")

The divergence analysis is the feature teams use most. When a policy change is proposed, you replay recent production traces against the new policy to see how many actions would have been blocked, which sessions would have failed, and what the error budget impact would have been. Policy changes stop being guesswork.

Progressive Delivery: Safely Rolling Out New Agent Capabilities

When you ship a new service version, you do not send it to all traffic at once. You use canary deployments, feature flags, or traffic splitting. You watch the SLOs. If they degrade, you roll back.

Agent SRE brings the same discipline to agent capability rollout. When you expand an agent's authorized scope, giving it write access it did not have, connecting it to a new tool, or raising its trust floor, you do not expand to the full fleet immediately. You expand progressively, with automated SLO gates controlling each stage.
    from agent_sre.delivery.rollout import (
        AnalysisCriterion,
        CanaryRollout,
        RollbackCondition,
        RolloutStep,
    )

    rollout = CanaryRollout(
        name="database-write-capability",
        steps=[
            RolloutStep(
                name="canary",
                weight=0.05,               # 5% of agents get the new capability
                duration_seconds=86400,    # 24 hours
                analysis=[
                    AnalysisCriterion(metric="safety_sli", threshold=0.995),
                    AnalysisCriterion(metric="performance_sli", threshold=0.90),
                    AnalysisCriterion(
                        metric="error_budget_consumed",
                        threshold=0.10,
                        comparator="lte",  # canary can burn at most 10%
                    ),
                ],
            ),
            RolloutStep(
                name="early-adopters",
                weight=0.25,               # 25% traffic
                duration_seconds=172800,   # 48 hours
                analysis=[
                    AnalysisCriterion(metric="safety_sli", threshold=0.990),
                    AnalysisCriterion(metric="performance_sli", threshold=0.88),
                ],
            ),
            RolloutStep(
                name="general-availability",
                weight=1.0,                # 100% traffic
                duration_seconds=604800,   # 1 week of full observation
                analysis=[
                    AnalysisCriterion(metric="safety_sli", threshold=0.990),
                    AnalysisCriterion(metric="performance_sli", threshold=0.85),
                ],
            ),
        ],
        rollback_conditions=[
            RollbackCondition(metric="safety_sli", threshold=0.95, comparator="lte"),
        ],
    )

    # Start the rollout -- SLO gates evaluate at each step
    rollout.start()

    # Advance to next step when analysis criteria pass
    if rollout.advance():
        print(f"Advanced to step: {rollout.current_step.name}")
        print(f"Progress: {rollout.progress_percent:.0f}%")

The SLO gate at each step is the same mechanism as a CI/CD quality gate, but measured on live production behavior rather than test results. An agent capability that degrades the safety SLI during canary does not promote to the next step. If a RollbackCondition fires, the rollout rolls back automatically.

This is the mechanism that makes it operationally safe to expand agent autonomy: every expansion is measurable, every measurement gates the next expansion, and rollback is automatic.

Health Checks and Backpressure

Traditional health checks answer: is the service alive? For agents, alive is not enough.
A healthy agent is one that is alive, operating within policy, consuming resources within budget, and maintaining a trust score above the Ring threshold it was assigned.

    # Agent health check covering multiple dimensions
    health = await agent_health_check(
        agent_id="analyst-agent-001",
        dimensions=[
            "liveness",            # Is the agent process running?
            "policy_compliance",   # Is safety SLI above threshold?
            "trust_score",         # Is trust score above Ring floor?
            "resource_budget",     # Is token/API spend within limits?
            "tool_availability",   # Are the tools the agent needs reachable?
        ],
    )
    # health.status: "healthy" | "degraded" | "unhealthy"
    # health.dimensions: per-dimension pass/fail with values
    # health.recommended_action: "none" | "restrict" | "suspend" | "terminate"

When health checks report degradation, backpressure controls engage before the circuit breaker opens. Backpressure is the earlier, softer response: accept fewer concurrent tasks, reject low-priority work, drain in-flight tasks gracefully before the situation escalates.

    # Backpressure configuration
    backpressure_config = {
        "backpressure_threshold": 0.80,   # Engage when resource utilization > 80%
        "max_concurrent": 5,              # Hard cap on simultaneous agent tasks
        "priority_shedding": True,        # Drop low-priority tasks first
        "drain_timeout_seconds": 30,      # Allow in-flight tasks to complete
    }

The ordering matters: backpressure first, then circuit breaker, then suspension. Each stage is recoverable. Each stage preserves more agent state than the next. The SRE principle of graduated response applies to agents exactly as it applies to services.

Observability: Governance Metrics Flow Into Your Existing Stack

Agent SRE does not ask you to adopt a new observability platform. Governance metrics are exported through the same adapters your infrastructure monitoring already uses, including OpenTelemetry, Prometheus, Datadog, and others.
    from agent_sre.tracing.exporters import configure_exporters

    configure_exporters(
        backends=[
            {"type": "prometheus", "endpoint": "http://prometheus:9090"},
            {"type": "opentelemetry", "endpoint": "http://otel-collector:4317"},
        ],
        include_metrics=[
            "slo.safety_sli",                  # Per-agent safety compliance rate
            "slo.error_budget_remaining",      # Error budget in percentage
            "slo.burn_rate",                   # Current burn rate vs sustainable
            "circuit_breaker.state",           # CLOSED / OPEN / HALF_OPEN
            "circuit_breaker.failure_count",
            "trust_score.current",             # Agent trust score (0-1000)
            "trust_score.ring",                # Current execution ring
            "chaos.experiments_run",           # Chaos experiment telemetry
            "health.status",                   # Aggregate health status
            "backpressure.load",               # Current load vs threshold
        ],
    )

Key governance metrics available in your existing dashboards:

| Metric | What It Tells You | Alert Condition |
|---|---|---|
| slo.safety_sli | Fraction of agent actions within policy | < 0.99 |
| slo.burn_rate | Rate at which error budget is consumed | > 2.0 (warn), > 5.0 (page) |
| slo.error_budget_remaining | Budget left for the SLO window | < 20% |
| circuit_breaker.state | Current breaker state per agent | OPEN or HALF_OPEN |
| trust_score.ring | Execution ring (privilege level) | Ring 3 (untrusted) |
| health.status | Aggregate health across all dimensions | degraded or unhealthy |

If you are already running Grafana dashboards for your services, a governance dashboard for your agent fleet is a new data source and a new set of panels, not a new monitoring stack.
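The alert conditions in the table can be read as a single evaluation pass over one agent's metrics. The sketch below is a plain-Python illustration under stated assumptions: governance_alerts is a hypothetical helper, and the value encodings (budget remaining as a percentage, ring as an integer, states as strings) are guesses about how the exported metrics might look, not a documented schema.

```python
def governance_alerts(metrics: dict) -> list:
    """Return alert messages for one agent, using the thresholds from the table."""
    alerts = []

    if metrics.get("slo.safety_sli", 1.0) < 0.99:
        alerts.append("warn: safety SLI below 0.99")

    burn = metrics.get("slo.burn_rate", 0.0)
    if burn > 5.0:
        alerts.append("page: burn rate above 5x")
    elif burn > 2.0:
        alerts.append("warn: burn rate above 2x")

    # Assumed encoding: remaining budget expressed as a percentage (0-100).
    if metrics.get("slo.error_budget_remaining", 100.0) < 20.0:
        alerts.append("warn: less than 20% of error budget remaining")

    if metrics.get("circuit_breaker.state", "CLOSED") in {"OPEN", "HALF_OPEN"}:
        alerts.append("warn: circuit breaker not closed")

    # Assumed encoding: ring reported as an integer, 3 meaning untrusted.
    if metrics.get("trust_score.ring") == 3:
        alerts.append("warn: agent demoted to Ring 3")

    if metrics.get("health.status", "healthy") in {"degraded", "unhealthy"}:
        alerts.append("warn: aggregate health is not healthy")

    return alerts


# Hypothetical snapshot for a single agent.
sample = {
    "slo.safety_sli": 0.985,
    "slo.burn_rate": 3.2,
    "slo.error_budget_remaining": 64.0,
    "circuit_breaker.state": "CLOSED",
    "trust_score.ring": 2,
    "health.status": "degraded",
}
for message in governance_alerts(sample):
    print(message)
```

In practice these conditions would more likely live as alert rules in your monitoring system; the function form is only meant to show that each row of the table is an independent check.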
The SRE Mental Model for Agents: Four New Concepts

Everything in Agent SRE is built on the SRE mental model you already have, extended with four concepts that adapt traditional reliability thinking for autonomous systems:

| Traditional SRE | Agent SRE Equivalent | What Changes |
|---|---|---|
| Latency SLI | Safety SLI | Correctness of *action*, not speed of *response* |
| Error budget | Autonomy budget | Burns on policy violations, not just errors |
| Circuit breaker | Behavioral circuit breaker | Opens on wrong *behavior*, not just failure codes |
| Canary deployment | Capability rollout | Rolls out *scope*, not just code |

The governance insight is that error budgets work in both directions for agents. A service's error budget only decreases. An agent's autonomy is also a budget: it grows when the safety SLI is strong and shrinks when it degrades. The error budget mechanism becomes the operational mechanism for expanding and contracting agent autonomy in response to evidence, which is exactly what regulated industries and risk-averse enterprise teams need before they will trust an autonomous agent with consequential actions.

Getting Started with Agent SRE

    pip install agent-sre

A minimal Agent SRE integration requires three things: a safety SLO definition, a circuit breaker, and a health check. The progressive delivery and chaos engineering features layer on top when you are ready for them.
    from agent_sre import SLO, ErrorBudget
    from agent_sre.slo.indicators import PolicyCompliance
    from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker

    # Step 1: Define your safety SLO
    slo = SLO(
        name="production-safety",
        indicators=[PolicyCompliance(target=0.99, window="24h")],
        error_budget=ErrorBudget(total=0.01, burn_rate_alert=2.0, burn_rate_critical=5.0),
    )

    # Step 2: Configure a circuit breaker
    breaker_config = CircuitBreakerConfig(
        failure_threshold=5,
        recovery_timeout_seconds=60,
        half_open_max_calls=3,
    )
    breaker = CircuitBreaker(agent_id="my-agent", config=breaker_config)

    # Step 3: Wire into your existing agent loop
    async def governed_agent_loop(agent, task):
        # Check health first (agent_is_healthy is an application-defined helper, not shown)
        if not await agent_is_healthy(agent.id):
            return {"error": "agent suspended", "reason": "health check failed"}

        # Run within circuit breaker protection
        async with breaker:
            result = await agent.run(task)
            slo.record_event(good=result.policy_compliant)
            return result

The quickstart in the repository walks through a complete setup with safety SLOs, circuit breakers, and a Prometheus dashboard export in under 50 lines.

Why This Matters

Most AI observability tools today focus on what you might call model quality: hallucination rate, latency, token cost, task completion. These are useful metrics. They are not SRE metrics. They do not answer whether the agent acted within its authorized scope, whether its behavioral error budget is burning at a dangerous rate, or whether it would survive the LLM provider going down.

Agent SRE answers those questions using the operational vocabulary that SRE teams already understand: SLOs, error budgets, circuit breakers, chaos experiments, and health checks. The goal is not to replace your observability stack. It is to make agent governance visible inside it.

The reliability of an autonomous agent is not a property of the model. It is a property of the governance infrastructure around it. Agent SRE is that infrastructure.
Resources

- GitHub: github.com/microsoft/agent-governance-toolkit
- Install: pip install agent-sre
- Tutorials: 40+ tutorials including dedicated Agent SRE walkthroughs for SLO setup, chaos experiments, and progressive delivery
- Architecture reference: ARCHITECTURE.md
- OWASP compliance mapping: OWASP-COMPLIANCE.md -- Agent SRE addresses ASI-08 (Cascading Failures) directly through circuit breakers and SLO-based fault detection
- Part 1 -- Runtime governance: Policy engines, trust, and SRE overview
- Part 2 -- Shift-left governance: Catching violations before production
- Part 3 -- Post-hoc accountability: After the agent acts

The Agent Governance Toolkit is an open-source project released under the MIT License. All features described in this post are available in the public repository. The `agent-sre` package is currently in public preview; APIs may change before general availability.

Questions about Agent SRE in your environment? Open an issue at aka.ms/agent-governance-toolkit or start a discussion in the comments below.

Red Hat Enterprise Linux Software Reservations Now Available
What's New

After careful collaboration with Red Hat, we've updated our RHEL pay-as-you-go billing meters and reservation pricing to align with Red Hat's latest pricing model. These updates address previous billing meter issues and ensure an accurate, transparent experience for our customers.

Starting today, customers can purchase RHEL software reservations on Azure with updated pricing, allowing organizations to reduce their Linux workload costs with the flexibility and reliability of software reservations.

Key Benefits of RHEL Software Reservations

- Cost savings: Save up to 24% compared to pay-as-you-go pricing by committing to a one-year term
- Predictable costs: Lock in pricing for your RHEL workloads and optimize your cloud budget
- Pricing clarity: Updated meters aligned with Red Hat's current pricing model
- Seamless Azure integration: Manage your RHEL software reservations alongside other Azure resources

RHEL software reservations allow you to pre-purchase RHEL software capacity at a discounted rate, delivering significant savings over standard pay-as-you-go pricing.

Get Started Today

RHEL software reservations are available now on the Azure portal. To learn more about pricing, terms, and how to purchase them, visit the following pages:

- Pricing - Linux Virtual Machines | Microsoft Azure
- Prepay for software plans - Azure Reservations - Azure Virtual Machines | Microsoft Learn
- Red Hat reservation plan discounts - Azure - Microsoft Cost Management | Microsoft Learn
- What are Azure Reservations? - Microsoft Cost Management | Microsoft Learn

We're committed to providing transparent, reliable billing for all Azure services and appreciate your continued partnership as we deliver the best cloud platform for your open-source workloads. For questions or support, please contact Azure Support or your Microsoft account team.

Red Hat Enterprise Linux 10 Image Mode on Azure Quick Start Guide
This guide will get you up and running with Image Mode on RHEL on Azure. It demonstrates how the technology works and how it can save you time and effort in managing your RHEL estate on Azure. To get started, you’ll need the following:

- Container Registry
- Linux server for building containers
- Azure Subscription for deploying a RHEL VM in Image Mode

On an existing Linux system with “az” installed (and logged into an Azure account), let’s begin by creating a resource group:

    az group create -l eastus -n linux-vms

Now on that same system, let’s generate an SSH key for logging into VMs:

    ssh-keygen -f ~/.ssh/linuxkey -t rsa

Hit enter at both passphrase prompts. This will generate an SSH key for the purpose of this quickstart. In production, you may wish to use a passphrase to better protect your SSH keys.

Now, we need to create a container registry. To do this on Azure, let’s go to https://portal.azure.com/#create/Microsoft.ContainerRegistry and on the first screen, specify an active subscription and choose “linux-vms” as our resource group. The name of the registry must be unique across the Azure namespace, so try something with your name. In my case, I’m going to use “rhel10demo” for the purpose of this quick start. Anywhere you see that name, please replace it with your own registry name. Your first page should look similar to this when completed:

Once you’ve filled those values in, click “Review + Create” and then on the next screen, click “Create”. Once the deployment finishes, you will have a container registry ready to go.
Back over on our Linux system, let’s now deploy to Azure a Linux server for building containers:

    az vm create \
      --resource-group linux-vms \
      --name container-builder \
      --image RedHat:rhel-raw:9_5:9.5.2024120516 \
      --admin-username core \
      --assign-identity \
      --ssh-key-values ~/.ssh/linuxkey.pub \
      --public-ip-sku Standard

Once this command has completed, you’ll find the public IP address as the value of “publicIpAddress” in the command’s output. This document will use ip.ip.ip.ip in place of an actual IP address to show you where you’d place the value.

We also need to copy our linuxkey.pub file to the container building machine so that we can inject this key into the container images we build:

    scp -i ~/.ssh/linuxkey ~/.ssh/linuxkey.pub core@ip.ip.ip.ip:.ssh/

Now let’s get started building our image mode container image!

    ssh core@ip.ip.ip.ip -i ~/.ssh/linuxkey

First, we want to install podman and git:

    sudo dnf install -y podman git

Next, we need the Azure CLI tooling so that we can interact with our Azure Container Registry. To do that, let’s run:

    sudo rpm --import https://packages.microsoft.com/keys/microsoft.asc

We now need to edit /etc/yum.repos.d/azure-cli.repo and make sure that it has the following contents:

    [azure-cli]
    name=Azure CLI
    baseurl=https://packages.microsoft.com/yumrepos/azure-cli
    enabled=1
    gpgcheck=1
    gpgkey=https://packages.microsoft.com/keys/microsoft.asc

We can now run:

    sudo dnf install azure-cli -y

Once this package has installed, we need to run:

    az login

And follow the instructions. Let’s now generate a token credential granting us registry access by doing the following: (Remember to replace “rhel10demo” with your own registry name.)
    az acr token create --name mytoken --resource-group linux-vms --registry rhel10demo --scope-map _repositories_push

      "creationDate": "2025-04-23T14:02:59.654151+00:00",
      "credentials": {
        "certificates": null,
        "passwords": [
          {
            "creationTime": "2025-04-23T14:03:10.774839+00:00",
            "expiry": null,
            "name": "password1",
            "value": "roF8leBAMlfeTS2Jb75WmaK/Z0WATcRK/sNowu6Str+ACRAZruef"
          },
          {
            "creationTime": "2025-04-23T14:03:10.774858+00:00",
            "expiry": null,
            "name": "password2",
            "value": "vzRJsZwsp4PX/ToXTmAFQitz2yYye+dqHZXbbUXZrD+ACRBPzPXi"
          }
        ],
        "username": "mytoken"
      },
      …

In the output, we will see two passwords. Let’s grab password1, which in our example looks like: (You’ll want to save this information as it’s the only time you’ll get to see it!)

        "passwords": [
          {
            "creationTime": "2025-04-23T14:03:10.774839+00:00",
            "expiry": null,
            "name": "password1",
            "value": "roF8leBAMlfeTS2Jb75WmaK/Z0WATcRK/sNowu6Str+ACRAZruef"
          },

Here the password is the value of the “value” parameter, and we can now log into the registry with the following command: (Remember to replace “rhel10demo” with your registry name.)

    podman login rhel10demo.azurecr.io -u mytoken -p roF8leBAMlfeTS2Jb75WmaK/Z0WATcRK/sNowu6Str+ACRAZruef

Next, we want to clone the git repository that has our MSSQL Image Mode example:

    git clone https://github.com/mrguitar/rhel-mssql-bootc

As we are using Azure PAYG (Pay as you Go) images, we need to enable RHUI access inside of containers by running:

    sudo sh -c "echo -e '/usr/share/rhel/secrets:/run/secrets\n/etc/pki/rhui:/etc/pki/rhui\n/etc/yum.repos.d/rh-cloud-base.repo:/etc/yum.repos.d/rh-cloud-base.repo' >> /etc/containers/mounts.conf"
    sudo chmod 644 /etc/pki/rhui/{private,product}/*

Now, let’s change into the directory of our git repository and copy our auth.json for our container:

    cd rhel-mssql-bootc
    mkdir -p etc/ostree
    cp /run/user/1000/containers/auth.json ./etc/ostree/

We can now examine the Containerfile.azure and see that it has instructions to build and bring up a system.
The base image named in the FROM line already has Microsoft SQL Server installed, so we do not have to add it to our container:

    FROM quay.io/mrguitar/rhel-mssql-bootc:latest

    COPY etc/ /etc/
    COPY 05-cloud-kargs.toml /usr/lib/bootc/kargs.d/

    ARG sshpubkey

    RUN if test -z "$sshpubkey"; then echo "must provide sshpubkey"; exit 1; fi; \
        useradd -G wheel core && \
        mkdir -m 0700 -p /home/core/.ssh && \
        echo $sshpubkey > /home/core/.ssh/authorized_keys && \
        chmod 0600 /home/core/.ssh/authorized_keys && \
        chown -R core: /home/core

    # install required packages and enable services
    RUN dnf -y install \
        WALinuxAgent \
        cloud-init \
        cloud-utils-growpart \
        gdisk \
        hyperv-daemons && \
        dnf clean all && \
        systemctl enable NetworkManager.service && \
        systemctl enable waagent.service && \
        systemctl enable cloud-init.service && \
        echo 'ClientAliveInterval 180' >> /etc/ssh/sshd_config

    # configure waagent for cloud-init to handle provisioning
    RUN sed -i 's/Provisioning.Agent=auto/Provisioning.Agent=cloud-init/g' /etc/waagent.conf && \
        sed -i 's/ResourceDisk.Format=y/ResourceDisk.Format=n/g' /etc/waagent.conf && \
        sed -i 's/ResourceDisk.EnableSwap=y/ResourceDisk.EnableSwap=n/g' /etc/waagent.conf

We can now build the container image by running:

    podman build --build-arg "sshpubkey=$(cat ~/.ssh/linuxkey.pub)" --no-cache -f Containerfile.azure

Now that we have our container registry configured and our image built, we can push the image to the registry.
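The Containerfile stops the build when the sshpubkey build argument is empty, but checking the key file on the build host first gives a clearer error message. A minimal preflight sketch, assuming the key lives at ~/.ssh/linuxkey.pub as in this walkthrough; the function name require_pubkey is hypothetical and not part of the example repository:

```shell
#!/usr/bin/env bash
# Hypothetical helper: fail early if the public key file is missing, empty,
# or does not look like an OpenSSH public key.
require_pubkey() {
    local keyfile="$1"
    if [ ! -s "$keyfile" ]; then
        echo "public key file missing or empty: $keyfile" >&2
        return 1
    fi
    # OpenSSH public keys start with a key type such as ssh-ed25519 or ssh-rsa.
    if ! grep -Eq '^(ssh-|ecdsa-|sk-)' "$keyfile"; then
        echo "not an OpenSSH public key: $keyfile" >&2
        return 1
    fi
}

# Usage, run from the rhel-mssql-bootc directory:
#   require_pubkey ~/.ssh/linuxkey.pub && \
#     podman build --build-arg "sshpubkey=$(cat ~/.ssh/linuxkey.pub)" --no-cache -f Containerfile.azure
```

The check only looks at the key type prefix, so it catches a wrong or empty file but does not validate the key material itself.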
Let’s start by finding the image ID for our newly built image:

    [core@container-builder rhel-mssql-bootc]$ podman images
    REPOSITORY                         TAG     IMAGE ID      CREATED        SIZE
    <none>                             <none>  ba4f948305db  2 minutes ago  3.64 GB
    quay.io/mrguitar/rhel-mssql-bootc  latest  534ac925516a  4 weeks ago    3.47 GB

We can now tag this image with the following command: (Remember to replace “rhel10demo” with your registry name.)

    podman tag ba4f948305db rhel10demo.azurecr.io/mssql-image-mode:azure

Now we can push our image to our registry:

    podman push rhel10demo.azurecr.io/mssql-image-mode:azure

Now, let’s exit from the container building server and go create a new RHEL 9.5 VM that we will switch from package mode to image mode. Run this on the same system from which you created the container building VM:

    az vm create \
      --resource-group linux-vms \
      --name mssql \
      --image RedHat:rhel-raw:9_5:9.5.2024120516 \
      --admin-username core \
      --assign-identity \
      --ssh-key-values ~/.ssh/linuxkey.pub \
      --public-ip-sku Standard

Once this machine starts, let’s ssh to the system:

    ssh -i ~/.ssh/linuxkey core@ip.ip.ip.ip

Let’s start by installing podman:

    sudo dnf install podman -y

Now we must log in to our container registry using sudo so that podman running as root can download our container image: (We will use the credentials that were generated earlier in this quick start.)

    sudo podman login rhel10demo.azurecr.io -u mytoken -p roF8leBAMlfeTS2Jb75WmaK/Z0WATcRK/sNowu6Str+ACRAZruef

Then we can run the following command to pull down our image mode image and deploy it to this system:

    sudo podman run --privileged --pid=host \
      -v /var/lib/containers:/var/lib/containers \
      -v /:/target \
      -v /home/core/.ssh/authorized_keys:/bootc_authorized_ssh_keys/root \
      rhel10demo.azurecr.io/mssql-image-mode:azure \
      bootc install to-existing-root --acknowledge-destructive --root-ssh-authorized-keys /bootc_authorized_ssh_keys/root

When the above command completes successfully, you can run:

    sudo systemctl reboot

And when you reboot, your RHEL system will now be running in Image Mode! Let’s go ahead and log back in and see what’s different:

    ssh -i ~/.ssh/linuxkey core@ip.ip.ip.ip

Let’s start by trying to create a file in /opt as root, something you could normally do on package mode RHEL:

    sudo touch /opt/hello.txt

If you are on an image mode system, you’ll see this output:

    touch: cannot touch '/opt/hello.txt': Read-only file system

This is because the files in the container image are immutable, and the only way to change them is to change the image and reboot into it! This provides a significant new layer of security in that operating system files cannot be easily modified. You are still able to edit files in /etc and in home directories.

This image has Microsoft SQL Server installed and we can verify this by running:

    sudo systemctl status mssql-server

You should see output similar to:

    ○ mssql-server.service - Microsoft SQL Server Database Engine
         Loaded: loaded (/usr/lib/systemd/system/mssql-server.service; disabled; preset: disabled)
         Active: inactive (dead)
           Docs: https://docs.microsoft.com/en-us/sql/linux

Given that we want all virtual machines based on this image to have SQL Server running by default, let’s make that change to the container image. To do this, we’ll need to exit this shell and ssh into the container builder machine from earlier.
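Before leaving the image mode machine, the read-only behavior described above can be checked across several directories at once. A small sketch, assuming bash; the function name probe_writable is hypothetical, and the probe simply tries to create and remove a temporary file in each directory. Which directories turn out to be writable depends on the image, so no expected output is shown:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a file can be created in each directory.
probe_writable() {
    local dir probe
    for dir in "$@"; do
        # mktemp fails on a read-only or missing directory.
        if probe=$(mktemp "$dir/.probe.XXXXXX" 2>/dev/null); then
            rm -f "$probe"
            echo "$dir: writable"
        else
            echo "$dir: not writable"
        fi
    done
}

# Usage on the image mode machine (as root, matching the sudo touch test above):
#   sudo bash -c "$(declare -f probe_writable); probe_writable /opt /usr /etc /home/core"
```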
On that machine, run:

    cd rhel-mssql-bootc

And then use this command to add “RUN systemctl enable mssql-server.service” to the end of Containerfile.azure:

    echo >> Containerfile.azure && echo "RUN systemctl enable mssql-server.service" >> Containerfile.azure

Once this is done, we need to rebuild the image, tag the rebuilt image, and push the new image to our container registry: (Remember to replace “rhel10demo” with your registry name.)

    podman build --build-arg "sshpubkey=$(cat ~/.ssh/linuxkey.pub)" --no-cache -f Containerfile.azure

    podman images
    REPOSITORY                              TAG     IMAGE ID      CREATED         SIZE
    <none>                                  <none>  b7322184d64e  9 seconds ago   3.67 GB
    rhel10demo.azurecr.io/mssql-image-mode  azure   237947455bd9  53 minutes ago  3.67 GB
    quay.io/mrguitar/rhel-mssql-bootc       latest  e36c92d89714  4 days ago      3.5 GB

    podman tag b7322184d64e rhel10demo.azurecr.io/mssql-image-mode:azure
    cp etc/ostree/auth.json /run/user/1000/containers
    podman push rhel10demo.azurecr.io/mssql-image-mode:azure

Once this is done, we can exit this machine, log back on to our image mode machine, and run the following:

    sudo bootc upgrade
    sudo systemctl reboot

And watch the magic as the box reboots! Now when you ssh in again, run:

    sudo systemctl status mssql-server

This time, you should get output showing that SQL Server is running. Once they have been updated and rebooted, all VMs based on this image will be running SQL Server.

At this point, we can run the SQL Server demo by running:

    sudo PATH=$PATH:/opt/mssql/bin/ /opt/mssql_demo.sh

This demo shows that SQL Server is working on our RHEL machine running in Image Mode.

Any further changes that you want to make to this box can be pushed by changing the container image in the registry and running bootc upgrade. Image Mode RHEL also ships with a timer that allows these boxes to check for updates on the weekend. This can be configured so that not all of your image mode RHEL boxes update at the same time.
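One way to spread updates out is a systemd drop-in that adds a random delay to the update timer. A sketch under stated assumptions: the timer unit is taken to be bootc-fetch-apply-updates.timer (confirm the name with systemctl list-timers on your system), the 4h delay is an arbitrary example value, and the function name write_update_jitter is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a drop-in that adds random jitter to the
# (assumed) bootc update timer. $1 = systemd unit directory, $2 = max delay.
write_update_jitter() {
    local unit_dir="$1" delay="$2"
    local dropin_dir="$unit_dir/bootc-fetch-apply-updates.timer.d"
    mkdir -p "$dropin_dir"
    # RandomizedDelaySec delays each activation by a random amount up to $delay.
    printf '[Timer]\nRandomizedDelaySec=%s\n' "$delay" > "$dropin_dir/10-jitter.conf"
}

# Usage on an image mode machine (as root):
#   write_update_jitter /etc/systemd/system 4h
#   systemctl daemon-reload
```

Because /etc stays editable on an image mode system, the drop-in can be written on a running machine; adding the same file to the etc/ directory that the Containerfile copies into the image would apply it to every VM built from that image.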
With Image Mode RHEL, you don’t have to worry about system drift as you are always running from a known image. Also, you’ll probably find yourself saving a lot of time by not having to log on to lots of systems. If you need to spin up 1,000 VMs with the same image, you can easily do that with the tooling we’ve shown you today on top of Azure!