azure linux


Applying Site Reliability Engineering to Autonomous AI Agents
If you practice SRE, you already have a mental model for running reliable production systems. You define SLOs. You track error budgets. You use circuit breakers to stop cascading failures. You run chaos experiments to find weaknesses before customers do. You treat every operational decision as a tradeoff between reliability and velocity.

That mental model transfers directly to AI agents. It just needs four new ideas.

In "Agent Governance Toolkit: Architecture Deep Dive, Policy Engines, Trust, and SRE for AI Agents," we covered Agent SRE briefly as one of AGT's nine packages: SLOs, error budgets, circuit breakers, chaos engineering, and progressive delivery, adapted from the patterns your SRE team already applies to microservices. Several teams asked for the full story. This is it.

Agent SRE is one of the more novel parts of the toolkit. The policy engine, zero-trust identity, and execution sandboxing have clear analogs in existing security practice. Agent SRE explores newer ground. Established patterns for defining SLOs for AI agent behavior, building chaos experiments for LLM provider failures, or applying error budgets to agent autonomy are still emerging across the industry. We built these capabilities because running agents in production without them is the equivalent of running a fleet of microservices without circuit breakers, health checks, or an on-call runbook.

This post is for SRE teams, platform engineers, and anyone responsible for running AI agents in production. You do not need to be an AI specialist. If you know what a burn rate is, you are ready for this.

The Problem: Agents Fail in Ways Your Existing SRE Tooling Cannot See

When a service fails, your observability stack tells you: latency went up, error rate crossed the SLO threshold, the circuit breaker opened. You page the on-call engineer. They look at traces and find the slow database query.

When an AI agent fails, your observability stack is silent. The agent returned HTTP 200. Latency was normal. Error rate was zero. But the agent quietly approved a transaction it was not authorized to approve, hallucinated a database path and wrote to the wrong table, or got stuck in a reasoning loop that consumed $800 of LLM API budget before anyone noticed.

These are not infrastructure failures. They are behavioral failures. And they are invisible to monitoring tools built for stateless, deterministic services, because those tools only watch for crashes and timeouts. They do not watch for wrong behavior.

This gap is the problem Agent SRE was designed to solve. The solution borrows everything from the SRE playbook and adds one concept that extends it: the Safety SLI.

The Safety SLI: A New Reliability Dimension

Traditional SLIs measure system behavior from the user's perspective: latency, availability, error rate, throughput. They answer: did the service respond correctly?

For AI agents, correctness is not enough. An agent that responds correctly but acts outside its authorized scope has not succeeded. It has failed in a way that none of your existing SLIs can detect.

The Safety SLI answers a different question: did the agent act within policy?

    from agent_sre import SLO, ErrorBudget
    from agent_sre.slo.indicators import PolicyCompliance

    # Define a safety SLO: 99% of agent actions must comply with policy
    safety_slo = SLO(
        name="safety-compliance",
        indicators=[
            PolicyCompliance(
                target=0.99,
                window="7d",
            ),
        ],
        error_budget=ErrorBudget(
            total=0.01,              # 1% budget (1 - 0.99 target)
            window_seconds=2592000,  # 30-day window
            burn_rate_alert=2.0,     # warn at 2x sustainable rate
            burn_rate_critical=5.0,  # page at 5x sustainable rate
        ),
    )

Every non-compliant action consumes error budget, and when an agent's policy compliance rate drops below 99%, the budget burns faster than the sustainable rate. The ErrorBudget tracks consumption automatically and exposes burn rate alerts through its firing_alerts() method.
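To make the burn-rate thresholds concrete, here is a minimal sketch of the arithmetic in plain Python. It assumes the conventional SRE definition (burn rate is the observed bad-event fraction divided by the budgeted fraction) and is not taken from the agent_sre source; the function names and the sample counts are illustrative assumptions.

```python
def burn_rate(bad_events: int, total_events: int, budget_fraction: float) -> float:
    """Observed bad-event fraction divided by the budgeted fraction (1.0 = sustainable)."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / budget_fraction


def days_to_exhaustion(rate: float, window_days: float) -> float:
    """Days until a full budget is spent if the current burn rate holds."""
    return float("inf") if rate <= 0 else window_days / rate


def severity(rate: float, alert: float = 2.0, critical: float = 5.0) -> str:
    """Map a burn rate onto a warning threshold and a critical threshold."""
    if rate >= critical:
        return "critical"
    if rate >= alert:
        return "warning"
    return "ok"


# Hypothetical sample: 30 policy violations out of 1,000 agent actions, 1% budget
rate = burn_rate(bad_events=30, total_events=1000, budget_fraction=0.01)
print(f"burn rate: {rate:.1f}x, severity: {severity(rate)}")
print(f"budget gone in about {days_to_exhaustion(rate, window_days=30):.0f} days")
```

With these sample numbers the burn rate is about 3x: above a 2.0 warning threshold, below a 5.0 critical threshold, and on track to spend a 30-day budget in roughly 10 days.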
When the budget is exhausted, the configured exhaustion_action determines the system response:

    from agent_sre import SLO, ErrorBudget
    from agent_sre.slo.indicators import PolicyCompliance
    from agent_sre.slo.objectives import ExhaustionAction

    # Configure what happens when error budget is exhausted
    safety_slo = SLO(
        name="safety-compliance",
        indicators=[PolicyCompliance(target=0.99, window="7d")],
        error_budget=ErrorBudget(
            total=0.01,
            window_seconds=2592000,
            burn_rate_alert=2.0,     # fires at 2x sustainable burn rate
            burn_rate_critical=5.0,  # fires at 5x sustainable burn rate
            exhaustion_action=ExhaustionAction.CIRCUIT_BREAK,  # suspend agent when budget is gone
        ),
    )

    # In your monitoring loop, check for firing alerts
    alerts = safety_slo.error_budget.firing_alerts()
    for alert in alerts:
        print(f"Alert firing: {alert.name} (severity: {alert.severity})")

    # Check budget status
    print(f"Budget remaining: {safety_slo.error_budget.remaining_percent:.1f}%")
    print(f"Current burn rate: {safety_slo.error_budget.burn_rate():.2f}x")
    print(f"Exhausted: {safety_slo.error_budget.is_exhausted}")

This is the governance dial from the other direction. The error budget is not just a metric: it is the mechanism that drives agent autonomy decisions. An agent with a clean 30-day safety record earns autonomy. An agent whose budget is burning at 5x the sustainable rate triggers a critical alert, and when the budget is exhausted, the exhaustion_action fires: ALERT, THROTTLE, FREEZE_DEPLOYMENTS, or CIRCUIT_BREAK. The graduated response mirrors what SRE teams already do with service SLOs, applied to agent behavior.

There are multiple SLI dimensions built into Agent SRE. Safety SLIs and Performance SLIs track different aspects of the same agent:

SLI Type | What It Measures | Target Pattern | When Budget Burns
Safety SLI | PolicyCompliance -- fraction of actions within authorized scope | >= 99% | Restrict capabilities, increase human oversight
Performance SLI | TaskSuccessRate, ResponseLatency, CostPerTask | Configurable per workload | Alert, throttle, or circuit-break LLM provider

Additional built-in indicators include ToolCallAccuracy, DelegationChainDepth, HallucinationRate, and CalibrationDeltaSLI.

Both SLOs feed into the same error budget dashboard. An agent can have excellent performance but a degrading safety record, or perfect safety compliance and terrible cost efficiency. You need both dimensions to understand whether an agent is production-ready.

Circuit Breakers: Governing Agent Failure Modes That Don't Exist in Microservices

Circuit breakers for services protect against one failure mode: a backend that is slow or unreachable. The pattern is CLOSED -> OPEN -> HALF_OPEN. You know it well.
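As a refresher, the three-state pattern can be sketched in a few lines of plain Python. This is a generic illustration, not the agent_sre implementation: the class name, the injectable clock, and the simplifications (consecutive failures rather than a sliding window, a single successful probe closes the breaker) are all assumptions made for brevity.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class SimpleBreaker:
    """Three-state breaker; the clock is injectable so the logic can be tested."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call may proceed; moves OPEN -> HALF_OPEN after the timeout."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = State.HALF_OPEN
            else:
                return False
        return True

    def record(self, success: bool) -> None:
        """Feed the outcome of an allowed call back into the state machine."""
        if success:
            if self.state is State.HALF_OPEN:
                self.state = State.CLOSED  # probe succeeded
            self.failures = 0
            return
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN  # trip, or re-trip after a failed probe
            self.opened_at = self.clock()
            self.failures = 0


# Usage with a fake clock so no real waiting is needed
now = [0.0]
breaker = SimpleBreaker(failure_threshold=2, recovery_timeout=60.0, clock=lambda: now[0])
breaker.record(success=False)
breaker.record(success=False)          # second failure trips the breaker
print(breaker.state, breaker.allow())  # OPEN, calls rejected
now[0] = 61.0                          # advance the fake clock past the timeout
print(breaker.allow(), breaker.state)  # a probe is allowed in HALF_OPEN
```

The only design decision that matters for agents is what counts as a failure when calling record(); the state transitions themselves stay the same.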
Agent SRE implements the same state machine for failure modes that are specific to autonomous reasoning systems and do not exist in traditional microservice architectures:

    from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker
    from agent_sre.chaos.engine import FaultType

    config = CircuitBreakerConfig(
        failure_threshold=5,          # Open after 5 failures in the window
        recovery_timeout_seconds=60,  # Stay OPEN for 60s before HALF_OPEN
        half_open_max_calls=3,        # Allow 3 probes in HALF_OPEN
    )
    breaker = CircuitBreaker(agent_id="analyst-agent-001", config=config)

    # Failure modes tracked by the circuit breaker:
    tracked_faults = [
        FaultType.POLICY_BYPASS,       # Agent exceeds authorized scope
        FaultType.ERROR_INJECTION,     # Upstream model API fails
        FaultType.TIMEOUT_INJECTION,   # Tool calls exceed time budget
        FaultType.TRUST_PERTURBATION,  # Agent trust score falls below threshold
        FaultType.DEADLOCK_INJECTION,  # Agent stuck in iterative reasoning
    ]

Each failure mode has different circuit-breaking semantics:

Failure Mode | What Triggers It | Circuit-Break Behavior
Policy bypass | Action denied by policy engine | Count toward threshold; log with full context
LLM provider error | HTTP 5xx from model API | Immediately open; route to fallback model if configured
Tool timeout | Tool call exceeds timeout_ms | Count toward threshold; cancel in-flight call
Trust score degradation | Agent trust score drops below configured floor | Open; demote to Ring 3 (untrusted) until score recovers
Reasoning loop / deadlock | Token or iteration count exceeds budget | Open; trigger human review before resuming

The reasoning loop breaker deserves attention. A microservice cannot get stuck reasoning. An AI agent absolutely can, and when it does, the failure is not an error code: it is an agent that keeps calling tools, consuming tokens, and generating audit events indefinitely. The circuit breaker detects this pattern from the iteration count and token budget and terminates the loop:

    # Reasoning loop detection configuration
    loop_detection_config = {
        "max_iterations": 15,             # Hard stop after 15 reasoning steps
        "max_tokens_per_session": 50000,  # Hard stop on token consumption
        "repetition_threshold": 0.85,     # Stop if >85% of recent actions repeat prior ones
        "on_detection": "circuit_break_and_escalate",
    }

The state machine behaves identically to what you know from Hystrix or Resilience4j. What changes is the definition of "failure."

    CLOSED (serving)
        |
        | failure_threshold crossed for any tracked fault
        v
    OPEN (rejecting -- agent action denied, fallback or human-in-loop fires)
        |
        | recovery_timeout expires
        v
    HALF_OPEN (probe -- limited requests allowed through)
        |
        |-- success_threshold met --> CLOSED
        |-- any failure --> OPEN (reset timeout)

Chaos Engineering for Agents: Fault Injection for Autonomous Systems

The only way to know if your agent system is resilient is to break it intentionally. Traditional chaos engineering targets infrastructure: kill a pod, inject network latency, saturate a disk. Agent chaos engineering targets the failure modes specific to autonomous reasoning systems.
Agent SRE ships fault injection templates that cover the failure modes teams consistently underestimate until they hit production:

    from agent_sre.chaos.engine import ChaosExperiment, Fault, FaultType

    # Experiment 1: LLM provider degrades -- model returns valid responses but with
    # increased latency and occasional malformed outputs
    experiment = ChaosExperiment(
        name="llm-degradation-resilience",
        target_agent="analyst-agent-001",
        description="Test agent behavior under degraded LLM provider",
        faults=[
            Fault.latency_injection(target="llm-provider", delay_ms=8000),
            Fault.error_injection(target="llm-provider", rate=0.05),
        ],
        duration_seconds=300,
    )

    # Experiment 2: Trust score manipulation -- simulates an agent receiving
    # messages from a peer with a spoofed trust score
    trust_experiment = ChaosExperiment(
        name="trust-manipulation-resilience",
        target_agent="orchestrator-001",
        faults=[
            Fault(
                fault_type=FaultType.TRUST_PERTURBATION,
                target="did:mesh:orchestrator-001",
                params={"spoofed_score": 950},
            ),
        ],
        duration_seconds=120,
    )

    # Experiment 3: Tool timeout cascade -- multiple tools time out simultaneously,
    # testing whether the agent abandons gracefully or enters a reasoning loop
    cascade_experiment = ChaosExperiment(
        name="tool-timeout-cascade",
        target_agent="analyst-agent-001",
        faults=[
            Fault.timeout_injection(target="database.read", delay_ms=30000),
            Fault.timeout_injection(target="api.call", delay_ms=30000),
        ],
        duration_seconds=180,
    )

    # Run the experiment
    experiment.start()
    # ... inject faults during agent execution ...
    resilience = experiment.calculate_resilience(
        baseline_success_rate=0.95,
        experiment_success_rate=0.87,
        recovery_time_ms=48000,
    )
    experiment.complete(resilience=resilience)
    print(f"Resilience score: {resilience.overall}/100 -- {'PASSED' if resilience.passed else 'FAILED'}")

Additional fault types built into the chaos engine cover: prompt injection attempts, privilege escalation, data exfiltration attempts, identity spoofing, deadlock injection, and contradictory instruction scenarios. Each maps to a FaultType enum value and can be composed into multi-fault experiments.

Important: The chaos engine records that a fault was injected and triggers the governance response pipeline. Actual infrastructure-level fault injection (network partition, process kill) should be implemented using your existing chaos tooling (Chaos Mesh, Gremlin, Azure Chaos Studio, or similar). Agent SRE governs the agent's behavioral response to faults; it does not own infrastructure manipulation. These two layers are designed to compose.

Each chaos experiment produces a structured resilience score via calculate_resilience(), which compares baseline and experiment success rates. A score of 90+ with passed=True means the agent maintained at least 90% of its baseline performance under fault conditions. Teams use this to set minimum resilience thresholds for production readiness.

Replay Debugging: Reproduce Behavioral Failures Exactly

Infrastructure incidents are reproducible because infrastructure is deterministic. AI agent incidents are hard to reproduce because agent behavior depends on model state, context window content, and the sequence of tool call results, none of which are preserved by default after a session ends.

Agent SRE's replay engine records every agent session as a replayable artifact: the full trace at each step, every tool call with its inputs and outputs, every policy evaluation with its decision, and every trust score at the time of each inter-agent message.
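The idea behind a replayable artifact can be illustrated with a small, self-contained sketch. Everything here is hypothetical (the RecordedStep fields, the replay function, and the sample policy are assumptions for illustration, not the agent_sre data model): each step stores what the agent did and what the policy decided, so a different policy can be re-evaluated against the recording without calling any live tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RecordedStep:
    index: int
    tool: str      # tool the agent called
    args: dict     # inputs passed to the tool
    output: str    # tool output captured at record time
    decision: str  # policy decision at record time: "allow" or "deny"


Policy = Callable[[str, dict], str]  # (tool, args) -> "allow" | "deny"


def replay(trace: list[RecordedStep], policy: Policy) -> list[tuple[int, str, str]]:
    """Re-evaluate each recorded step under `policy` without calling any live tool.

    Returns (step index, original decision, replayed decision) for every divergence.
    """
    diffs = []
    for step in trace:
        replayed = policy(step.tool, step.args)
        if replayed != step.decision:
            diffs.append((step.index, step.decision, replayed))
    return diffs


# Hypothetical recorded session
trace = [
    RecordedStep(0, "database.read", {"table": "orders"}, "42 rows", "allow"),
    RecordedStep(1, "database.write", {"table": "orders"}, "ok", "allow"),
]


def stricter(tool: str, args: dict) -> str:
    """Hypothetical stricter policy: deny all database writes."""
    return "deny" if tool == "database.write" else "allow"


for index, original, replayed in replay(trace, stricter):
    print(f"Step {index}: was {original}, would be {replayed} under new policy")
```

Because the recording already holds every input and output, the replay is deterministic: running it twice against the same trace and policy yields the same list of divergences.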
    from agent_sre.replay.capture import TraceStore
    from agent_sre.replay.engine import ReplayEngine, ReplayMode

    # Traces are captured automatically when SRE tracing is active
    store = TraceStore(
        backend="azure_blob",
        retention_days=30,
    )

    # When an incident occurs, replay the session exactly
    engine = ReplayEngine(store=store)

    # Full replay: re-run the session against the same recorded inputs
    # Uses recorded tool outputs -- no live tool calls -- so replay is deterministic
    result = await engine.replay(
        trace_id="trace_2026_05_a7f3b2",
        mode=ReplayMode.FULL,
    )
    for step in result.steps:
        print(f"Step {step.index}: {step.action} -> {step.decision}")

    # Divergence analysis: replay with a policy change applied
    # Shows exactly which actions would have been blocked under the new policy
    diff_result = await engine.diff(
        trace_id="trace_2026_05_a7f3b2",
        policy_override="policies/stricter-v2.yaml",
    )
    for diff in diff_result.diffs:
        if diff.description:
            print(f"Step {diff.span_name}: was {diff.original}, "
                  f"would be {diff.replayed} under new policy")

The divergence analysis is the feature teams use most. When a policy change is proposed, you replay recent production traces against the new policy to see how many actions would have been blocked, which sessions would have failed, and what the error budget impact would have been. Policy changes stop being guesswork.

Progressive Delivery: Safely Rolling Out New Agent Capabilities

When you ship a new service version, you do not send it to all traffic at once. You use canary deployments, feature flags, or traffic splitting. You watch the SLOs. If they degrade, you roll back.

Agent SRE brings the same discipline to agent capability rollout. When you expand an agent's authorized scope, giving it write access it did not have, connecting it to a new tool, or raising its trust floor, you do not expand to the full fleet immediately. You expand progressively, with automated SLO gates controlling each stage.
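The gating logic itself is simple enough to sketch in plain Python. This is an illustration under stated assumptions, not the agent_sre implementation: the Criterion class, the "gte" default comparator, the fail-closed handling of missing metrics, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    metric: str
    threshold: float
    comparator: str = "gte"  # assumed default: observed value must be >= threshold


def passes(criterion: Criterion, observed: dict[str, float]) -> bool:
    """A criterion passes only if its metric was observed and satisfies the comparator."""
    value = observed.get(criterion.metric)
    if value is None:
        return False  # missing data fails closed
    if criterion.comparator == "gte":
        return value >= criterion.threshold
    if criterion.comparator == "lte":
        return value <= criterion.threshold
    raise ValueError(f"unknown comparator: {criterion.comparator}")


def gate(criteria: list[Criterion], observed: dict[str, float]) -> bool:
    """Promote to the next stage only when every criterion passes."""
    return all(passes(c, observed) for c in criteria)


# Hypothetical criteria for a canary stage
canary_criteria = [
    Criterion("safety_sli", 0.995),
    Criterion("performance_sli", 0.90),
    Criterion("error_budget_consumed", 0.10, comparator="lte"),
]

# Hypothetical observations collected during the stage
observed = {"safety_sli": 0.997, "performance_sli": 0.93, "error_budget_consumed": 0.04}
print("promote" if gate(canary_criteria, observed) else "hold")
```

Failing closed on missing metrics is a deliberate choice in this sketch: a stage with no safety data should not be treated as a stage with good safety data.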
    from agent_sre.delivery.rollout import (
        AnalysisCriterion,
        CanaryRollout,
        RollbackCondition,
        RolloutStep,
    )

    rollout = CanaryRollout(
        name="database-write-capability",
        steps=[
            RolloutStep(
                name="canary",
                weight=0.05,             # 5% of agents get the new capability
                duration_seconds=86400,  # 24 hours
                analysis=[
                    AnalysisCriterion(metric="safety_sli", threshold=0.995),
                    AnalysisCriterion(metric="performance_sli", threshold=0.90),
                    AnalysisCriterion(
                        metric="error_budget_consumed",
                        threshold=0.10,
                        comparator="lte",  # canary can burn at most 10%
                    ),
                ],
            ),
            RolloutStep(
                name="early-adopters",
                weight=0.25,              # 25% traffic
                duration_seconds=172800,  # 48 hours
                analysis=[
                    AnalysisCriterion(metric="safety_sli", threshold=0.990),
                    AnalysisCriterion(metric="performance_sli", threshold=0.88),
                ],
            ),
            RolloutStep(
                name="general-availability",
                weight=1.0,               # 100% traffic
                duration_seconds=604800,  # 1 week of full observation
                analysis=[
                    AnalysisCriterion(metric="safety_sli", threshold=0.990),
                    AnalysisCriterion(metric="performance_sli", threshold=0.85),
                ],
            ),
        ],
        rollback_conditions=[
            RollbackCondition(metric="safety_sli", threshold=0.95, comparator="lte"),
        ],
    )

    # Start the rollout -- SLO gates evaluate at each step
    rollout.start()

    # Advance to next step when analysis criteria pass
    if rollout.advance():
        print(f"Advanced to step: {rollout.current_step.name}")
    print(f"Progress: {rollout.progress_percent:.0f}%")

The SLO gate at each step is the same mechanism as a CI/CD quality gate, but measured on live production behavior rather than test results. An agent capability that degrades the safety SLI during canary does not promote to the next step. If a RollbackCondition fires, the rollout rolls back automatically.

This is the mechanism that makes it operationally safe to expand agent autonomy: every expansion is measurable, every measurement gates the next expansion, and rollback is automatic.

Health Checks and Backpressure

Traditional health checks answer: is the service alive? For agents, alive is not enough. A healthy agent is one that is alive, operating within policy, consuming resources within budget, and maintaining a trust score above the Ring threshold it was assigned.

    # Agent health check covering multiple dimensions
    health = await agent_health_check(
        agent_id="analyst-agent-001",
        dimensions=[
            "liveness",           # Is the agent process running?
            "policy_compliance",  # Is safety SLI above threshold?
            "trust_score",        # Is trust score above Ring floor?
            "resource_budget",    # Is token/API spend within limits?
            "tool_availability",  # Are the tools the agent needs reachable?
        ],
    )
    # health.status: "healthy" | "degraded" | "unhealthy"
    # health.dimensions: per-dimension pass/fail with values
    # health.recommended_action: "none" | "restrict" | "suspend" | "terminate"

When health checks report degradation, backpressure controls engage before the circuit breaker opens. Backpressure is the earlier, softer response: accept fewer concurrent tasks, reject low-priority work, drain in-flight tasks gracefully before the situation escalates.

    # Backpressure configuration
    backpressure_config = {
        "backpressure_threshold": 0.80,  # Engage when resource utilization > 80%
        "max_concurrent": 5,             # Hard cap on simultaneous agent tasks
        "priority_shedding": True,       # Drop low-priority tasks first
        "drain_timeout_seconds": 30,     # Allow in-flight tasks to complete
    }

The ordering matters: backpressure first, then circuit breaker, then suspension. Each stage is recoverable. Each stage preserves more agent state than the next. The SRE principle of graduated response applies to agents exactly as it applies to services.

Observability: Governance Metrics Flow Into Your Existing Stack

Agent SRE does not ask you to adopt a new observability platform. Governance metrics are exported through the same adapters your infrastructure monitoring already uses, including OpenTelemetry, Prometheus, Datadog, and others.
    from agent_sre.tracing.exporters import configure_exporters

    configure_exporters(
        backends=[
            {"type": "prometheus", "endpoint": "http://prometheus:9090"},
            {"type": "opentelemetry", "endpoint": "http://otel-collector:4317"},
        ],
        include_metrics=[
            "slo.safety_sli",              # Per-agent safety compliance rate
            "slo.error_budget_remaining",  # Error budget in percentage
            "slo.burn_rate",               # Current burn rate vs sustainable
            "circuit_breaker.state",       # CLOSED / OPEN / HALF_OPEN
            "circuit_breaker.failure_count",
            "trust_score.current",         # Agent trust score (0-1000)
            "trust_score.ring",            # Current execution ring
            "chaos.experiments_run",       # Chaos experiment telemetry
            "health.status",               # Aggregate health status
            "backpressure.load",           # Current load vs threshold
        ],
    )

Key governance metrics available in your existing dashboards:

Metric | What It Tells You | Alert Condition
slo.safety_sli | Fraction of agent actions within policy | < 0.99
slo.burn_rate | Rate at which error budget is consumed | > 2.0 (warn), > 5.0 (page)
slo.error_budget_remaining | Budget left for the SLO window | < 20%
circuit_breaker.state | Current breaker state per agent | OPEN or HALF_OPEN
trust_score.ring | Execution ring (privilege level) | Ring 3 (untrusted)
health.status | Aggregate health across all dimensions | degraded or unhealthy

If you are already running Grafana dashboards for your services, a governance dashboard for your agent fleet is a new data source and a new set of panels, not a new monitoring stack.
The SRE Mental Model for Agents: Four New Concepts

Everything in Agent SRE is built on the SRE mental model you already have, extended with four concepts that adapt traditional reliability thinking for autonomous systems:

Traditional SRE | Agent SRE Equivalent | What Changes
Latency SLI | Safety SLI | Correctness of *action*, not speed of *response*
Error budget | Autonomy budget | Burns on policy violations, not just errors
Circuit breaker | Behavioral circuit breaker | Opens on wrong *behavior*, not just failure codes
Canary deployment | Capability rollout | Rolls out *scope*, not just code

The governance insight is that error budgets work in both directions for agents. A service's error budget only decreases. An agent's autonomy is also a budget: it grows when the safety SLI is strong and shrinks when it degrades. The error budget mechanism becomes the operational mechanism for expanding and contracting agent autonomy in response to evidence, which is exactly what regulated industries and risk-averse enterprise teams need before they will trust an autonomous agent with consequential actions.

Getting Started with Agent SRE

    pip install agent-sre

A minimal Agent SRE integration requires three things: a safety SLO definition, a circuit breaker, and a health check. The progressive delivery and chaos engineering features layer on top when you are ready for them.
    from agent_sre import SLO, ErrorBudget
    from agent_sre.slo.indicators import PolicyCompliance
    from agent_sre.cascade.circuit_breaker import CircuitBreakerConfig, CircuitBreaker

    # Step 1: Define your safety SLO
    slo = SLO(
        name="production-safety",
        indicators=[PolicyCompliance(target=0.99, window="24h")],
        error_budget=ErrorBudget(total=0.01, burn_rate_alert=2.0, burn_rate_critical=5.0),
    )

    # Step 2: Configure a circuit breaker
    breaker_config = CircuitBreakerConfig(
        failure_threshold=5,
        recovery_timeout_seconds=60,
        half_open_max_calls=3,
    )
    breaker = CircuitBreaker(agent_id="my-agent", config=breaker_config)

    # Step 3: Wire into your existing agent loop
    async def governed_agent_loop(agent, task):
        # Check health first (agent_is_healthy is a helper you supply)
        if not await agent_is_healthy(agent.id):
            return {"error": "agent suspended", "reason": "health check failed"}

        # Run within circuit breaker protection
        async with breaker:
            result = await agent.run(task)
            # Record each action against the safety SLO
            slo.record_event(good=result.policy_compliant)
            return result

The quickstart in the repository walks through a complete setup with safety SLOs, circuit breakers, and a Prometheus dashboard export in under 50 lines.

Why This Matters

Most AI observability tools today focus on what you might call model quality: hallucination rate, latency, token cost, task completion. These are useful metrics. They are not SRE metrics. They do not answer whether the agent acted within its authorized scope, whether its behavioral error budget is burning at a dangerous rate, or whether it would survive the LLM provider going down.

Agent SRE answers those questions using the operational vocabulary that SRE teams already understand: SLOs, error budgets, circuit breakers, chaos experiments, and health checks. The goal is not to replace your observability stack. It is to make agent governance visible inside it.

The reliability of an autonomous agent is not a property of the model. It is a property of the governance infrastructure around it. Agent SRE is that infrastructure.
Resources

GitHub: github.com/microsoft/agent-governance-toolkit
Install: pip install agent-sre
Tutorials: 40+ tutorials including dedicated Agent SRE walkthroughs for SLO setup, chaos experiments, and progressive delivery
Architecture reference: ARCHITECTURE.md
OWASP compliance mapping: OWASP-COMPLIANCE.md -- Agent SRE addresses ASI-08 (Cascading Failures) directly through circuit breakers and SLO-based fault detection
Part 1 -- Runtime governance: Policy engines, trust, and SRE overview
Part 2 -- Shift-left governance: Catching violations before production
Part 3 -- Post-hoc accountability: After the agent acts

The Agent Governance Toolkit is an open-source project released under the MIT License. All features described in this post are available in the public repository. The `agent-sre` package is currently in public preview; APIs may change before general availability.

Questions about Agent SRE in your environment? Open an issue at aka.ms/agent-governance-toolkit or start a discussion in the comments below.
mosiddi
May 19, 2026, Linux and Open Source Blog
Decoupling Memory from Startup Time in AKS Sandbox Pods
What if a 96 GB sandboxed pod could start as fast as a 2 GB one?

Before recent improvements in AKS Pod Sandboxing, large-memory pods could take over a minute longer to start than smaller ones. For customers running latency-sensitive, autoscaling, AI/ML, or bursty workloads, that startup delay directly impacted scale-out responsiveness, job completion time, and overall cluster efficiency.

AKS Pod Sandboxing provides strong workload isolation by running pods inside lightweight virtual machines. This model is especially valuable for security-sensitive, untrusted, or multi-tenant workloads, but it came with a tradeoff: memory size directly impacted startup latency.

With recent updates to the Azure Linux kernel used by AKS on Microsoft Hypervisor (MSHV), AKS has significantly improved startup time for large-memory sandboxed pods. This article explains what changed, why it matters, and what AKS customers should expect in practice.

The Problem: Large-Memory Pod Startup Was Expensive

Before this change, Kata-based pod sandboxes on AKS using the Microsoft Hypervisor (MSHV) followed an eager memory allocation model:

When a pod sandbox VM was created, all memory specified in the pod resource request was committed up front on the host.
For example, a pod requesting 32 GB, 64 GB, or 96 GB of memory forced the host to allocate and pin those virtual memory pages in physical memory before the VM could boot.
As a result, sandbox startup time scaled linearly with memory size.

Measurements showed startup times growing quickly as memory increased:

Pod Sandbox Memory | E2E Startup Time (Before)
32 GB | ~21 seconds
64 GB | ~41 seconds
96 GB | ~62 seconds

This led to:

Slower startup and scale-out for memory-heavy workloads.
Inefficient node utilization due to wasted memory reserved but unused at startup.

What Changed: Deferred Page Allocation in MSHV Host Kernel

With deferred page allocation, the kernel no longer commits all virtual machine memory at sandbox creation time.

The pod sandbox VM boots with a small initial memory footprint.
Host memory pages are committed lazily, only when the guest faults them.
The total available memory remains bounded by the pod memory limit defined in the pod specification.

This behavior aligns with how KVM-based systems handle guest memory today but is implemented for MSHV in Azure Linux. In short: memory is provisioned on demand, not up front.

Results

1. Pod Startup Time Is Now Effectively Constant

The most visible benefit for AKS customers is dramatically improved pod startup time for large-memory pods. With deferred page allocation enabled, startup time becomes approximately O(1) with respect to memory size:

Pod Sandbox Memory | E2E Startup Time (After)
32 GB | ~3 seconds
64 GB | ~3 seconds
96 GB | ~3.5 seconds

~7x faster startup for 32 GB pods
~12x faster startup for 64 GB pods
~17x faster startup for 96 GB pods

2. Higher Density and Better Memory Utilization

Deferred page allocation also reduces wasted reserved memory at pod start. This allows AKS nodes to safely oversubscribe memory for cold pods, pack more sandboxed pods per node, and improve overall workload density and infrastructure efficiency.

Tradeoff: First-Touch Page Fault Cost

Deferred page allocation introduces a first-touch cost: when a workload accesses a memory page for the first time, a page fault triggers host allocation. This cost is incurred once per page. After memory is populated, steady-state performance matches eager allocation in benchmarks. For most workloads, especially those that ramp memory gradually or benefit from faster startup, the improvement outweighs this one-time cost.

What AKS Pod Sandboxing Customers Need To Do

Here's the good part: no changes are required for workloads to benefit from this improvement. However, customers are encouraged to:

Specify realistic memory requests and limits.
Take advantage of improved startup behavior for scale-out scenarios.
Deferred page allocation is available in AKS Pod Sandboxing on AKS Azure Linux version 202603.18.1 or later, running kernel-mshv 6.6.121 or newer.
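The first-touch tradeoff described above can be observed at process level with a short Python sketch. This is only an analogy: it exercises ordinary operating-system demand paging on an anonymous memory mapping, not the MSHV host kernel path, and the 64 MiB size and helper name are illustrative assumptions. Timings vary by machine, so no expected numbers are given.

```python
import mmap
import time


def touch_pages(buf: mmap.mmap, size: int, page: int) -> float:
    """Write one byte to every page of `buf`; return elapsed seconds."""
    start = time.perf_counter()
    for offset in range(0, size, page):
        buf[offset] = 1
    return time.perf_counter() - start


SIZE = 64 * 1024 * 1024  # 64 MiB anonymous mapping (illustrative size)
PAGE = mmap.PAGESIZE

t0 = time.perf_counter()
buf = mmap.mmap(-1, SIZE)  # mapping is created; pages are typically not populated yet
map_seconds = time.perf_counter() - t0

first = touch_pages(buf, SIZE, PAGE)   # first touch: each write may fault a page in
second = touch_pages(buf, SIZE, PAGE)  # second pass: pages are already resident

print(f"map: {map_seconds * 1e3:.2f} ms")
print(f"first touch: {first * 1e3:.2f} ms, second touch: {second * 1e3:.2f} ms")

pages_touched = sum(buf[offset] for offset in range(0, SIZE, PAGE))
buf.close()
```

On most systems the mapping call returns quickly regardless of size and the first pass is slower than the second, which mirrors the shape of the tradeoff: fast creation, a one-time cost per page, then steady-state behavior.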
RoaaSakr
May 14, 2026, Linux and Open Source Blog
Inspektor Gadget Completes Its First Independent Security Audit
Inspektor Gadget, the CNCF eBPF tool for Kubernetes and Linux observability, has completed its first independent security audit, conducted by Shielder and coordinated by OSTIF and CNCF. The audit found two Medium-severity issues and one Low-severity issue, now patched in release v0.50.1. Learn what the auditors discovered, the hardening recommendations the maintainers are acting on, and why this milestone matters for the open source community.
Brian Benz
May 08, 2026, Linux and Open Source Blog
Run OpenClaw Agents on Azure Linux VMs (with Secure Defaults)
Many teams want an enterprise-ready personal AI assistant, but they need it on infrastructure they control, with security boundaries they can explain to IT. That is exactly where OpenClaw fits on Azure.

OpenClaw is a self-hosted, always-on personal agent runtime you run in your enterprise environment and Azure infrastructure. Instead of relying only on a hosted chat app from a third-party provider, you can deploy, operate, and experiment with an agent on an Azure Linux VM you control, using your existing GitHub Copilot licenses, Azure OpenAI deployments, or API plans from OpenAI, Anthropic Claude, Google Gemini, and other model providers you already subscribe to. Once deployed on Azure, you can interact with an OpenClaw agent through familiar channels like Microsoft Teams, Slack, Telegram, WhatsApp, and many more!

For Azure users, this gives you a practical middle ground: modern personal-agent workflows on familiar Azure infrastructure.

What is OpenClaw, and how is it different from ChatGPT/Claude/chat apps?

OpenClaw is a self-hosted personal agent runtime that can be hosted on Azure compute infrastructure.

How it differs:

ChatGPT/Claude apps are primarily hosted chat experiences tied to one provider's models
OpenClaw is an always-on runtime you operate yourself, backed by your choice of model provider: GitHub Copilot, Azure OpenAI, OpenAI, Anthropic Claude, Google Gemini, and others
OpenClaw lets you keep the runtime boundary in your own Azure VM environment within your Azure enterprise subscription

In practice, OpenClaw is useful when you want a persistent assistant for operational and workflow tasks, with your own infrastructure as the control point. You bring whatever model provider and API plan you already have, and OpenClaw connects to it.

Why Azure Linux VMs?
Azure Linux VMs are a strong fit because they provide:

A suitable host machine for the OpenClaw agent to run on
Enterprise-friendly infrastructure and identity workflows
Repeatable provisioning via the Azure CLI
Network hardening with NSG rules
Managed SSH access through Azure Bastion instead of public SSH exposure

How to Set Up OpenClaw on an Azure Linux VM

This guide sets up an Azure Linux VM, applies NSG (Network Security Group) hardening, configures Azure Bastion for managed SSH access, and installs an always-on OpenClaw agent within the VM that you can interact with through various messaging channels.

What you'll do

Create Azure networking (VNet, subnets, NSG) and compute resources with the Azure CLI
Apply Network Security Group rules so VM SSH is allowed only from Azure Bastion
Use Azure Bastion for SSH access (no public IP on the VM)
Install OpenClaw on the Azure VM
Verify OpenClaw installation and configuration on the VM

What you need

An Azure subscription with permission to create compute and network resources
Azure CLI installed (install steps)
An SSH key pair (the guide covers generating one if needed)
~20–30 minutes

Configure deployment

Step 1: Sign in to Azure CLI

    az login                 # Select a suitable Azure subscription during Azure login
    az extension add -n ssh  # SSH extension is required for Azure Bastion SSH

The ssh extension is required for Azure Bastion native SSH tunneling.

Step 2: Register required resource providers (one-time)

Register the required Azure resource providers:

    az provider register --namespace Microsoft.Compute
    az provider register --namespace Microsoft.Network

Verify registration. Wait until both show Registered.

    az provider show --namespace Microsoft.Compute --query registrationState -o tsv
    az provider show --namespace Microsoft.Network --query registrationState -o tsv

Step 3: Set deployment variables

Set the deployment environment variables that will be needed throughout this guide.
RG="rg-openclaw" LOCATION="westus2" VNET_NAME="vnet-openclaw" VNET_PREFIX="10.40.0.0/16" VM_SUBNET_NAME="snet-openclaw-vm" VM_SUBNET_PREFIX="10.40.2.0/24" BASTION_SUBNET_PREFIX="10.40.1.0/26" NSG_NAME="nsg-openclaw-vm" VM_NAME="vm-openclaw" ADMIN_USERNAME="openclaw" BASTION_NAME="bas-openclaw" BASTION_PIP_NAME="pip-openclaw-bastion" Adjust names and CIDR ranges to fit your environment. The Bastion subnet must be at least /26. Step 4: Select SSH key Use your existing public key if you have one: SSH_PUB_KEY="$(cat ~/.ssh/id_ed25519.pub)" If you don't have an SSH key yet, generate one: ssh-keygen -t ed25519 -a 100 -f ~/.ssh/id_ed25519 -C "you@example.com" SSH_PUB_KEY="$(cat ~/.ssh/id_ed25519.pub)" Step 5: Select VM size and OS disk size VM_SIZE="Standard_B2as_v2" OS_DISK_SIZE_GB=64 Choose a VM size and OS disk size available in your subscription and region: Start smaller for light usage and scale up later Use more vCPU/RAM/disk for heavier automation, more channels, or larger model/tool workloads If a VM size is unavailable in your region or subscription quota, pick the closest available SKU List VM sizes available in your target region: az vm list-skus --location "${LOCATION}" --resource-type virtualMachines -o table Check your current vCPU and disk usage/quota: az vm list-usage --location "${LOCATION}" -o table Deploy Azure resources Step 1: Create the resource group The Azure resource group will contain all of the Azure resources that the OpenClaw agent needs. az group create -n "${RG}" -l "${LOCATION}" Step 2: Create the network security group Create the NSG and add rules so only the Bastion subnet can SSH into the VM. 
az network nsg create \ -g "${RG}" -n "${NSG_NAME}" -l "${LOCATION}" # Allow SSH from the Bastion subnet only az network nsg rule create \ -g "${RG}" --nsg-name "${NSG_NAME}" \ -n AllowSshFromBastionSubnet --priority 100 \ --access Allow --direction Inbound --protocol Tcp \ --source-address-prefixes "${BASTION_SUBNET_PREFIX}" \ --destination-port-ranges 22 # Deny SSH from the public internet az network nsg rule create \ -g "${RG}" --nsg-name "${NSG_NAME}" \ -n DenyInternetSsh --priority 110 \ --access Deny --direction Inbound --protocol Tcp \ --source-address-prefixes Internet \ --destination-port-ranges 22 # Deny SSH from other VNet sources az network nsg rule create \ -g "${RG}" --nsg-name "${NSG_NAME}" \ -n DenyVnetSsh --priority 120 \ --access Deny --direction Inbound --protocol Tcp \ --source-address-prefixes VirtualNetwork \ --destination-port-ranges 22 The rules are evaluated by priority (lowest number first): Bastion traffic is allowed at 100, then all other SSH is blocked at 110 and 120. Step 3: Create the virtual network and subnets Create the VNet with the VM subnet (NSG attached), then add the Bastion subnet. az network vnet create \ -g "${RG}" -n "${VNET_NAME}" -l "${LOCATION}" \ --address-prefixes "${VNET_PREFIX}" \ --subnet-name "${VM_SUBNET_NAME}" \ --subnet-prefixes "${VM_SUBNET_PREFIX}" # Attach the NSG to the VM subnet az network vnet subnet update \ -g "${RG}" --vnet-name "${VNET_NAME}" \ -n "${VM_SUBNET_NAME}" --nsg "${NSG_NAME}" # AzureBastionSubnet — name is required by Azure az network vnet subnet create \ -g "${RG}" --vnet-name "${VNET_NAME}" \ -n AzureBastionSubnet \ --address-prefixes "${BASTION_SUBNET_PREFIX}" Step 4: Create the Virtual Machine Create the VM with no public IP. SSH access for OpenClaw configuration will be exclusively through Azure Bastion. 
az vm create \
  -g "${RG}" -n "${VM_NAME}" -l "${LOCATION}" \
  --image "Canonical:ubuntu-24_04-lts:server:latest" \
  --size "${VM_SIZE}" \
  --os-disk-size-gb "${OS_DISK_SIZE_GB}" \
  --storage-sku StandardSSD_LRS \
  --admin-username "${ADMIN_USERNAME}" \
  --ssh-key-values "${SSH_PUB_KEY}" \
  --vnet-name "${VNET_NAME}" \
  --subnet "${VM_SUBNET_NAME}" \
  --public-ip-address "" \
  --nsg ""

--public-ip-address "" prevents a public IP from being assigned. --nsg "" skips creating a per-NIC NSG (the subnet-level NSG created earlier handles security).

Reproducibility: The command above uses latest for the Ubuntu image. To pin a specific version, list the available versions and replace latest with one of them:

az vm image list \
  --publisher Canonical --offer ubuntu-24_04-lts \
  --sku server --all -o table

Step 5: Create Azure Bastion

Azure Bastion provides secure, managed SSH access to the VM without exposing a public IP. The Bastion Standard SKU with tunneling enabled is required for the CLI-based az network bastion ssh command.

az network public-ip create \
  -g "${RG}" -n "${BASTION_PIP_NAME}" -l "${LOCATION}" \
  --sku Standard --allocation-method Static

az network bastion create \
  -g "${RG}" -n "${BASTION_NAME}" -l "${LOCATION}" \
  --vnet-name "${VNET_NAME}" \
  --public-ip-address "${BASTION_PIP_NAME}" \
  --sku Standard --enable-tunneling true

Bastion provisioning typically takes 5–10 minutes but can take up to 15–30 minutes in some regions.
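Because Bastion provisioning can take a while, it may help to poll its state before moving on to the install steps. The following is a minimal sketch, not part of the original guide: wait_for_state is a hypothetical helper name, and the retry count and delay are arbitrary assumptions. It runs whatever query command you pass it until the output matches the wanted state.

```shell
# wait_for_state is a hypothetical helper (not an Azure CLI feature).
# Usage: wait_for_state <wanted-state> <max-attempts> <delay-seconds> <command...>
wait_for_state() {
  local want="$1" attempts="$2" delay="$3"
  shift 3
  local state="" i=1
  while [ "${i}" -le "${attempts}" ]; do
    # Run the remaining arguments as the state query; errors count as "not ready yet".
    state="$("$@" 2>/dev/null || true)"
    if [ "${state}" = "${want}" ]; then
      echo "ready after ${i} check(s)"
      return 0
    fi
    # Pause between checks, but not after the final one.
    if [ "${i}" -lt "${attempts}" ]; then
      sleep "${delay}"
    fi
    i=$((i + 1))
  done
  echo "timed out; last state: ${state:-unknown}" >&2
  return 1
}

# Example: check every 30 seconds, for up to 60 checks (values are assumptions).
# wait_for_state Succeeded 60 30 \
#   az network bastion show -g "${RG}" -n "${BASTION_NAME}" \
#   --query provisioningState -o tsv
```

The query command is passed as arguments rather than hard-coded, so the same helper can be pointed at other resources created earlier in this guide.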
Step 6: Verify Deployments After all resources are deployed, your resource group should look like the following: Install OpenClaw Step 1: SSH into the VM through Azure Bastion VM_ID="$(az vm show -g "${RG}" -n "${VM_NAME}" --query id -o tsv)" az network bastion ssh \ --name "${BASTION_NAME}" \ --resource-group "${RG}" \ --target-resource-id "${VM_ID}" \ --auth-type ssh-key \ --username "${ADMIN_USERNAME}" \ --ssh-key ~/.ssh/id_ed25519 Step 2: Install OpenClaw (in the Bastion SSH shell) curl -fsSL https://openclaw.ai/install.sh | bash The installer installs Node LTS and dependencies if not already present, installs OpenClaw, and launches the OpenClaw onboarding wizard. For more information, see the open source OpenClaw install docs. OpenClaw Onboarding: Choosing an AI Model Provider During OpenClaw onboarding, you'll choose the AI model provider for the OpenClaw agent. This can be GitHub Copilot, Azure OpenAI, OpenAI, Anthropic Claude, Google Gemini, or another supported provider. See the open source OpenClaw install docs for details on choosing an AI model provider when going through the onboarding wizard. Most enterprise Azure teams already have GitHub Copilot licenses. If that is your case, we recommend choosing the GitHub Copilot provider in the OpenClaw onboarding wizard. See the open source OpenClaw docs on configuring GitHub Copilot as the AI model provider. OpenClaw Onboarding: Setting up Messaging Channels During OpenClaw onboarding, there will be an optional step where you can set up various messaging channels to interact with your OpenClaw agent. For first time users, we recommend setting up Telegram due to ease of setup. Other messaging channels such as Microsoft Teams, Slack, WhatsApp, and others can also be set up. To configure OpenClaw for messaging through chat channels, see the open source OpenClaw chat channels docs. 
Step 3: Verify OpenClaw Configuration

To validate that everything was set up correctly, run the following commands within the same Bastion SSH session:

openclaw status
openclaw gateway status

If any issues are reported, you can rerun the onboarding wizard using the steps above. Alternatively, run the following command:

openclaw doctor

Message OpenClaw

Once you have configured the OpenClaw agent to be reachable through one or more messaging channels, verify that it is responsive by sending it a message.

Enhancing OpenClaw for Use Cases

There you go! You now have a 24/7, always-on personal AI agent running in its own Azure VM environment. For OpenClaw use cases, check out the awesome-openclaw-usecases repository. To extend your OpenClaw agent with additional AI skills so that it can autonomously perform multi-step operations in any domain, check out the awesome-openclaw-skills repository. You can also check out ClawHub and ClawSkills, two popular open source skills directories.

Cleanup

To delete all resources created by this guide:

az group delete -n "${RG}" --yes --no-wait

This removes the resource group and everything inside it (VM, VNet, NSG, Bastion, public IP), including the OpenClaw agent running within the VM. Because of --no-wait, the command returns immediately and the deletion continues in the background.

If you'd like to dive deeper into deploying OpenClaw on Azure, please check out the open source OpenClaw on Azure docs.
johnsonshi_msft
May 07, 2026, Linux and Open Source Blog
Shift-Left Governance for AI Agents: How the Agent Governance Toolkit Helps You Catch Violations
In part one of this series, we covered AGT’s runtime governance: the policy engine, zero-trust identity, execution sandboxing, and the OWASP Agentic AI risk mapping. That post focused on what happens when an agent acts: policy evaluation at the moment a tool call fires, trust scoring when agents communicate, audit logging when decisions are made. Runtime governance is essential. But it is the last line of defense. After that post went live, a pattern emerged in conversations with teams adopting AGT. The same question kept coming up: runtime checks are useful, but what about everything before production? We realized runtime governance was only half the story. So we went back and built tooling for every stage of your software development lifecycle, from the moment a developer saves a file to the moment an artifact ships to users. Why Runtime Governance Is Not Enough AI agents are a new class of workload. They reason about what to do, select tools, call APIs, read databases, and spawn sub-processes, often in loops that run without direct human oversight. The OWASP Agentic AI Top 10 (published December 2025) identifies risks like excessive agency, insecure tool use, privilege escalation, and supply chain compromise. These risks span the entire lifecycle, not just runtime. Consider a few scenarios that runtime governance alone cannot prevent: A developer commits a policy YAML file with a typo that silently disables all deny rules. The agent runs unprotected until someone notices. A dependency update introduces a package with a known critical CVE. The agent starts using a vulnerable library before any security team reviews it. A contributor adds a raw cryptographic import to an application module, bypassing the security-audited signing library. The code compiles and ships. A GitHub Actions workflow uses an expression injection pattern that allows an attacker to execute arbitrary code in CI. 
A release ships without a Software Bill of Materials (SBOM), making it impossible to trace which components are affected when the next log4j-style vulnerability drops. Each of these is a governance failure, but none of them happens at runtime. They happen at commit time, at PR review time, at build time, or at release time. A comprehensive governance strategy needs coverage at every stage. Four Stages of Pre-Runtime Governance Governance violations can enter a codebase at four distinct stages of the development lifecycle. Each stage has a different class of risk, and each needs a different kind of check: Stage When It Runs What It Catches AGT Tooling Commit-time Before code leaves the developer machine Malformed policies, schema violations, secrets, stub code, unauthorized crypto Pre-commit hooks, quality gates PR-time When a pull request is opened or updated Vulnerable dependencies, missing attestation, secrets in history, unpinned versions GitHub Actions (attestation, dependency review, secret scanning, supply chain checks) CI/Build-time On every push and pull request to main Compliance violations, binary security issues, dependency confusion, workflow injection Governance Verify action, Security Scan action, CodeQL, BinSkim, policy validation Release-time Before artifacts are published Missing provenance, unsigned artifacts, incomplete SBOMs SBOM generation, Sigstore signing, build attestation, OpenSSF Scorecard Just as with bugs, the earlier you catch a governance violation, the cheaper it is to fix. A malformed policy file caught at commit time costs zero CI minutes. A secret caught in PR review never reaches the default branch. A dependency confusion attack blocked in CI never reaches production. An unsigned artifact blocked at release time never reaches users. Stage 1: Commit-Time Governance with Pre-Commit Hooks The fastest governance feedback loop is local. 
Within the AGT project, we’ve implemented three pre-commit hooks that run automatically whenever a developer stages files for commit, validating governance artifacts before they ever leave the developer's machine. Built-In Hooks The toolkit's .pre-commit-hooks.yaml defines three hooks that any repository can adopt: Hook ID What It Validates File Pattern validate-policy YAML/JSON policy files against the AGT policy schema, checking for required fields, valid operators, and structural correctness Files matching *polic*.yaml, *polic*.yml, *polic*.json validate-plugin-manifest Plugin manifest files for required fields and schema compliance Files matching plugin.json, plugin.yaml, plugin.yml evaluate-plugin-policy Plugin manifests against a governance policy file, evaluating whether the plugin would be allowed under the organization's rules Files matching plugin.json, plugin.yaml, plugin.yml To adopt these hooks, add AGT as a pre-commit hook source: # .pre-commit-config.yaml repos: - repo: https://github.com/microsoft/agent-governance-toolkit rev: main # pin to a release tag in production hooks: - id: validate-policy - id: validate-plugin-manifest - id: evaluate-plugin-policy args: ['--policy', 'policies/marketplace-policy.yaml'] Then install and run: pip install pre-commit pre-commit install pre-commit run --all-files Extended Quality Gates Beyond schema validation, we built a pre-commit rollout template (see the full example in the repository) with additional governance-specific quality gates designed to help prevent common security anti-patterns from entering the codebase: Policy validation (agt-validate): Runs the full AGT policy CLI in strict mode, catching not just schema errors but semantic issues like conflicting rules. Health check (agt-doctor): Runs on pre-push (before code leaves the machine entirely), performing a broader health check of the governance configuration. 
Plugin metadata check (agency-json-required): Ensures every plugin directory contains the required agency.json metadata file. Stub detection (no-stubs): Blocks TODO, FIXME, HACK, and raise NotImplementedError markers in staged production code. Test files are excluded. Unauthorized crypto detection (no-custom-crypto): Blocks raw cryptographic imports (hashlib, hmac, crypto.subtle, System.Security.Cryptography, ring, ed25519-dalek) outside designated security modules. This helps ensure all cryptographic operations go through the audited AGT signing libraries. Secret scanning (detect-secrets): Integrates Yelp's detect-secrets for pattern-based secret detection on every commit. Phased Rollout for Teams Adopting pre-commit hooks across a team requires a thoughtful rollout. The AGT documentation includes a phased adoption guide: Week 1: Install hooks in permissive mode. Hooks warn on violations but do not block the commit. This lets developers see what would be caught without disrupting workflow. Week 2: Switch to strict mode for policy validation only. Policy files must pass schema validation to be committed. Week 3: Enable all hooks as blocking. Stubs, unauthorized crypto, and secrets are now blocked at commit time. Week 4: Graduate to full blocking mode and remove the permissive fallback. This approach helps teams build confidence in the governance tooling before it becomes a hard gate. Stage 2: PR-Time Gates Pre-commit hooks catch issues on the developer's machine, but they can be bypassed (force push, direct GitHub edits, hooks not installed). PR-time gates provide the second layer of defense, running in GitHub Actions on every pull request before merge is allowed. Governance Attestation The Governance Attestation action validates that PR authors have completed a structured attestation checklist before their code can merge. 
The default checklist covers seven sections: Security review Privacy review Legal review Responsible AI review Accessibility review Release Readiness / Safe Deployment Org-specific Launch Gates The action is fully configurable. Organizations can customize the required sections, set a minimum PR body length, and choose their own attestation format. Outputs include the validation status, a list of errors for missing sections, and a JSON mapping of sections to checkbox counts. Here is an example workflow: # .github/workflows/pr-governance.yml name: PR Governance on: pull_request: types: [opened, edited, synchronize] jobs: attestation: runs-on: ubuntu-latest steps: - uses: microsoft/agent-governance-toolkit/action/governance-attestation@main with: required-sections: | 1) Security review 2) Privacy review 3) Responsible AI review Dependency Review The dependency review workflow helps block PRs that introduce dependencies with known CVEs or disallowed licenses. It uses the GitHub dependency-review-action with a curated license allowlist: - uses: actions/dependency-review-action@v4 with: fail-on-severity: moderate comment-summary-in-pr: always allow-licenses: > MIT, Apache-2.0, BSD-2-Clause, BSD-3-Clause, ISC, PSF-2.0, Python-2.0, 0BSD, Unlicense, CC0-1.0, CC-BY-4.0, Zlib, BSL-1.0, MPL-2.0 This runs on every PR that touches dependency manifests (package.json, Cargo.toml, pyproject.toml, requirements.txt). Dependencies with moderate or higher CVEs are flagged, and dependencies with licenses not on the allowlist are blocked. Secret Scanning The secret scanning workflow runs on every PR to the main branch and on a weekly schedule. It combines two complementary approaches: Gitleaks: Pattern-based secret detection across the full git history, catching API keys, tokens, and credentials that may have been committed at any point. 
High-entropy string scanning: Regex-based detection of common secret patterns including GitHub tokens (ghp_, gho_), AWS access keys (AKIA), Slack tokens (xox), and base64-encoded strings with high entropy. Supply Chain Integrity A dedicated supply chain check workflow triggers when dependency manifest files change. It enforces two rules that help prevent supply chain attacks: Exact version pinning: No ^ or ~ version ranges in package.json files. This prevents unexpected minor/patch version updates that could introduce compromised code. Lockfile presence: Every package directory with dependencies must have a corresponding lockfile (package-lock.json, pnpm-lock.yaml, or yarn.lock). Lockfiles help ensure reproducible builds with verified integrity hashes. Quality Gates The quality gates workflow mirrors the pre-commit hooks at the PR level, providing defense in depth. It runs four checks on every pull request: Gate Purpose No Stubs/TODOs Blocks TODO, FIXME, HACK markers in production code (test files excluded) No Unauthorized Crypto Blocks raw cryptographic imports outside designated security modules Security Audit Required Changes to security-sensitive paths require accompanying audit documentation Dependency Audit Trail Vendored patches must have an audit trail explaining the patch and its provenance These gates catch anything that bypasses pre-commit hooks: force-pushed commits, direct GitHub web edits, commits from contributors who have not installed the hooks. Stage 3: CI/Build-Time Governance Once a PR passes the gate workflows, the main CI pipeline and specialized workflows perform deeper, more computationally intensive analysis. The Governance Verify Action The Governance Verify action is the primary CI-time governance check. It is a GitHub Actions composite action that installs the toolkit and runs the compliance CLI against your repository. 
It supports four modes: Command What It Does governance-verify Runs the full compliance verification suite, checking governance controls and reporting how many pass marketplace-verify Validates a plugin manifest against marketplace requirements (required fields, signing, metadata) policy-evaluate Evaluates a specific policy file against a JSON context, returning the allow/deny decision with the matched rule all Runs governance-verify, then marketplace-verify and policy-evaluate if the corresponding paths are provided Here is an example: # .github/workflows/governance-ci.yml name: Governance CI on: [push, pull_request] jobs: verify: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: microsoft/agent-governance-toolkit/action@main with: command: all policy-path: policies/ manifest-path: plugin.json output-format: json fail-on-warning: 'true' The action outputs structured data including controls-passed, controls-total, violations count, and full command output in JSON format. This makes it straightforward to integrate with dashboards, Slack notifications, or downstream decision logic. The Security Scan Action A separate security scan action scans directories for secrets, CVEs, and dangerous code patterns. Unlike the PR-time secret scanning (which focuses on git history), this action performs deep content analysis of the current codebase: - uses: microsoft/agent-governance-toolkit/action/security-scan@main with: paths: 'plugins/ scripts/' min-severity: high exemptions-file: .security-exemptions.json The action supports configurable severity thresholds (critical, high, medium, low), an exemptions file for acknowledged findings, and structured JSON output with findings-count, blocking-count, and detailed findings. Policy Validation Workflow A dedicated policy validation workflow triggers whenever YAML files or the policy engine source code changes. 
It performs two jobs in sequence: Validate policies: Discovers all policy files matching the *policy* naming convention, then validates each file using the AGT policy CLI. Test policies: Runs the policy CLI unit tests to verify that policy evaluation behavior is correct after the changes. This ensures that policy file edits do not break the policy engine and that policy semantics are preserved. CodeQL and Static Analysis AGT uses GitHub's CodeQL for semantic static analysis of Python and TypeScript code. The CodeQL workflow runs on pushes and PRs, performing deep dataflow analysis that goes beyond pattern matching. Results are uploaded as SARIF to GitHub's Security tab, providing a centralized view of code quality issues. Dependency Confusion Scanning A dedicated CI job runs a dependency confusion scanner on every build. This is a targeted defense against a specific supply chain attack vector where an attacker registers a public package with the same name as an internal package. The scanner checks that: Internal package names do not collide with public PyPI or npm packages Notebook pip install commands only reference packages that are registered and expected Workflow Security Auditing When GitHub Actions workflow files change, a workflow security job scans for common CI/CD security issues: Expression injection: Detects patterns like ${{ github.event.pull_request.title }} used directly in run: blocks, which can allow arbitrary code execution. Overly permissive permissions: Flags workflows that request more permissions than necessary. Unpinned action references: Detects actions referenced by branch name instead of commit SHA, which is a supply chain risk. .NET Binary Analysis with BinSkim For the .NET SDK (Microsoft.AgentGovernance), the CI pipeline runs Microsoft BinSkim binary security analysis on compiled assemblies. 
BinSkim checks for security-relevant compiler and linker settings in compiled binaries, such as DEP (Data Execution Prevention), ASLR (Address Space Layout Randomization), and stack protection. Results are uploaded as SARIF to GitHub code scanning alongside the CodeQL results.

The ci-complete Gate Pattern

Because many CI jobs run conditionally based on path filters, AGT uses a pattern called ci-complete: a single gate job configured as the sole required status check in branch protection. This job runs unconditionally (if: always()), depends on all other CI jobs, and checks that none of them failed. Jobs that were skipped (because no relevant files changed) are acceptable. This pattern lets branch protection work correctly with conditional CI jobs, avoiding the common issue where a required check skipped by a path filter never reports a passing status and blocks the merge.

Language-Specific Compile-Time Enforcement

Beyond the language-agnostic CI checks, each AGT SDK uses its language's native compiler and tooling to enforce governance standards at compile time.
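Before moving on to the per-language checks, the decision made by the ci-complete gate described above can be sketched as a small shell function. This is an illustration under assumptions, not AGT's actual implementation: gate_passes is a hypothetical name, and the result strings are the values GitHub Actions reports for a job (success, skipped, failure, cancelled).

```shell
# gate_passes is a hypothetical helper: it receives the result of every upstream
# job and fails as soon as one of them is neither "success" nor "skipped".
gate_passes() {
  local result
  for result in "$@"; do
    case "${result}" in
      success|skipped)
        ;;  # skipped jobs (no relevant files changed) are acceptable
      *)
        echo "blocking result: ${result}" >&2
        return 1
        ;;
    esac
  done
  return 0
}

# Inside a gate job's run: step, the results could be supplied by the needs
# context, for example (assumed wiring, adapt to your workflow):
# gate_passes ${{ join(needs.*.result, ' ') }}
```

Treating anything other than success or skipped as blocking means a cancelled job also stops the merge, which is the conservative choice for a single required check.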
.NET: The Strictest Compile-Time Checks

The .NET SDK (Microsoft.AgentGovernance) enforces compile-time governance through MSBuild properties in Directory.Build.props and Directory.Build.targets, which apply automatically to every project in the SDK:

| Feature | MSBuild Property | Effect |
| --- | --- | --- |
| Nullable reference types | `<Nullable>enable</Nullable>` | The compiler warns on every possible null dereference, helping prevent NullReferenceException at compile time |
| Warnings as errors | `<TreatWarningsAsErrors>true</TreatWarningsAsErrors>` | All compiler warnings become build errors for packable projects; no warnings can be shipped to consumers |
| Strong-name signing | `<SignAssembly>true</SignAssembly>` | Assemblies are signed with a strong-name key (AgentGovernance.snk), enabling identity verification |
| Deterministic builds | `<ContinuousIntegrationBuild>true</ContinuousIntegrationBuild>` | Identical source code produces bit-for-bit identical binaries in CI, enabling build verification |
| SourceLink | Microsoft.SourceLink.GitHub package | Users can step into AGT source code when debugging, supporting transparency and auditability |
| Symbol packages | `<IncludeSymbols>true</IncludeSymbols>` | .snupkg symbol packages are published alongside NuGet packages for debugging support |

TypeScript: Strict Compilation and Linting

The TypeScript SDK (@microsoft/agentmesh-sdk) uses strict compiler settings and ESLint for build-time governance:

- Strict mode ("strict": true in tsconfig.json) enables all strict type-checking options, including noImplicitAny, strictNullChecks, strictFunctionTypes, and strictBindCallApply.
- Consistent file naming (forceConsistentCasingInFileNames) prevents cross-platform issues where imports work on case-insensitive file systems (Windows, macOS) but fail on case-sensitive ones (Linux CI).
- Declaration generation (declaration: true with declarationMap: true) produces .d.ts files for consumers, enabling downstream type checking.
- ESLint with @typescript-eslint provides static analysis during the build process, catching issues beyond what the TypeScript compiler checks.
Python: Type Safety and Fast Linting

Python packages in AGT use typed package markers and static analysis tooling configured in pyproject.toml:

- py.typed marker: Each package includes a py.typed file, signaling to type checkers (mypy, pyright, Pylance) that the package supports type checking. Consumers get type errors if they misuse the AGT API.
- mypy: Configured as a dev dependency with project-specific settings in pyproject.toml. Provides static type checking that catches type mismatches before runtime.
- ruff: A fast Python linter written in Rust, configured in pyproject.toml and enforced in CI. Ruff checks for hundreds of code quality rules at build time.

Stage 4: Release-Time Gates

Before artifacts reach users, the release pipeline adds a final layer of verification. These gates help ensure that what ships is exactly what was built, is signed by the expected publisher, and has a complete inventory of its components.

| Gate | Tool | What It Produces |
| --- | --- | --- |
| SBOM generation | Anchore/Syft | SPDX and CycloneDX software bills of materials listing every component, dependency, and license |
| Python signing | Sigstore | Cryptographic signature using OpenID Connect identity, verifiable without manual key distribution |
| .NET signing | Release pipeline | Microsoft Authenticode and NuGet signing through the release pipeline |
| Build provenance | actions/attest-build-provenance | SLSA provenance attestation linking the artifact to its source commit and build environment |
| SBOM attestation | actions/attest-sbom | Binds the SBOM to the specific release artifact, creating a verifiable link between the inventory and the binary |

Additionally, the OpenSSF Scorecard runs on a schedule, providing an automated security posture assessment that covers branch protection, dependency management, CI/CD practices, and more. The score is published to the OpenSSF Scorecard website, giving consumers a transparent view of the project's security practices.
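On the consumer side, a build provenance attestation like the one in the table above is typically checked with the GitHub CLI. The sketch below is an assumption about how a downstream user might do that, not a documented AGT procedure: verify_release is a hypothetical wrapper, and the artifact path in the example is a placeholder.

```shell
# verify_release is a hypothetical wrapper around the GitHub CLI.
# Arguments: path to a downloaded artifact, then the owner/repo that built it.
verify_release() {
  local artifact="$1" repo="$2"
  if ! command -v gh >/dev/null 2>&1; then
    echo "GitHub CLI (gh) not found" >&2
    return 2
  fi
  # Looks up attestations for the artifact's digest in the given repository
  # and verifies them; a non-zero exit status means verification failed.
  gh attestation verify "${artifact}" --repo "${repo}"
}

# Example (the artifact file name is a placeholder):
# verify_release ./dist/example-package.whl microsoft/agent-governance-toolkit
```

Returning a distinct status when gh is missing keeps "could not check" separate from "checked and failed" in scripts that call the wrapper.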
How It All Fits Together: Defense in Depth This approach follows a defense-in-depth principle: every check exists at multiple layers, so that bypassing one layer does not compromise the whole system. Secret scanning, for example, runs at three levels: detect-secrets at commit time (pre-commit hook), Gitleaks at PR time (secret scanning workflow), and the Security Scan action at CI time (content analysis). A developer who bypasses pre-commit hooks will still be caught by the PR-time gate. A contributor who force-pushes past the PR gate will still be caught by the CI pipeline. Similarly, policy validation runs at commit time (validate-policy hook), at PR time (quality gates), and at CI time (policy validation workflow). Each layer adds depth: the commit-time hook catches schema errors, the CI pipeline catches semantic issues and runs regression tests. The ci-complete gate job ties everything together. By depending on every CI job and serving as the single required status check, it ensures that no code merges to the main branch unless every applicable check has passed. Getting Started You can adopt AGT's shift-left governance incrementally. Here are three starting points, from lowest to highest effort: 1. Add the Governance Verify Action (5 minutes) Add a single GitHub Actions workflow that runs the compliance check on every PR: # .github/workflows/governance.yml name: Governance on: [pull_request] jobs: verify: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: microsoft/agent-governance-toolkit/action@main with: command: governance-verify 2. Enable Pre-Commit Hooks (15 minutes) Add a .pre-commit-config.yaml referencing AGT's hooks, install them, and run against all existing files to establish a baseline. Start in permissive mode and graduate to strict over four weeks. 3. 
Full Pipeline Integration (1-2 hours) Add the complete set of PR-time gates (attestation, dependency review, secret scanning, supply chain checks, quality gates), configure the Security Scan action for your plugin directories, and enable SBOM generation and signing in your release workflow. The AGT repository itself serves as a reference implementation: every workflow described in this post is running in production at aka.ms/agent-governance-toolkit. Important Notes The policy files, workflow configurations, and code samples in this post are illustrative examples. Your organization's governance requirements may differ. Review and customize all configurations before deploying to production. The Agent Governance Toolkit is designed to help organizations implement governance controls for AI agents; it does not guarantee compliance with any specific regulatory framework. Always consult your organization's security and legal teams when defining governance policies. What Comes Next Pre-runtime governance is one piece of the puzzle. Combined with the runtime governance capabilities covered in part one of this series (policy engines, zero-trust identity, execution sandboxing, audit logging), it provides coverage across the full lifecycle. The project continues to grow. Since the initial release, we’ve added a multi-stage policy pipeline (pre_input, pre_tool, post_tool, pre_output stages), approval workflows with human-in-the-loop gates, DLP attribute ratchets for monotonic session state, and OpenTelemetry instrumentation for governance operations. Over 45 step-by-step tutorials are available in the documentation. Everything described in this post is available today in the public GitHub repository. The full source, documentation, tutorials, and examples are at aka.ms/agent-governance-toolkit, open source under the MIT license. We welcome contributions, feedback, and issue reports from the community.
mosiddi
May 01, 2026, Linux and Open Source Blog
Retina 1.0 Is Now Available
We are excited to announce the first major release of Retina - a significant milestone for the project. This version brings many new features, enhancements, and bug fixes. The Retina maintainer team would like to thank all contributors, community members, and early adopters who helped make this 1.0 release possible.

What is Retina?

Retina is an open-source Kubernetes network observability platform. It enables you to continuously observe and measure network health, and investigate network issues on-demand with integrated Kubernetes-native workflows.

Why Retina?

Kubernetes networking failures are rarely isolated or easy to reproduce. Pods are ephemeral, services span multiple nodes, and network traffic crosses multiple layers (CNI, kube-proxy, node networking, policies), making crucial evidence difficult to capture. Manually connecting to nodes and stitching together logs or packet captures simply does not scale as clusters grow in size and complexity. A modern approach to observability must automate and centralize data collection while exposing rich, actionable insights.

Retina represents a major step forward in solving the complexities of Kubernetes observability by leveraging the power of eBPF. Its cloud-agnostic design, deep integration with Hubble, and support for both real-time metrics and on-demand packet captures make it an invaluable tool for DevOps, SecOps, and compliance teams across diverse environments.

What Does It Do?

Retina can collect two types of telemetry: metrics and packet captures. The Retina shell enables ad-hoc troubleshooting via pre-installed networking tools.

Metrics

Metrics provide continuous observability. They can be exported to multiple storage options such as Prometheus or Azure Monitor, and visualized in a variety of ways, including Grafana or Azure Log Analytics. Retina supports two control planes: Hubble and Standard. Both are supported regardless of the underlying CNI.
The choice of control plane affects which metrics are collected:

- Hubble metrics
- Standard metrics

You can customize which metrics are collected by enabling or disabling their corresponding plugins. Examples of metrics include:

- Incoming/outgoing traffic
- Dropped packets
- TCP/UDP
- DNS
- API Server latency
- Node/interface statistics

Packet Captures

Captures provide on-demand observability. They allow users to perform distributed packet captures across the cluster, based on specified Nodes/Pods and other supported filters. They can be triggered via the CLI or through the capture CRD, and may be output to persistent storage options such as the host filesystem, a PVC, or a storage blob. The result of the capture contains more than just a .pcap file. Retina also captures a range of networking metadata such as iptables rules, socket statistics, kernel network information from /proc/net, and more.

Shell

The Retina shell enables deep ad-hoc troubleshooting by providing a suite of networking tools. The CLI command starts an interactive shell on a Kubernetes node that runs a container image which includes standard tools such as ping or curl, as well as specialized tools like bpftool, pwru, Inspektor Gadget and more. The Retina shell is currently only available on Linux. Note that some tools require particular capabilities to execute. These can be passed as parameters through the CLI.

Use Cases

- Debugging Pod Connectivity Issues: When services can’t communicate, Retina enables rapid, automated distributed packet capture and drop metrics, drastically reducing troubleshooting time. The Retina shell also brings specialized tools for deep manual investigations.
- Continuous Monitoring of Network Health: Operators can set up alerts and dashboards for DNS failures, API server latency, or packet drops, gaining ongoing visibility into cluster networking.
- Security Auditing and Compliance: Flow logs (in Hubble mode) and metrics support security investigations and compliance reporting, enabling quick identification of unexpected connections or data transfers.
- Multi-Cluster / Multi-Cloud Visibility: Retina standardizes network observability across clouds, supporting unified dashboards and processes for SRE teams.

Where Does It Run?

Retina is designed for broad compatibility across Kubernetes distributions, cloud providers, and operating systems. There are no Azure-specific dependencies - Retina runs anywhere Kubernetes does.

- Operating Systems: Both Linux and Windows nodes are supported.
- Kubernetes Distributions: Retina is distribution-agnostic, deployable on managed services (AKS, EKS, GKE) or self-managed clusters.
- CNI / Network Stack: Retina works with any CNI, focusing on kernel-level events rather than CNI-specific logs.
- Cloud Integration: Retina exports metrics to Azure Monitor and Log Analytics, with pre-built Grafana dashboards for AKS. Integration with AWS CloudWatch or GCP Stackdriver is possible via Prometheus.
- Observability Stacks: Retina integrates with Prometheus & Grafana, Cilium Hubble (for flow logs and UI), and can be extended to other exporters.

Design Overview

Retina’s architecture consists of two layers: a data collection layer in kernel space, and a processing layer in user space that converts low-level signals into Kubernetes-aware telemetry. When Retina is installed, each node in the cluster runs a Retina agent which collects raw network telemetry from the host kernel - backed by eBPF on Linux, and HNS/VFP on Windows. The agent processes the raw network data and enriches it with Kubernetes metadata, which is then exported for consumption by monitoring tools such as Prometheus, Grafana, or Hubble UI. Modularity and extensibility are central to the design philosophy.
Retina's plugin model lets you enable only the telemetry you need, and add new sources by implementing a common plugin interface. Built-in plugins include Drop Reason, DNS, Packet Forward, and more. Check out our architecture docs for a deeper dive into Retina's design.

Get Started

Thanks to Helm charts, deploying Retina is streamlined across all environments and can be done with one configurable command. For complete documentation, visit our installation docs.

To install Retina with the Standard control plane and Basic metrics mode:

    # Look up the latest release name and use it as the chart and image version
    VERSION=$( curl -sL https://api.github.com/repos/microsoft/retina/releases/latest | jq -r .name)
    helm upgrade --install retina oci://ghcr.io/microsoft/retina/charts/retina \
        --version $VERSION \
        --namespace kube-system \
        --set image.tag=$VERSION \
        --set operator.tag=$VERSION \
        --set logLevel=info \
        --set operator.enabled=true \
        --set enabledPlugin_linux="\[dropreason\,packetforward\,linuxutil\,dns\]"

Once Retina is running in your cluster, you can then configure Prometheus and Grafana to scrape and visualize your metrics.

Install the Retina CLI with Krew:

    kubectl krew install retina

Get Involved

Retina is open-source under the MIT License and welcomes community contributions. Since its announcement in early 2024, the project has gained significant traction, with contributors from multiple organizations helping to expand its capabilities. The project is hosted on GitHub at microsoft/retina, and documentation is available at retina.sh. If you would like to contribute to Retina, you can follow our contributor guide.

What's Next?

Retina 1.1, of course! We are also discussing the future roadmap, and exploring the possibility of moving the project to community ownership. Stay tuned!

In the meantime, we welcome you to raise an issue if you find any bugs, or start a discussion if you have any questions or suggestions. You can also reach out to the Retina team via email; we would love to hear from you!
kamilp
Feb 03, 2026, Linux and Open Source Blog
Azure Linux: Driving Security in the Era of AI Innovation
Microsoft is advancing cloud and AI innovation with a clear focus on security, quality, and responsible practices. At Ignite 2025, Azure Linux reflects that commitment. As Microsoft’s ubiquitous Linux OS, it powers critical services and serves as the hub for security innovation. This year’s announcements, Azure Linux with OS Guard public preview and GA of pod sandboxing, reinforce security as one of our core priorities, helping customers build and run workloads with confidence in an increasingly complex threat landscape.

Announcing OS Guard Public Preview

We’re excited to announce the public preview of Azure Linux with OS Guard at Ignite 2025! OS Guard delivers a hardened, immutable container host built on the FedRAMP-certified Azure Linux base image. It introduces a significantly streamlined footprint with approximately 100 fewer packages than the standard Azure Linux image, reducing the attack surface and improving performance. FIPS mode is enforced by default, ensuring compliance for regulated workloads right out of the box. Additional security features include dm-verity for filesystem immutability, Trusted Launch backed by vTPM-secured keys, and seamless integration with AKS for container workloads. Built with upstream transparency and active Microsoft contributions, OS Guard provides a secure foundation for containerized applications while maintaining operational simplicity.

During the preview period, code integrity and mandatory access control (SELinux) are enabled in audit mode, allowing customers to validate policies and prepare for enforcement without impacting workloads.

General Availability: Pod Sandboxing for stronger isolation on AKS

We’re also announcing the GA of pod sandboxing on AKS, delivering stronger workload isolation for multi-tenant and regulated environments.
Based on the open source Kata project, Pod Sandboxing introduces VM-level isolation for containerized workloads by running each pod inside its own lightweight virtual machine using Kata Containers, providing a stronger security boundary compared to traditional containers.

Connect with us at Ignite

Meet the Azure Linux team and see these innovations in action:

Ignite: Join us at our breakout session (https://ignite.microsoft.com/en-US/sessions/BRK144) and visit the Linux on Azure Booth for live demos and deep dives.

Session Type | Session Code | Session Name | Date/Time (PST)
Breakout | BRK 143 | Optimizing performance, deployments, and security for Linux on Azure | Thu, Nov 20, 1:00 PM – 1:45 PM
Breakout | BRK 144 | Build, modernize, and secure AKS workloads with Azure Linux | Wed, Nov 19, 1:30 PM – 2:15 PM
Breakout | BRK 104 | From VMs and containers to AI apps with Azure Red Hat OpenShift | Thu, Nov 20, 8:30 AM – 9:15 AM
Theatre | THR 712 | Hybrid workload compliance from policy to practice on Azure | Tue, Nov 18, 3:15 PM – 3:45 PM
Theatre | THR 701 | From Container to Node: Building Minimal-CVE Solutions with Azure Linux | Wed, Nov 19, 3:30 PM – 4:00 PM
Lab | Lab 505 | Fast track your Linux and PostgreSQL migration with Azure Migrate | Tue, Nov 18, 4:30 PM – 5:45 PM; Wed, Nov 19, 3:45 PM – 5:00 PM; Thu, Nov 20, 9:00 AM – 10:15 AM

Whether you’re migrating workloads, exploring security features, or looking to engage with our engineering team, we’re eager to connect and help you succeed with Azure Linux.

Resources to get started

- Azure Linux OS Guard Overview & QuickStart: https://aka.ms/osguard
- Pod Sandboxing Overview & QuickStart: https://aka.ms/podsandboxing
- Azure Linux Documentation: https://learn.microsoft.com/en-us/azure/azure-linux/
Sudhanva
Nov 18, 2025, Linux and Open Source Blog
From Policy to Practice: Built-In CIS Benchmarks on Azure - Flexible, Hybrid-Ready
Security is more important than ever. The industry standard for secure machine configuration is the Center for Internet Security (CIS) Benchmarks. These benchmarks provide consensus-based prescriptive guidance to help organizations harden diverse systems, reduce risk, and streamline compliance with major regulatory frameworks and industry standards like NIST, HIPAA, and PCI DSS.

In our previous post, we outlined our plans to improve the Linux server compliance and hardening experience on Azure and shared a vision for integrating CIS Benchmarks. Today, that vision has turned into reality. We're now announcing the next phase of this work: Center for Internet Security (CIS) Benchmarks are now available on Azure for all Azure endorsed distros, at no additional cost to Azure and Azure Arc customers.

With today's announcement, you get access to the CIS Benchmarks on Azure with full parity to what’s published by the Center for Internet Security (CIS). You can adjust parameters or define exceptions, tailoring security to your needs and applying consistent controls across cloud, hybrid, and on-premises environments - without having to implement every control manually. Thanks to this flexible architecture, you can truly manage compliance as code.

How we achieve parity

To ensure accuracy and trust, we rely on and ingest CIS machine-readable Benchmark content (OVAL/XCCDF files) as the source of truth. This guarantees that the controls and rules you apply in Azure match the official CIS specifications, reducing drift and ensuring compliance confidence.

What’s new under the hood

At the core of this update is azure-osconfig’s new compliance engine - a lightweight, open-source module developed by the Azure Core Linux team. It evaluates Linux systems directly against industry-standard benchmarks like CIS, supporting both audit and, in the future, auto-remediation. This enables accurate, scalable compliance checks across large Linux fleets.
Here you can read more about azure-osconfig.

Dynamic rule evaluation

The new compliance engine supports simple fact-checking operations, logical operators that combine them (e.g., anyOf, allOf), and Lua-based scripting, which makes it possible to express the complex checks required by the CIS Critical Security Controls, all evaluated natively without external scripts.

Scalable architecture for large fleets

When the assignment is created, the Azure control plane instructs the machine to pull the latest Policy package via the Machine Configuration agent. The azure-osconfig compliance engine is integrated into the package as a lightweight library and is called by the Machine Configuration agent for evaluation, which happens every 15-30 minutes. This ensures near real-time compliance state without overwhelming resources and enables consistent evaluation across thousands of VMs and Azure Arc-enabled servers.

Future-ready for remediation and enforcement

While the Public Preview starts with audit-only mode, the roadmap includes per-rule remediation and enforcement using technologies like eBPF for kernel-level controls. This will allow proactive prevention of configuration drift and runtime hardening at scale. Please reach out if you are interested in auto-remediation or enforcement.

Extensibility beyond CIS Benchmarks

The architecture was designed to support other security and compliance standards as well and isn’t limited to CIS Benchmarks. The compliance engine is modular, and we plan to extend the platform with STIG and other relevant industry benchmarks. This positions Azure as a platform where you can manage compliance from a single control plane without duplicating effort elsewhere.

Collaboration with the CIS

This milestone reflects a close collaboration between Microsoft and the CIS to bring industry-standard security guidance into Azure as a built-in capability.
Our shared goal is to make cloud-native compliance practical and consistent, while giving customers the flexibility to meet their unique requirements. We are committed to continuously supporting new Benchmark releases, expanding coverage with new distributions, and easing adoption through built-in workflows, such as moving from your current Benchmark version to a new version while preserving your custom configurations.

Certification and trust

We can proudly announce that azure-osconfig has met all the requirements and is officially certified by the CIS for Benchmark assessment, so you can trust compliance results as authoritative. Minor benchmark updates will be applied automatically, while major versions will be released separately. We will include workflows to help migrate customizations seamlessly across versions.

Key Highlights

- Built-in CIS Benchmarks for Azure Endorsed Linux distributions
- Full parity with official CIS Benchmarks content and certified by the CIS for Benchmark Assessment
- Flexible configuration: adjust parameters, define exceptions, tune severity
- Hybrid support: enforce the same baseline across Azure, on-prem, and multi-cloud with Azure Arc
- Reporting format in CIS tooling style

Supported use cases

- Certified CIS Benchmarks for all Azure Endorsed Distros - Audit only (L1/L2 server profiles)
- Hybrid / On-premises and other cloud machines with Azure Arc for the supported distros
- Compliance as Code (example via GitHub -> Azure OIDC auth and API integration)
- Compatible with GuestConfig workbook

What’s next?

Our next mission is to bring the previously announced auto-remediation capability into this experience, expand the distribution coverage, and elevate our workflows even further. We’re focused on empowering you to resolve issues while honoring the unique operational complexity of your environments. Stay tuned!
Get Started

- Documentation link for this capability
- Enable CIS Benchmarks in Machine Configuration, select the “Official Center for Internet Security (CIS) Benchmarks for Linux Workloads”, then select the distributions for your assignment and customize as needed.
- If you want an additional distribution supported or have any feedback for azure-osconfig, please open an Azure support case or a GitHub issue.
- Relevant Ignite 2025 session: Hybrid workload compliance from policy to practice on Azure

Connect with us at Ignite

Meet the Linux team and stop by the Linux on Azure booth to see these innovations in action:

Session Type | Session Code | Session Name | Date/Time (PST)
Theatre | THR 712 | Hybrid workload compliance from policy to practice on Azure | Tue, Nov 18, 3:15 PM – 3:45 PM
Breakout | BRK 143 | Optimizing performance, deployments, and security for Linux on Azure | Thu, Nov 20, 1:00 PM – 1:45 PM
Breakout | BRK 144 | Build, modernize, and secure AKS workloads with Azure Linux | Wed, Nov 19, 1:30 PM – 2:15 PM
Breakout | BRK 104 | From VMs and containers to AI apps with Azure Red Hat OpenShift | Thu, Nov 20, 8:30 AM – 9:15 AM
Theatre | THR 701 | From Container to Node: Building Minimal-CVE Solutions with Azure Linux | Wed, Nov 19, 3:30 PM – 4:00 PM
Lab | Lab 505 | Fast track your Linux and PostgreSQL migration with Azure Migrate | Tue, Nov 18, 4:30 PM – 5:45 PM; Wed, Nov 19, 3:45 PM – 5:00 PM; Thu, Nov 20, 9:00 AM – 10:15 AM
pallakatos
Nov 18, 2025, Linux and Open Source Blog
Linux on Azure at Microsoft Ignite 2025: What’s New, What to Attend, and Where to Find Us
Microsoft Ignite 2025 is almost here, and we’re heading back to San Francisco from November 17-21 with a full digital experience for those joining online. Every year, Ignite brings together IT pros, developers, security teams, and technology leaders from around the world to explore the future of cloud, AI, and infrastructure.

This year, Linux takes center stage in a big way. From new security innovations in Azure Linux to deeper AKS modernization capabilities and hands-on learning opportunities, Ignite 2025 is packed with content for anyone building, running, or securing Linux-based workloads in Azure. Below is your quick guide to the biggest Linux announcements and the must-see sessions.

Major Linux Announcements at Ignite 2025

Public Preview: Built-in CIS Benchmarks for Azure Endorsed Linux Distributions

CIS Benchmarks are now integrated directly into Azure Machine Configuration, giving you automated and customizable compliance monitoring across Azure, hybrid, and on-prem environments. This makes it easier to continuously govern your Linux estate at scale with no external tooling required.

Public Preview: Azure Linux OS Guard

Azure Linux OS Guard introduces a hardened, immutable Linux container host for AKS with FIPS mode enforced by default, a reduced attack surface, and tight AKS integration. It is ideal for highly regulated or sensitive workloads and brings stronger default security with less operational complexity.

General Availability: Pod Sandboxing for AKS (Kata Containers)

Pod Sandboxing with fully managed Kata Containers is now GA, delivering VM-level isolation for AKS workloads. This provides stronger separation of CPU, memory, and networking and is well-suited for multi-tenant applications or organizations with strict compliance boundaries.

Linux Sessions at Ignite

Whether you are optimizing performance, modernizing with containers, or exploring new security scenarios, there is something for every Linux practitioner.
Breakout Sessions

Session Code | Session Name | Date and Time (PST)
BRK143 | Optimizing performance, deployments, and security for Linux on Azure | Thu Nov 20, 1:00 PM to 1:45 PM
BRK144 | Build, modernize, and secure AKS workloads with Azure Linux | Wed Nov 19, 1:30 PM to 2:15 PM
BRK104 | From VMs and containers to AI apps with Azure Red Hat OpenShift | Thu Nov 20, 8:30 AM to 9:15 AM
BRK137 | Nasdaq Boardvantage: AI-driven governance on PostgreSQL and AI Foundry | Wed Nov 19, 11:30 AM to 12:15 PM

Theatre Sessions

Session Code | Session Name | Date and Time (PST)
THR712 | Hybrid workload compliance from policy to practice on Azure | Tue Nov 18, 3:15 PM to 3:45 PM
THR701 | From Container to Node: Building Minimal-CVE Solutions with Azure Linux | Wed Nov 19, 3:30 PM to 4:00 PM

Hands-on Lab

Lab 505: Fast track your Linux and PostgreSQL migration with Azure Migrate

- Tue Nov 18, 4:30 PM to 5:45 PM
- Wed Nov 19, 3:45 PM to 5:00 PM
- Thu Nov 20, 9:00 AM to 10:15 AM

This interactive lab helps you assess, plan, and execute Linux and PostgreSQL migrations at scale using Azure Migrate’s end-to-end tooling.

Meet the Linux on Azure Team at Ignite

If you are attending in person, come say hello. Visit the Linux on Azure Expert Meetup stations inside the Microsoft Hub. You can ask questions directly to Microsoft’s Linux engineering and product experts, explore demos across Azure Linux, compliance, and migration, and get recommendations tailored to your workloads. We always love meeting customers and partners.
shreyabaheti
Nov 17, 2025, Linux and Open Source Blog
Dalec: Declarative Package and Container Builds
Build once, deploy everywhere. From a single YAML specification, Dalec produces native Linux packages (RPM, DEB) and container images - no Dockerfiles, no complex RPM spec or control files, just declarative configuration. Dalec, a Cloud Native Computing Foundation (CNCF) Sandbox project, is a Docker BuildKit frontend that enables users to build system packages and container images from declarative YAML specifications. As a BuildKit frontend, Dalec integrates directly into the Docker build process, requiring no additional tools beyond Docker itself.
SertacOzercan
Oct 29, 2025, Linux and Open Source Blog