best practices
GitHub Copilot App - Canvas Is Not a UI Builder
What if your development environment didn't just help you write code, but helped you observe, steer, and evolve a living system while it runs? That's the shift GitHub Copilot App Canvas represents. Canvas redefines how developers interact with agent-driven software: not by building traditional user interfaces, but by creating interactive environments where humans and AI co-create, test, and iterate in real time.

This post walks through a real Canvas extension we built, a Multi-Agent Dev Canvas that demonstrates how Canvas becomes a runtime observability and control plane for an agent-driven system. We'll cover why Canvas exists, how it differs from traditional UI development, and how you can use it to accelerate the design-test-evolve loop for any multi-agent application.

The Misconception: "Canvas Is for Building UIs"

The first instinct many developers have when they see Canvas is to treat it like a UI framework, a place to build dashboards, boards, or user-facing applications. That's not what Canvas is for. Here's the distinction that matters:

- Traditional UIs are for using software. They serve end-users who interact with a finished product.
- Canvas is for shaping software while it runs. It serves developers and AI agents who are actively building, testing, and evolving a system.

Canvas solves problems your final UI should never try to solve in a visible way. It's the observability layer, the control plane, the validation surface: all the things you need during development that disappear before production. Think of it this way: you wouldn't ship your debugger to users, but you absolutely need it while building.

What We Built: A Multi-Agent Dev Canvas

To demonstrate Canvas as a development runtime, we built a Multi-Agent Dev Canvas, a standalone GitHub Copilot Canvas extension (this repo, copilot-canvas-runtime) that treats an entire multi-agent system as a living, observable environment.
The same pattern applies to any agent-driven system built on services such as Microsoft Foundry.

The Multi-Agent Dev Canvas: a runtime observability and control plane where developers and AI agents collaborate to design, test, and evolve an agent-driven system in real time.

The canvas provides four integrated panels:

System View: See Your Agents Working

Five specialised agents are displayed as live cards with real-time status indicators. Each card shows the agent's name, responsibility, current status (idle, running, done, or error), task count, and last action taken. When an agent is active, its card pulses blue. When it fails, it glows red. You see the system breathe.

- decompose_system: Breaks requirements into agent tasks
- execute_workflow: Coordinates agents to perform tasks
- validate_output: Runs evaluation tests and returns structured results
- update_system_design: Modifies architecture based on feedback
- track_state: Persists and updates system state over time

Task Flows: Watch Work Move Through the Pipeline

Below the agents, a flow graph visualises how tasks route between agents. When you decompose a system requirement like "Build an AI-powered code review agent," the canvas shows five components (pr-ingestion, code-analysis, feedback-generator, learning-loop, notification-service) flowing from the decomposer to the executor and designer agents. Each flow carries a status badge: pending, pass, or fail.

Validation Panel: Continuous Testing, Not Afterthought Testing

The validation panel displays structured test results with pass/fail badges and reasoning.
When you run validation, each test case is evaluated against specific criteria:

- ✅ "PR ingestion handles large diffs" - Meets criteria: process diffs over 5,000 lines without timeout
- ❌ "Feedback is actionable" - Failed: does not satisfy criteria that each suggestion includes a code fix
- ✅ "Learning loop converges" - Meets criteria: accept rate improves over 10 iterations
- ✅ "Notifications are non-blocking" - Meets criteria: delivery latency under 500ms

This isn't a test runner you invoke separately; it's a validation surface embedded in the development loop. You see failures the moment they happen, in context, alongside the agents and flows that produced them.

Live State Timeline: Every Mutation, Visible

The right panel tracks every state change with timestamps. Decomposition events, workflow executions, validation runs, and failure injections all appear chronologically. This is the system's memory, visible to both the human developer and the AI agents working alongside them.

Canvas as a Runtime: The Key Capabilities

What makes Canvas a runtime rather than a display layer is that the agent can act through it. The canvas exposes seven agent-callable actions:

| Action | What It Does |
| --- | --- |
| decompose_system | Accept requirements and components, generate task flows, update the system design |
| execute_workflow | Run pending tasks through the agent pipeline, produce artifacts |
| validate_output | Evaluate test cases against criteria, return structured pass/fail with reasoning |
| update_system_design | Modify the architecture description, constraints, or component list live |
| track_state | Read the full system state: agents, flows, validations, history, artifacts |
| inject_failure | Force an agent into an error state to test system adaptation |
| pause_resume | Toggle execution on and off |

The human developer can click Decompose, Execute, or Validate directly in the canvas. The AI agent can invoke the same actions programmatically.
Both parties operate on the same surface, the same state, the same system; that's what makes Canvas collaborative in a way traditional tooling is not.

Why This Matters: Canvas vs. Figma vs. Traditional UIs

It helps to position Canvas against tools developers already know:

- Figma is Human-to-Human collaboration on design. Multiple people interact with the same visual surface, but nothing executes. It's a design tool.
- Traditional UIs are Human-to-System. Users interact with finished software through a polished interface.
- Canvas is Human-to-AI-to-System. It's a shared space where things actually execute. The developer steers, the AI acts, and the system evolves, all visible, all in real time.

Canvas is collaborative in the Figma sense: it's a shared space, it's visual, and multiple participants interact with the same surface. But unlike Figma, the participants include AI agents, and the surface isn't a mockup; it's a live system.

How the Extension Works: Under the Hood

A Canvas extension is a standard GitHub Copilot CLI extension, a single extension.mjs file that speaks JSON-RPC over stdio. The key components:

1. State Management

Each canvas instance maintains its own system state: agents, task flows, validations, a state history timeline, artifacts, and the current system design. State is held in-memory per instance and pushed to the iframe via Server-Sent Events whenever it changes.

    function createInitialState() {
      return {
        agents: [
          { id: "decomposer", name: "decompose_system", status: "idle", responsibility: "Break requirements into agent tasks" },
          { id: "executor", name: "execute_workflow", status: "idle", responsibility: "Coordinate agents to perform tasks" },
          // ... three more agents
        ],
        taskFlows: [],
        validations: [],
        stateHistory: [],
        artifacts: [],
        systemDesign: { description: "", constraints: [], components: [] },
        execution: { paused: false, stepCount: 0 },
      };
    }

2. Real-Time Updates via Server-Sent Events

The canvas runs a loopback HTTP server per instance.
The iframe connects to an /events endpoint and receives state updates as they happen: no polling, no websocket complexity.

    if (req.url === "/events") {
      res.writeHead(200, { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" });
      clients.add(res);
      // Push current state immediately on connect
      res.write(`data: ${JSON.stringify(getState(instanceId))}\n\n`);
    }

3. Dual Interaction Model

Every action is available through two paths. The human clicks a button in the iframe, which POSTs to the local server. The AI agent calls invoke_canvas_action through the SDK. Both paths mutate the same state and trigger the same SSE broadcast. Neither is privileged over the other.

4. Canvas Declaration

The canvas registers with the Copilot SDK using createCanvas, declaring its identity, description, and all agent-callable actions with JSON Schema validation on inputs:

    createCanvas({
      id: "multi-agent-dev",
      displayName: "Multi-Agent Dev Canvas",
      description: "Runtime observability and control plane for multi-agent development",
      actions: [
        {
          name: "decompose_system",
          description: "Break requirements into agent tasks",
          inputSchema: {
            type: "object",
            properties: {
              requirements: { type: "string" },
              components: { type: "array", items: { type: "string" } }
            },
            required: ["requirements"]
          },
          handler: async (ctx) => { /* ... */ },
        },
        // ... six more actions
      ],
      open: async (ctx) => { /* start server, return URL */ },
      onClose: async (ctx) => { /* clean up */ },
    });

Scenarios This Enables

The Multi-Agent Dev Canvas supports four development scenarios that would be impossible with traditional tooling:

1. End-to-End Feature Design

Tell the agent "Build an AI-powered code review system." Watch it decompose the requirement into five components, route tasks to specialist agents, execute the workflow, and validate the outputs, all visible in real time. Iterate by modifying constraints or components and re-running.

2. Live Agent Collaboration Observation

See how agents hand off work to each other. The flow graph shows which agent produced what, which tasks are pending, and where bottlenecks form. This is the kind of observability you need when debugging multi-agent orchestration but would never expose in a production UI.

3. Fault Injection and Adaptation Testing

Use inject_failure to force an agent into an error state. Watch how the system responds. Does the orchestrator recover? Do downstream tasks fail gracefully? This chaos-engineering approach, applied during development and visible in real time, catches integration failures before they reach production.

4. Validation-Driven Iteration

Define test criteria, run validation, see which tests fail, update the system design, re-run. The validation panel isn't a separate CI pipeline; it's embedded in the development surface, creating a continuous feedback loop between design decisions and their measurable outcomes.

Getting Started: Build Your Own Canvas Extension

To create a Canvas extension in your own project:

- Read the SDK docs: Run extensions_manage({ operation: "guide" }) in GitHub Copilot CLI to get the canonical documentation paths.
- Scaffold: Run extensions_manage({ operation: "scaffold", kind: "canvas", name: "my-canvas", location: "project" }) to generate the boilerplate.
- Implement: Edit extension.mjs with your canvas logic: state model, actions, renderer HTML, and SSE updates.
- Reload: Run extensions_reload to activate your changes.
- Drive: Open with open_canvas, invoke actions with invoke_canvas_action, and iterate.

The canvas extension lives in .github/extensions/your-canvas/extension.mjs for project-scoped extensions, or in your user extensions directory for personal use. No package.json is needed; the github/copilot-sdk import is auto-resolved.

Key Takeaways

- Canvas is a development runtime, not a UI framework. You don't build Canvas instead of your UI; you use Canvas to figure out, test, and evolve the UI and system before and during building it.
- Canvas solves problems your final UI should never expose. Agent observability, fault injection, live state mutation, and validation feedback loops are development concerns, not user concerns.
- Canvas is Human-to-AI-to-System collaboration. Both the developer and the AI agent operate on the same surface, the same state, the same running system. It's Figma-like collaboration, but with AI agents, and things actually execute.
- Canvas turns debugging, testing, and execution into a continuous visual feedback loop. Instead of switching between an editor, a terminal, a test runner, and a monitoring dashboard, you have one surface where the system lives and evolves.
- Canvas extensions are lightweight. With a single extension.mjs file, no dependencies, and a loopback HTTP server with SSE, the infrastructure gets out of the way so you can focus on the system you're building.

The Bigger Picture

Canvas redefines software development by shifting from writing static code to orchestrating living systems. Developers and AI co-create, observe, and evolve solutions in real time. Instead of building UIs for users, we build interactive environments for agents, turning debugging, testing, and execution into a continuous, visual feedback loop that accelerates innovation and brings ideas to production faster than ever.

The Multi-Agent Dev Canvas we built here is one example. The pattern applies anywhere you're building agent-driven systems: AI orchestration, workflow automation, data pipelines, autonomous services. Anywhere you need to see, steer, and validate a complex system as it runs, that's where Canvas belongs.
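The dual interaction model described earlier (a button click and an agent call mutating one state and triggering one SSE broadcast) can be sketched in a few lines. This is an illustration in Python rather than the extension's JavaScript, and every name in it (CanvasState, dispatch, format_sse, the source argument) is hypothetical; only the `data: <json>` framing followed by a blank line comes from the Server-Sent Events format itself.

```python
import json
from typing import Callable


def format_sse(state: dict) -> str:
    """Frame a JSON payload as one Server-Sent Events message."""
    return f"data: {json.dumps(state)}\n\n"


class CanvasState:
    """Hypothetical in-memory state for one canvas instance plus its subscribers."""

    def __init__(self) -> None:
        self.state = {"execution": {"paused": False}, "stateHistory": []}
        self.clients: list[Callable[[str], None]] = []

    def subscribe(self, write: Callable[[str], None]) -> None:
        """Register a client and push the current state immediately on connect."""
        self.clients.append(write)
        write(format_sse(self.state))

    def dispatch(self, action: str, source: str) -> None:
        """Single entry point shared by the human path and the agent path."""
        if action != "pause_resume":
            raise ValueError(f"unknown action: {action}")
        execution = self.state["execution"]
        execution["paused"] = not execution["paused"]
        self.state["stateHistory"].append({"action": action, "source": source})
        # Same broadcast regardless of who triggered the change.
        for write in self.clients:
            write(format_sse(self.state))


canvas = CanvasState()
frames: list[str] = []
canvas.subscribe(frames.append)                  # stands in for an /events connection
canvas.dispatch("pause_resume", source="human")  # e.g. a button click POSTing to the server
canvas.dispatch("pause_resume", source="agent")  # e.g. an invoke_canvas_action call
print(len(frames), canvas.state["execution"]["paused"])
```

Routing both callers through one dispatch function is what keeps neither path privileged: there is only one place where state changes and one place where subscribers are notified.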
Resources

- copilot-canvas-runtime: this repository, with the Multi-Agent Dev Canvas extension, scenario, and demo prompt
- GitHub Copilot Documentation: Official documentation for GitHub Copilot features
- Microsoft Foundry Documentation: Build and deploy AI agents with Microsoft Foundry

MCP for Beginners: Why Every AI Engineer and Developer Should Learn the Model Context Protocol
If you have spent any time building with large language models in the last year, you have hit the same wall everyone hits: your model is brilliant at reasoning but blind to the real world. It cannot read your database, call your internal API, search your documents, or trigger a deployment unless you hand-write glue code for every single integration. The Model Context Protocol (MCP) exists to tear that wall down, and Microsoft's open-source MCP for Beginners curriculum (reachable via the short link https://aka.ms/mcp-for-beginners) is the most complete, hands-on way to learn it.

This post explains what MCP is, walks through the latest updates to the course, shows real code, and makes the case for why MCP belongs on your learning roadmap right now, whether you are an AI engineer shipping agents to production, a developer wiring tools into Copilot, or a student trying to build a standout portfolio project.

What is MCP, and why does it matter?

Think of MCP as a universal translator for AI applications. Just as a USB-C port lets you connect any peripheral to any laptop without a custom cable per device, MCP lets an AI model connect to any tool or data source through one standardized protocol. The course uses exactly this analogy, and it holds up well.

Before MCP, integrations were an M × N problem: every one of your M AI applications needed bespoke code to talk to each of your N tools. MCP turns that into an M + N problem. Build a tool once as an MCP server, and any MCP-compatible client (Claude Desktop, VS Code, Cursor, GitHub Copilot, and many others) can use it immediately.

The protocol is built on a clean client–server model with a small set of primitives:

- Tools: functions the model can call (query a database, send an email, run code).
- Resources: data the server exposes for context (files, records, documents).
- Prompts: reusable, parameterized prompt templates.
- Sampling: a server asking the client's LLM to generate a completion, enabling collaborative workflows.
- Elicitation: a server requesting structured input from the user mid-task.
- Roots: boundaries that tell a server which directories or resources it is allowed to operate on.

Communication runs over JSON-RPC, with transports for local processes (stdio) and remote servers (streamable HTTP). That standardization is the whole point: write to the spec, and you interoperate with the entire ecosystem.

What's new: the latest updates to the course

The MCP for Beginners curriculum is actively maintained, and the public changelog reads like a release log for a living product. Here are the most important recent changes, drawn directly from that changelog.

1. Aligned to MCP Specification

The biggest update: the entire curriculum has been validated against the current MCP Specification 2025-11-25 and the latest official SDKs. Stale references to older spec revisions (2025-03-26 and 2025-06-18) were corrected across the security, transport, real-time search, sampling, and stdio-server modules, with links repointed to the canonical modelcontextprotocol.io spec paths. A gap analysis confirmed the course already covers every primitive introduced or expanded in the latest spec:

- Sampling: covered in lesson 3.14 and Advanced Topics.
- Elicitation (including URL mode): in Core Concepts and Protocol Features.
- Roots: in the Introduction, Core Concepts, and Root Contexts.
- Tasks (experimental, long-running operations): in Core Concepts and Protocol Features.
- Tool Annotations (readOnlyHint / destructiveHint): in Core Concepts and Protocol Features.

2. Samples validated against current SDKs

Code that does not run is worse than no code at all, so the maintainers re-validated the core samples:

- TypeScript: @modelcontextprotocol/sdk resolved to 1.29.0; a tsc --noEmit type-check passed with no errors, so the McpServer and StdioServerTransport APIs remain valid.
- Python: validated in an isolated virtual environment with mcp[cli] (1.27.2); FastMCP.list_tools() correctly returned the sample add and subtract tools.
- SDK version pins across labs were bumped (for example mcp>=1.26.0) and lockfiles regenerated so every sample tracks the current release.

3. A serious security pass

Security is treated as a first-class concern, not an afterthought. A full audit across every dependency manifest and the sample source code was run, and npm audit now reports 0 vulnerabilities in every audited directory. Highlights:

- Transitive npm advisories (in the MCP Inspector dev tool, the OpenAI client, and the SDK) were remediated by bumping @modelcontextprotocol/inspector to 0.22.0 and pinning a patched shell-quote.
- A real code-level command-injection fix (OWASP A03): an open_in_vscode tool that used subprocess.run(..., shell=True) was rewritten to launch the resolved executable directly with no shell, closing a metacharacter-injection vector.
- Python dependencies were audited with pip-audit, and a vulnerable transitive werkzeug was pinned to a patched >=3.1.6.

For anyone learning to ship agents, this is gold: the course demonstrates the whole secure-development loop, not just the happy path.

4. New lessons and a growing curriculum

The curriculum keeps expanding with practical, modern lessons:

- 5.17 Adversarial Multi-Agent Reasoning: two agents argue opposite sides of a question using shared MCP tools (web_search + run_python), judged by a third agent. Includes a Mermaid architecture diagram, orchestrators in Python, TypeScript, and C#, and use cases like hallucination detection, threat modeling, and API design review.
- 3.12 MCP Hosts: configuration for Claude Desktop, VS Code, Cursor, Cline, and Windsurf, with JSON templates and a transport comparison table.
- 3.13 MCP Inspector: a debugging guide for testing tools, resources, and prompts.
- 4.1 Pagination: cursor-based pagination patterns in Python, TypeScript, and Java.
- 5.16 Protocol Features: progress notifications, request cancellation, resource templates, and lifecycle management.

5. Microsoft product rebranding

Content was updated to reflect Microsoft's rebranding: Azure AI Foundry → Microsoft Foundry, and the AI Toolkit (AITK) → Microsoft Foundry Toolkit Extension for VS Code. If you have seen older tutorials referencing the previous names, the curriculum is now current.

Your first MCP server: see how little code it takes

The course's "first server" lesson builds a simple calculator. Here is the shape of a minimal MCP server in Python using FastMCP, which mirrors the validated sample in the repo. Notice how the protocol plumbing disappears: you just decorate functions.

    # server.py: a minimal MCP server with two tools
    from mcp.server.fastmcp import FastMCP

    # Name your server; this identifies it to MCP clients
    mcp = FastMCP("Calculator")

    @mcp.tool()
    def add(a: int, b: int) -> int:
        """Add two numbers and return the result."""
        return a + b

    @mcp.tool()
    def subtract(a: int, b: int) -> int:
        """Subtract b from a and return the result."""
        return a - b

    if __name__ == "__main__":
        # Run over stdio so local hosts (VS Code, Claude Desktop) can connect
        mcp.run()

The same idea in TypeScript, using the official SDK validated at version 1.29.0:

    // server.ts: minimal MCP server in TypeScript
    import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
    import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
    import { z } from "zod";

    const server = new McpServer({ name: "Calculator", version: "1.0.0" });

    // Register a tool with a typed input schema
    server.tool(
      "add",
      { a: z.number(), b: z.number() },
      async ({ a, b }) => ({
        content: [{ type: "text", text: String(a + b) }],
      })
    );

    // Connect over stdio and start listening
    const transport = new StdioServerTransport();
    await server.connect(transport);

That is a complete, runnable server.
The docstrings and schemas matter: MCP exposes them to the model so it knows when and how to call each tool. Clear descriptions are effectively prompt engineering for your tools. A common pitfall is leaving them vague, which leads to the model misusing or ignoring the tool.

Connecting it in VS Code

Once your server runs, an MCP host connects to it. A typical VS Code / host configuration looks like this:

    {
      "servers": {
        "calculator": {
          "command": "python",
          "args": ["server.py"]
        }
      }
    }

Lesson 3.12 (MCP Hosts) covers the equivalent JSON for Claude Desktop, Cursor, Cline, and Windsurf, and lesson 3.13 shows how to use the MCP Inspector to test your tools before wiring them into a host, which is the single best debugging habit you can build early.

How the course is structured

The curriculum is organized as a progressive journey with hands-on code in C#, Java, JavaScript, Python, Rust, and TypeScript. It is grouped into phases:

- Foundations (Modules 0–2): Introduction, Core Concepts, and Security.
- Building (Module 3): Getting Started, with 15 lessons covering your first server and client, LLM clients, VS Code integration, stdio and HTTP streaming, testing, deployment, auth, hosts, the Inspector, sampling, and MCP Apps.
- Growing (Modules 4–5): Practical Implementation and Advanced Topics, with 17 advanced lessons including Azure integration, OAuth2, Entra ID auth, scaling, multi-modality, context engineering, custom transports, and adversarial multi-agent reasoning.
- Mastery (Modules 6–11): Community Contributions, Lessons from Early Adoption, Best Practices, Case Studies, a Microsoft Foundry Toolkit workshop, and an end-to-end 13-lab PostgreSQL capstone.

That final module is the standout for portfolio building: a complete, production-flavored path that takes you from architecture and row-level security through database design, a FastMCP server, semantic search with pgvector and Azure OpenAI, testing, Docker deployment to Azure Container Apps, and monitoring with Application Insights.
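Before moving on, it can help to see what the SDK decorators hide. Each tool call is a JSON-RPC exchange, and the sketch below hand-builds one for the calculator's add tool using only the standard library. It is an illustration, not SDK code: the helper names and the toy handle function are hypothetical, and the message fields (the tools/call method, params.name, params.arguments, a content list of text items, isError) follow the MCP specification's tool-call shape as we understand it, so check the spec before relying on them.

```python
import json


def make_tool_call(request_id: int, tool: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request asking a server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def make_error(request_id: int, code: int, message: str) -> dict:
    """Build a JSON-RPC 2.0 error response."""
    return {"jsonrpc": "2.0", "id": request_id, "error": {"code": code, "message": message}}


def handle(request: dict) -> dict:
    """Toy stand-in for a server that only knows the calculator's add tool."""
    if request["method"] != "tools/call":
        return make_error(request["id"], -32601, "Method not found")
    params = request["params"]
    if params["name"] != "add":
        return make_error(request["id"], -32602, f"Unknown tool: {params['name']}")
    args = params["arguments"]
    total = args["a"] + args["b"]
    return {
        "jsonrpc": "2.0",
        "id": request["id"],  # responses echo the request id
        "result": {"content": [{"type": "text", "text": str(total)}], "isError": False},
    }


request = make_tool_call(1, "add", {"a": 2, "b": 3})
response = handle(request)
print(json.dumps(request))
print(json.dumps(response))
```

The result mirrors what the TypeScript sample returns by hand (a content array with one text item), which is why the same server works with any client that speaks the protocol.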
Why developers should learn MCP now

For AI engineers

MCP is becoming the default integration layer for agents. Instead of re-implementing tool calling for every framework, you write to one open protocol and your tools work everywhere. The advanced modules (sampling, roots, elicitation, scaling, routing, and adversarial multi-agent patterns) are exactly the techniques you need to move agents from demo to production.

For developers

MCP is already wired into tools you use daily: VS Code, GitHub Copilot, Claude Desktop, Cursor, and more. Learning to build an MCP server means you can expose your systems (internal APIs, databases, CI/CD) to AI assistants safely. The security-first approach in the course (OAuth2, Entra ID, RBAC, dependency auditing) teaches you to do this the right way from day one.

For students

MCP is a rare opportunity to learn a technology while it is still early, with a free, beginner-friendly, Microsoft-maintained curriculum and code in six languages. The 13-lab capstone alone is a genuine portfolio project. And with content translated into 50+ languages, the barrier to entry is low no matter where you are.

Responsible and secure by design

A recurring theme worth calling out: the course does not treat security and governance as optional extras. It models real practices you should carry into your own work:

- Least privilege via roots: constrain what a server can touch.
- Tool annotations: mark tools readOnlyHint or destructiveHint so clients can warn users before destructive actions.
- No shells for user input: the command-injection fix is a textbook example of why you never pass untrusted input through a shell.
- Dependency hygiene: audit with npm audit and pip-audit, and pin patched releases.
- Proper auth: dedicated lessons on OAuth2 and Microsoft Entra ID.

Key takeaways

- MCP standardizes how AI connects to tools and data, turning a combinatorial integration problem into a simple, reusable one.
- The course is current, validated against MCP Specification 2025-11-25 with SDKs at TypeScript 1.29.0 and Python mcp 1.27.2.
- Samples actually run, and the repo demonstrates a full secure-development loop with 0 reported vulnerabilities after auditing.
- It is broad and deep: from a 10-line calculator server to a 13-lab production capstone, in six languages.
- It is the fastest credible path to MCP fluency for AI engineers, developers, and students alike.

Get started today

- Open the course: https://aka.ms/mcp-for-beginners (redirects to the GitHub repository).
- Fork and clone it. Use a sparse checkout to skip translations for a faster download:

      git clone --filter=blob:none --sparse https://github.com/microsoft/mcp-for-beginners.git
      cd mcp-for-beginners
      git sparse-checkout set --no-cone "/*" "!translations" "!translated_images"

- Build your first server with lesson 3.1 in your language of choice.
- Debug it with the MCP Inspector, then connect it in VS Code.
- Go deep with the 13-lab database capstone, and read the official spec at modelcontextprotocol.io.
- Track what's new in the changelog and join the community discussions.

MCP is quietly becoming the connective tissue of the AI ecosystem. The earlier you learn it, the more leverage you will have, and Microsoft's MCP for Beginners is the clearest on-ramp available. Star the repo, build a server this week, and start connecting your AI to the world.

How are you handling performance review data across Teams, Outlook, and SharePoint?
We just wrapped up our mid-year review cycle and it was a mess. Managers had to go through old Teams chat messages to find feedback they gave, or go through our SharePoint documents to see where everyone's goals were. Do you think there is a way I can use Copilot inside Microsoft Teams to somehow parse through all of this before reviews? I saw this app, performance 365 or something like that, in the app store, but I don't think it's an official Microsoft app. Any ideas?

How Many Copies of Each Layer Does Your Container Registry Actually Need?
Authors: Payal Mahesh and Vicky Lin
Azure Container Registry team: Jeanine Burke and Johnson Shi

Introduction

It's Monday morning. You spin up a fresh 1,000-node AKS cluster for a big training run or a fleet-wide rollout. Every node reaches for the same large container image at the same instant. What actually happens in the next ten minutes - and whether your pods reach Ready in 9 minutes or 14 - turns out to depend on a single number you've probably never thought about: how many copies of each image layer exist behind your registry.

On the surface, you see a single capacity number for your registry size - but behind that abstraction, Azure Container Registry maintains copies of your layer data to optimize pull performance. That number of copies directly determines the read throughput available per layer. Each copy can serve requests independently, so distributing the layer across storage allows it to be read in parallel. More copies mean more independent readers - and higher aggregate throughput when thousands of nodes pull at once.

The intuitive answer is that more is better: add copies, get faster pulls. When we actually tested it at 1,000-node scale, the truth turned out to be more interesting:

- A few extra copies helped a little.
- A moderate number helped a lot, and eliminated storage throttling entirely.
- A large number helped no more than the moderate one.
- A huge number actually made pulls slower again.

Think of it like opening checkout lanes at a grocery store. Opening a few more lanes when the store is slammed cuts the line dramatically. Past a certain point, though, extra lanes barely help, because by then it's the customers, not the cashiers, who are the bottleneck. And open too many? Now the staff is spread thin and tripping over each other, and the line moves worse than it did at the sweet spot.
This post walks through what we measured, why the curve bends where it does, and what we're building next so finding that sweet spot isn't something anyone has to do by hand.

Key Takeaways

- There's a sweet spot, not a slope. Adding copies per layer cut pod-startup P99 by 27% and raised P50 per-node egress throughput by 244%, but only up to a point. Past that, the returns vanish, and far past it, latency actually regresses.
- Storage throttling is the real enemy. The win comes from spreading load across enough storage backends that no single backend gets pinned at its egress ceiling. Once throttling is gone, more copies stop helping.
- Storage scale alone has a ceiling. Even at the sweet spot, the per-backend egress limit caps total throughput. The next jump in performance has to come from somewhere else, which is exactly what we're building (see What's Next).
- This isn't something customers should need to manage. We're building a proactive, on-demand storage scaling capability that automatically grows the footprint before throttling happens and shrinks it back when the burst is over.

A quick bit of background

Within a region, the layer data behind your container images is backed by Azure storage. The number of copies ACR maintains per layer determines how many independent storage backends a concurrent-pull workload can spread its reads across. That's what matters, because each backend has a finite egress ceiling. Once concurrent reads against one backend get close to that ceiling, requests start getting throttled, and your pulls slow down in proportion.

The principle is simple: more copies per layer means more backends serving the same data, which means more total egress headroom and fewer throttled requests. What we wanted data on was how many, and where it stops helping.
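The background principle can be written down as a toy calculation. The sketch below assumes reads spread perfectly evenly across copies, and the demand and ceiling figures are invented for illustration; they are not ACR limits or measurements from our tests. It also deliberately ignores the fan-out cost that shows up at very high copy counts, so it captures only the "more backends, more headroom" half of the story.

```python
import math


def per_backend_egress(total_demand_gbps: float, copies: int) -> float:
    """Egress each backend serves if reads spread evenly over all copies."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return total_demand_gbps / copies


def min_copies_without_throttling(total_demand_gbps: float, ceiling_gbps: float) -> int:
    """Smallest copy count that keeps every backend at or below the ceiling."""
    if ceiling_gbps <= 0:
        raise ValueError("ceiling must be positive")
    return max(1, math.ceil(total_demand_gbps / ceiling_gbps))


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
TOTAL_DEMAND_GBPS = 400.0     # aggregate pull demand from the whole cluster
BACKEND_CEILING_GBPS = 60.0   # per-backend egress ceiling

for copies in (2, 4, 8, 16):
    load = per_backend_egress(TOTAL_DEMAND_GBPS, copies)
    status = "throttled" if load > BACKEND_CEILING_GBPS else "ok"
    print(f"{copies:>2} copies: {load:6.1f} Gbps per backend ({status})")

print(min_copies_without_throttling(TOTAL_DEMAND_GBPS, BACKEND_CEILING_GBPS))
```

In this model throttling disappears as soon as demand divided by copies drops under the ceiling, and extra copies past that point change nothing. Real traffic is not evenly spread, so the hottest backend can hit its ceiling earlier than the average suggests.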
How we tested

We ran a controlled series of large-scale pull tests against ACR Premium on a roughly 1,000-node cluster, with every node pulling the same large image cold at the same time (no local cache on any node). The only thing we changed between runs was the number of per-layer copies behind a single registry endpoint. Everything else, including rate limits, the image, node count, and concurrency, stayed constant.

For each run we measured pod-startup latency (P50/P90/P99), end-to-end storage read latency, egress throughput distributions (P50-P99.9), and storage throttling events. Pod-startup latency is our headline metric, because it's the one number that reflects the actual customer experience no matter where the bottleneck happens to be. Per-node egress throughput matters too, though. It tells you directly how much pull bandwidth ACR delivers to your fleet, and it's usually what customers have in mind when they ask how much faster extra copies will make their pulls. We report egress as a distribution rather than a single average, since per-request and per-time-window views can tell very different stories about the same set of pulls.

These are observations from a single controlled environment, not a service guarantee. Absolute numbers will move with image size, node count, layer composition, network topology, and concurrency.

What we found

We tested five configurations, sweeping from a low baseline number of per-layer copies up to a very high one. We name them by relative copy count rather than exact instance counts:

- Baseline: the lowest level, our reference point.
- Low: a modest step up from Baseline.
- Mid: a meaningful step up from Low.
- Higher: a further step up from Mid.
- Very high: the largest configuration we tested, well above Higher.

Here are the numbers. All percent changes are relative to Baseline.
| Configuration | Pod startup P50 | Pod startup P90 | Pod startup P99 | Storage throttling events | Peak per-backend egress |
| --- | --- | --- | --- | --- | --- |
| Baseline (fewest copies) | 9m 36s | 11m 0s | 14m 16s | Many; all top backends above the egress ceiling | Highest |
| Low | 9m 27s (−2%) | 10m 14s (−7%) | 12m 59s (−9%) | Some; one backend still above the ceiling | High |
| Mid | 9m 25s (−2%) | 9m 45s (−11%) | 10m 22s (−27%) | Zero | Below the ceiling |
| Higher | 9m 20s (−3%) | 9m 37s (−13%) | 10m 22s (−27%) | Zero | Well below the ceiling |
| Very high | 9m 28s (−1%) | 10m 31s (−4%) | 13m 48s (−3%) | Zero | Lowest |

Look at the P99 pod-startup column from top to bottom: 14m 16s, 12m 59s, 10m 22s, 10m 22s, 13m 48s. It improves, flattens out, then climbs back up. Three things explain that shape:

1. The win: Throttling falls off a cliff at the Mid configuration

As we added copies per layer, per-backend egress fell and storage-side throttling decreased. At the Mid configuration, throttling errors hit zero, and they stayed at zero for every configuration above it. The upside isn't just that the errors went away, though. It's raw pull bandwidth. At the Mid sweet spot, the typical node saw its P50 egress throughput jump 244% over Baseline. With load spread across enough copies, each node pulled its layers off storage much faster, not just without stalling. For a workload owner, that's the difference between watching pods come up in a steady stream and watching them stall for tens of seconds at a time while throttling clears. Same image, same node count, same registry, very different experience. To put it in concrete terms: if your team runs a daily AI training kickoff that needs all 1,000 nodes pulling before the job can start, this is the difference between starting on time and starting four minutes late every day. Over a quarter of training runs, that adds up.

2. The surprise: more copies made pulls slower

This is the finding that genuinely surprised us.
Going from Higher to Very high, the largest configuration we tested, cost us 3 minutes and 26 seconds at P99: 10m 22s climbing back up to 13m 48s. That gave back almost the entire benefit we'd built up over the previous four configurations. Tail storage-read latency at Very high actually came out worse than Baseline. The Very high run is where the wheels came off, and the reason is the trade-off underneath. Once storage throttling is gone, more copies stop buying you anything, and the cost of fanning reads across that many backends starts to take over. The throughput distribution shows it clearly. P50 and P75 throughput had been climbing steadily and getting smoother through Mid and Higher, then dropped sharply at Very high while the peak P99/P99.9 spikes came back. Spread the same load across too many backends and it fragments into smaller, less consistent bursts. The takeaway is that "more is better" stops being true past the sweet spot, and the failure mode is quiet. You won't see throttling errors. You'll just see your pulls get slower. 3. What we didn't expect: at few copies, the hottest backend is what hurts you At the lowest copy counts, pull traffic wasn't spread evenly across the underlying storage footprint. Some backends absorbed far more traffic than others. As we added copies, that distribution evened out and the hottest backends cooled down. The implication is sharp. You can saturate the busiest backend, and trigger throttling, even when the total headroom across all your backends is large in aggregate. What matters is the load on the hottest backend, not the average. That's exactly the failure mode that demand-driven, proactive scaling (described below) is meant to head off before it happens. So how should you think about this? You don't size copies yourself; ACR manages the storage footprint behind your registry. Still, it helps to understand what moves the sweet spot, because the shape of your own workload is what decides where it lands. 
The bigger your worst-case concurrent burst (more nodes, larger images, higher concurrency), the more copies per layer it takes to keep pulls off the throttling ceiling, and the further out the sweet spot sits. Smaller workloads may already be sitting on the flat part of the curve. One thing is worth saying plainly. The storage footprint underneath is managed by ACR and shared across many registries, so there's no fixed, private storage budget that maps one-to-one to your workload. The sweet spot isn't a number you compute and provision; it's a behavior the platform has to land on for you, which is exactly why we're moving toward demand-driven scaling that handles it automatically. That's what brings us to what we're building next. What's next: proactive, on-demand storage scaling and a caching layer The fixed-copy tests above answer the question "how many should the ACR system provision?" but they assume a single, static answer. Real workloads aren't static. A 1,000-node burst happens at deploy time, not at 3 a.m. on a Tuesday. And no matter how many copies are provisioned, the per-backend storage ceiling still bounds peak deliverable throughput. So we're investing along two complementary directions. 1. Proactive, demand-driven storage scaling We're building a capability that adjusts the number of per-layer copies automatically based on real-time pull demand: Proactive, not reactive. The system scales the storage footprint before concurrent pull pressure pushes any single backend near the throttling threshold, so throttling is prevented before it forms rather than cleaned up after the fact. On-demand scale-out. The footprint expands automatically as sustained pull demand grows. Scale-in when demand subsides. The footprint contracts so you're not paying for steady-state capacity you only needed during a burst. Tiering for cold content. 
Long-tail, rarely-pulled content can sit on colder storage, so the redundant footprint of frequently-pulled content doesn't pay full hot-storage cost everywhere. The benefit to customers is straightforward: smoother pulls under burst, higher delivered throughput on average, no permanent over-provisioning, and no manual re-tuning as workloads grow.

2. A caching layer to absorb burst beyond the storage ceiling

Even a perfectly scaled storage footprint runs into the per-backend egress ceiling at extreme scale. To push past it, we're investing in a caching layer in the registry service that absorbs burst traffic before it ever reaches storage. A pull surge that hits the same set of layers, which is the common case for fleet-wide deployments, can be served largely from cache. That takes a lot of load off any single storage backend and complements the storage scaling above. We'll share results from this work in follow-up posts. If you have questions about scaling ACR for your workload, or about how we measure storage performance, reach out on the Azure Container Registry GitHub repository.

Note: All results in this post are based on controlled internal testing configurations and are intended to illustrate general scaling behavior rather than prescribe exact configurations.

Mastering Query Fields in Azure AI Document Intelligence with C#
Introduction

Azure AI Document Intelligence simplifies document data extraction, with features like query fields enabling targeted data retrieval. However, using these features with the C# SDK can be tricky. This guide highlights a real-world issue, provides a corrected implementation, and shares best practices for efficient usage.

Use case scenario

In the course of Azure AI Document Intelligence engineering tasks and code reviews, many developers have hit an error while trying to extract fields like "FullName", "CompanyName", and "JobTitle" using `AnalyzeDocumentAsync`. The error looks similar to:

Inner Error: The parameter urlSource or base64Source is required.

This kind of failure comes from a mix of parameter errors and SDK changes. The problematic C# code typically looks like this:

// Raw document bytes passed straight into the call
BinaryData data = BinaryData.FromBytes(Content);
var queryFields = new List<string> { "FullName", "CompanyName", "JobTitle" };
var operation = await client.AnalyzeDocumentAsync(
    WaitUntil.Completed,
    modelId,
    data,
    "1-2",
    queryFields: queryFields,
    features: new List<DocumentAnalysisFeature> { DocumentAnalysisFeature.QueryFields }
);

One reason this fails is that the developer is using `Azure.AI.DocumentIntelligence v1.0.0`, where `base64Source` and `urlSource` are handled internally by the SDK. Older examples that use `AnalyzeDocumentContent` no longer apply and lead to errors.

Practical Solution

Using AnalyzeDocumentOptions.
Alternative Method using manual JSON Payload.

Using AnalyzeDocumentOptions

The correct method uses AnalyzeDocumentOptions, which streamlines request construction. The steps are:

Prepare the document content:

BinaryData data = BinaryData.FromBytes(Content);

Create AnalyzeDocumentOptions:

var analyzeOptions = new AnalyzeDocumentOptions(modelId, data)
{
    Pages = "1-2",
    Features = { DocumentAnalysisFeature.QueryFields },
    QueryFields = { "FullName", "CompanyName", "JobTitle" }
};

- `modelId`: Your trained model’s ID.
- `Pages`: Specify pages to analyze (e.g., "1-2").
- `Features`: Enable `QueryFields`.
- `QueryFields`: Define which fields to extract.

Run the analysis:

Operation<AnalyzeResult> operation = await client.AnalyzeDocumentAsync(
    WaitUntil.Completed,
    analyzeOptions
);
AnalyzeResult result = operation.Value;

Why this works:

The SDK manages `base64Source` automatically.
This approach matches the latest SDK standards.
It results in cleaner, more maintainable code.

Alternative method using manual JSON payload

For advanced use cases where more control over the request is needed, you can create the JSON payload manually. The body has to carry the document itself as `base64Source` (or `urlSource`), which is exactly what the error above asks for; the query fields are passed as a parameter, as in the earlier call. For example:

// Build the request body by hand; the document goes in base64Source
var requestPayload = new { base64Source = Convert.ToBase64String(Content) };
string jsonPayload = JsonSerializer.Serialize(requestPayload);
BinaryData requestData = BinaryData.FromString(jsonPayload);
var operation = await client.AnalyzeDocumentAsync(
    WaitUntil.Completed,
    modelId,
    requestData,
    "1-2",
    queryFields: new List<string> { "FullName", "CompanyName", "JobTitle" },
    features: new List<DocumentAnalysisFeature> { DocumentAnalysisFeature.QueryFields }
);

When to use this approach:

Custom request formats
Non-standard data source integration

Key points to remember

Check the SDK version: breaking changes exist between preview versions and v1.0.0.
Use the built-in classes: prefer `AnalyzeDocumentOptions` for simpler, less error-prone integration.
Provide correct document input: wrap your content in `BinaryData` or use a direct URL.

Conclusion

Using AnalyzeDocumentOptions provides a cleaner and more reliable way to work with query fields in Azure AI Document Intelligence using C#. By aligning with the latest SDK approach, developers can simplify implementation, reduce common errors, and improve code maintainability. Keeping up with SDK enhancements and recommended practices ensures more accurate and efficient document data extraction.
As Azure AI capabilities continue to evolve, adopting modern integration patterns will help you build scalable and future-ready document processing solutions with greater confidence.

Reference

Official AnalyzeDocumentAsync Documentation.
Official Azure SDK documentation.
Azure Document Intelligence C# SDK support add-on query field.

Microsoft Leads a New Era of Software Supply Chain Transparency
Microsoft announces the general availability of Microsoft’s Signing Transparency (MST), a first-of-its-kind capability that brings unprecedented visibility and trust to our software supply chain. With this release, Microsoft is leading the industry by recording the builds of critical cloud services in a publicly readable and verifiable ledger that complies with the SCITT (Supply Chain Integrity, Transparency, and Trust) standard. This means every production software build for in-scope services, such as Azure Attestation, Azure Managed HSM (Hardware Security Module), Azure confidential ledger, and Microsoft Signing Transparency itself (with others to follow over time), is now logged in an immutable, tamper-evident record. Only builds that are in the MST ledger are deployed to production; this gives customers confidence that the supply chain for these critical services can be audited at any time. Notably, the MST ledger is fully open source and built to align with the emerging IETF SCITT standard. By embracing SCITT’s principles and open protocols, Microsoft ensures that MST not only secures our own ecosystem but also contributes to a broader industry movement toward standardized supply chain transparency. The open-source MST ledger serves as a verifiable trust anchor that any organization or researcher can inspect, audit, or even integrate with their own tooling. MST itself meets the highest levels of transparency: it is backed by a tamper-proof confidential ledger, open source, and independently verified. In other words, we are making the foundation of our trust model transparent and accessible to everyone, reinforcing that trust must be earned through proof, not just promises. This launch marks a major milestone in our commitment to Zero Trust principles, extending “never trust, always verify” all the way into the build itself. Building on a public preview introduced late last year, MST’s general availability delivers verifiable transparency at the software level.
It transforms traditional code signing with an additive trust layer that is accessible via an open verification model. Every new software update is accompanied by a publicly auditable proof of integrity, enabling security teams to proactively confirm that each update is authentic and unaltered. To help organizations get the most out of this capability, we are also introducing Ledger Explorer, a free offline tool that allows security teams to examine MST ledger entries, verify cryptographic proofs, and even validate the ledger’s integrity independently. This tool, combined with MST’s open design, ensures that every Microsoft customer, and the broader community, can hold us accountable in real time for the software we run on their behalf.

Key Benefits of Microsoft’s Signing Transparency (MST)

Verified Code Integrity: Every software release is cryptographically logged in MST’s ledgers. This makes each build tamper-evident and traceable. If an attacker attempts to inject malicious code or sign an unauthorized update, it will be evident through the well-defined validation step built into the SCITT standard. Organizations gain the assurance that code integrity can be independently confirmed at any time.

Independent Verification & Zero Trust: MST enables customers and auditors to verify software authenticity on their own, without having to rely solely on vendor attestations. For each update, Microsoft provides a transparency “receipt” (proof of logging) that you can use to prove the update was officially published and unaltered. This fosters a “don’t just trust, verify” approach, empowering security teams to double-check that everything running in their environment aligns with what Microsoft intended.

Audit-Trail & Compliance: The transparency ledger creates a permanent, auditable timeline of code deployments. Every entry is a record of what was released and when, backed by cryptographic proofs.
This simplifies compliance reporting and accelerates forensic analysis. In the event of an incident, you can quickly audit the ledger to see if any unexpected code was introduced. For highly regulated industries, MST offers concrete evidence of software integrity and policy compliance over time.

Leadership & Open Standards: We are delivering real transparency now, encouraging a future where all critical software is released with verifiable integrity. MST’s open source implementation and SCITT-compliant design exemplify our commitment to openness and collaboration. We believe widespread adoption of these standards will strengthen supply chain security for everyone, making trust verification a universal practice.

Next Steps

Microsoft’s Signing Transparency is more than a new security feature; it is an advance in trust technology. As threats grow more sophisticated, we must evolve the way we assure our customers about the software they depend on. With MST now generally available, we are leading by example: proving that it is possible to open up the traditionally opaque process of software deployment and turn it into a source of strength and trust, empowering each person with verifiable transparency. We invite the industry to join us on this journey and get started by reading the documentation and exploring Ledger Explorer today! Together, by embracing transparency and open standards, we can turn “trust but verify” from a slogan into an everyday reality for digital infrastructure.

Using Keycloak with Azure AD to integrate AKS Cluster authentication process
Integrating Azure Kubernetes Service (AKS) with Keycloak through Azure Active Directory (Azure AD) as an intermediary leverages Azure AD’s support for OpenID Connect (OIDC) to handle authentication and authorization. This integration enhances security, streamlines user management, and simplifies the authentication process for users accessing the AKS cluster.

Windows 365 and developer environments: how do you balance security and productivity?
Hi everyone, I’d like to raise a topic that we are currently struggling with, and I suspect many other organizations are facing the same challenge. We are in the process of establishing a Windows 365–based development environment, where developers work in Cloud PCs. This is largely driven by: a BYOD strategy security requirements (no sensitive code on unmanaged devices) the need for standardization However, this quickly becomes complex in practice. The core challenge We are trying to balance three competing priorities: 1. Security requirements No sensitive code on local devices Minimal attack surface Zero Trust principles and Conditional Access Full traceability of identity and actions 2. Developer needs Local admin rights to be able to do their work Freedom to install tools, SDKs, and runtimes Flexibility without constant blocking Fast iteration cycles The reality is that if it takes too long to get access or permissions, it breaks the developer workflow. 3. IT and governance Standardization of environments Manageability and patching License and cost control Compliance and auditability The practical dilemma Developers want to be local admins on their machines Security teams prefer: Just-In-Time access (PIM), or No admin privileges at all In practice: PIM tends not to work well for developers It introduces too much friction It disrupts flow and often leads to workarounds What we are currently exploring We are testing a model where: Developers work in Windows 365 Cloud PCs They use their regular corporate identity (Entra ID) Isolation is achieved through the environment, not separate accounts Developers have local admin rights within the Cloud PC However, this raises a new question: How do we secure an environment where the user is an admin? Questions to the community I would really appreciate insights from others who have been through similar scenarios: 1. Identity vs privilege Do you use the same identity for everything, or separate user/admin accounts? 
How far do you take identity separation? 2. Local admin rights Do you allow developers to have local admin rights? Is it permanent or Just-In-Time? If JIT, how do you make it work without impacting productivity? 3. Cloud-based development environments If you are using Windows 365, Dev Box, or AVD: Has this made it easier to relax restrictions? Or are you facing the same challenges, just in the cloud? 4. Guardrails instead of restrictions Instead of trying to prevent everything: EDR / endpoint protection Conditional Access Network isolation Monitoring and detection Has anyone successfully shifted from strict control to strong guardrails and detection? Current reflection I am starting to think that: Focusing on secure, isolated environments for development may be more effective than trying to tightly control every individual action. In other words: secure the platform not every single user behavior But this is far from straightforward. Purpose of this discussion The goal is to find a realistic blueprint that: maintains high developer productivity meets security requirements minimizes friction in day-to-day work Not something theoretically perfect, but something that actually works. If you have experience in this area, I would really value your input: what has worked well what has not worked key design decisions you would recommend Thanks in advance.

Communities tab in Teams
Hi there, I've recently had the Communities tab pop up in my Teams alongside the Teams and Channels tab. No one else in my organisation can see this yet and we aren't sure why. I know it's being rolled out on a timeline, but I was also wondering if it might be because I'm the only one in the org who has a Microsoft Viva Employee Communications and Communities licence? Does anyone have any insights into this? We'd like to make a bit of a rollout plan once this appears in our colleagues' Teams setups.

Struggling to get managers to actually use 1:1 meeting agendas in Teams
We've been trying to get our managers to run structured 1:1s with their direct reports using Teams. Right now they just hop on a call with no agenda and wing it. HR wants there to be a documented agenda, talking points from both sides, and some kind of record of what was discussed. We tried using Loop components and OneNote but managers find it clunky to set up every time and most of them just stopped doing it after a few weeks. Is there a better way to handle recurring 1:1 meeting agendas directly in Teams?