Set Up Plaud Note Pro with Microsoft Foundry
Prerequisites

- Riffado, up and running: follow the setup guide in the official Riffado repository to get it going with Docker Compose.
- A Microsoft Foundry (formerly Azure AI Foundry) resource, with the models you want deployed; in my case, whisper for transcription and o3-mini for summaries.
- A Plaud device, or any audio recordings you can import into Riffado.

Once Riffado is up, head to the Settings page > Providers > Add Provider, and select Custom. This is where the Azure details will go.

Why "OpenAI-compatible" isn’t one thing on Microsoft Foundry

Azure AI Foundry exposes two different API surfaces on the same resource, and which one serves your model depends on the model:

| Surface | Path shape | Serves | OpenAI-compatible? |
| --- | --- | --- | --- |
| v1 route | /openai/v1/… | gpt-4o-transcribe, gpt-4o-mini-transcribe, chat models, embeddings | Yes: Bearer auth, model in the body, no api-version needed |
| Classic route | /openai/deployments/{name}/… | Whisper (and other legacy audio) | No: deployment name lives in the URL, and ?api-version= is mandatory |

A generic OpenAI client (Riffado's included) can only speak the first dialect. It has nowhere to put a deployment name in the path and no way to append a query parameter. That single fact drives everything below.

Part 1 - Transcription
Whisper and the DeploymentNotFound mystery

Symptom

My very first transcription attempt in Riffado failed with 404 Resource not found. Off to a flying start. Configured provider: base URL https://<resource>.services.ai.azure.com, model whisper.

Dead end #1: the missing path

The first bug was mine: the base URL had no path. Riffado's OpenAI client appends /audio/transcriptions to whatever you give it, so requests were hitting https://<resource>…/audio/transcriptions, a path that doesn't exist on the resource at all.
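To make the missing-path bug concrete, here is a minimal Python sketch of the append-the-suffix behaviour described above. The `build_endpoint` helper and the `example-resource` host are hypothetical stand-ins, not Riffado's actual code; the sketch only assumes that the client concatenates the base URL and the endpoint suffix.

```python
def build_endpoint(base_url: str, suffix: str = "/audio/transcriptions") -> str:
    """Join a base URL and an endpoint suffix, as an OpenAI-style client would."""
    return base_url.rstrip("/") + suffix


HOST = "https://example-resource.services.ai.azure.com"  # placeholder host

# Base URL with no path: the request lands on a route the resource does not serve.
print(build_endpoint(HOST))

# Base URL ending in /openai/v1: the request reaches the v1 route.
print(build_endpoint(HOST + "/openai/v1"))
```

The first URL has nothing between the host and /audio/transcriptions, which matches the 404 Resource not found from the symptom.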
Fixing the base URL to end in /openai/v1 got us to a more interesting error:

POST /openai/v1/audio/transcriptions · model=whisper

    {"error":{"code":"DeploymentNotFound","message":"The API deployment for this resource does not exist. If you created the deployment within the last 5 minutes, please wait a moment and try again."}}

Dead end #2: catalog ≠ deployment

Worth checking before anything else: selecting a model in the Foundry catalog is not deploying it. GET /openai/v1/models lists everything you could deploy; only Deployments → Deploy model creates an endpoint that answers. If you get DeploymentNotFound, first confirm a deployment actually exists (the listing below requires only the API key):

enumerate real deployments (classic control-plane, key auth)

    curl -s -H "api-key: $KEY" \
      "https://<resource>.openai.azure.com/openai/deployments?api-version=2023-03-15-preview"
    # → {"data":[{"id":"whisper","model":"whisper","status":"succeeded",…}]}

The actual cause

Here is the part that nearly drove me mad: the deployment existed and was succeeded, yet the v1 route still said DeploymentNotFound. Because Whisper deployments are not served on the v1 route at all. They only answer on the classic path. Verified side by side with the same tiny WAV file:

| Request | Result |
| --- | --- |
| POST /openai/v1/audio/transcriptions · model=whisper · Bearer | 404 DeploymentNotFound |
| POST /openai/deployments/whisper/audio/transcriptions?api-version=2024-06-01 · Bearer | 200 {"text":"you"} |
| Same classic path, without ?api-version= | 404 Resource not found |

Three constraints, then: Whisper needs the classic path; the classic path needs api-version; Riffado can send neither. One piece of good news hiding in the table: the classic route accepts Authorization: Bearer, not just Azure's api-key header, so the shim doesn't have to touch auth at all.

The fix: a Caddy shim

Drop a stock caddy:2-alpine container into the Compose network.
Riffado points at it as if it were OpenAI; the shim rewrites the path, injects api-version, and proxies to Azure. The Bearer header passes through untouched.

azure-shim.Caddyfile

    {
        admin off
        auto_https off
    }

    :80 {
        @transcribe path /v1/audio/transcriptions /audio/transcriptions
        handle @transcribe {
            rewrite * /openai/deployments/whisper/audio/transcriptions?api-version=2024-06-01
            reverse_proxy https://<resource>.services.ai.azure.com {
                header_up Host <resource>.services.ai.azure.com
            }
        }
        handle {
            respond "azure-shim ok" 200
        }
    }

docker-compose.yml (added service)

    azure-shim:
      image: caddy:2-alpine
      restart: unless-stopped
      volumes:
        - ./azure-shim.Caddyfile:/etc/caddy/Caddyfile:ro

Riffado's provider settings become:

| Field | Value |
| --- | --- |
| Base URL | http://azure-shim/v1 |
| Model | whisper (must equal the deployment name) |
| API key | the Azure resource key (forwarded as Bearer) |

Verified

From inside the Riffado container: POST http://azure-shim/v1/audio/transcriptions → 200 {"text":"…"}. Transcription works end-to-end in the UI.

Part 2 · Summaries & titles
o3-mini and the empty answer

Symptom

The summary button showed "An unexpected error occurred." The container logs were more honest:

riffado-app logs

    Error generating title: TypeError: undefined is not an object (evaluating 'C.choices[0]')

Riffado calls chat/completions and reads choices[0] without checking whether the response was an error. So anything the API refuses becomes "an unexpected error." What was it refusing?

Cause 1: reasoning models reject the classic knobs

o3-mini belongs to Azure/OpenAI's o-series reasoning models, which hard-reject parameters every classic chat client sends. Riffado sends temperature: 0.7 and max_tokens: 50 for titles (0.5 / 2000 for summaries), and o3-mini answers:

POST /openai/v1/chat/completions · model=o3-mini

    HTTP 400
    {"error":{"message":"Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.", …}}

    # and with max_tokens fixed:
    HTTP 400
    {"error":{"message":"Unsupported parameter: 'temperature' is not supported with this model.", …}}

Cause 2: reasoning tokens starve the output

Stripping the bad params gets you to 200, and then comes a subtler failure, my personal favourite of this whole saga. Reasoning models spend completion tokens on internal "thinking" before emitting a single visible character. Riffado's 50-token title budget is consumed entirely by reasoning, and the reply comes back syntactically valid and empty:

| max_completion_tokens | reasoning_effort | finish_reason | content |
| --- | --- | --- | --- |
| 50 | not set | length | "" (all 50 spent reasoning) |
| 2000 | not set | stop | "Q3 Budget Planning Strategy Meeting" |
| 2000 | low | stop | same, less reasoning overhead |

The fix: a Node shim that rewrites the request body

Caddy can rewrite paths but not JSON bodies, so this shim is ~60 lines of dependency-free Node on node:20-alpine. Per request it: converts max_tokens → max_completion_tokens, strips temperature / top_p / penalties, floors the token budget at 4000, sets reasoning_effort: "low", maps /v1/* → /openai/v1/*, and forwards to the Azure resource.

o3-shim.js

    const http = require('http');
    const https = require('https');

    const UPSTREAM_HOST = '<resource>.services.ai.azure.com';

    // Params o-series reasoning models reject on chat/completions.
    const STRIP = ['temperature','top_p','presence_penalty',
                   'frequency_penalty','logprobs','top_logprobs'];

    const server = http.createServer((req, res) => {
      const chunks = [];
      req.on('data', c => chunks.push(c));
      req.on('end', () => {
        let body = Buffer.concat(chunks);

        // Riffado's base_url is http://o3-shim/v1 → map to Azure's /openai/v1
        let path = req.url;
        if (path.startsWith('/v1/')) path = '/openai' + path;

        const ct = (req.headers['content-type'] || '').toLowerCase();
        if (ct.includes('application/json') && body.length) {
          try {
            const j = JSON.parse(body.toString('utf8'));
            if (j && typeof j === 'object' && !Array.isArray(j)) {
              if ('max_tokens' in j) {
                if (!('max_completion_tokens' in j)) j.max_completion_tokens = j.max_tokens;
                delete j.max_tokens;
              }
              // Reasoning spends tokens before any visible output; small
              // budgets (Riffado sends 50 for titles) return empty strings.
              if (Array.isArray(j.messages)) {
                j.max_completion_tokens = Math.max(Number(j.max_completion_tokens) || 0, 4000);
                if (!('reasoning_effort' in j)) j.reasoning_effort = 'low';
              }
              for (const k of STRIP) delete j[k];
              body = Buffer.from(JSON.stringify(j));
            }
          } catch (_) { /* not JSON - forward untouched */ }
        }

        const headers = { ...req.headers, host: UPSTREAM_HOST,
                          'content-length': Buffer.byteLength(body) };
        const up = https.request(
          { host: UPSTREAM_HOST, port: 443, method: req.method, path, headers },
          upRes => { res.writeHead(upRes.statusCode, upRes.headers); upRes.pipe(res); }
        );
        up.on('error', e => {
          res.writeHead(502, {'content-type':'application/json'});
          res.end(JSON.stringify({error:{message:'o3-shim upstream error: '+e.message}}));
        });
        up.end(body);
      });
    });

    server.listen(80, () => console.log('o3-shim listening on :80'));

docker-compose.yml (added service)

    o3-shim:
      image: node:20-alpine
      restart: unless-stopped
      working_dir: /app
      command: ["node", "/app/o3-shim.js"]
      volumes:
        - ./o3-shim.js:/app/o3-shim.js:ro

Add a second provider in Riffado (base URL http://o3-shim/v1, model o3-mini, the resource's API key) and set it as the default enhancement provider (summaries/titles), keeping the Whisper one as default for transcription.

Riffado's exact title request (temperature: 0.7, max_tokens: 50) through the shim → 200, finish_reason: stop, real title text. A full meeting-transcript summary returns structured key points and action items.

The final shape

Reading it left to right: Riffado never talks to Azure directly. Transcription requests pass through azure-shim, a stock Caddy container that rewrites each request onto Whisper's classic deployment path and injects the mandatory api-version parameter. Summary and title requests pass through o3-shim, a tiny Node server that rewrites the request body into the shape o3-mini accepts and floors the token budget so the model's internal reasoning cannot starve the actual answer. As far as Riffado is concerned, it is simply talking to two ordinary OpenAI providers. Both shims live on the Compose network only; nothing is exposed publicly. Riffado is unmodified.

Verification checklist

Each layer, testable in isolation. Run these before blaming the app:

smoke tests

    # 1. Key + resource alive? (v1 models listing, Bearer auth)
    curl -s -H "Authorization: Bearer $KEY" \
      https://<resource>.services.ai.azure.com/openai/v1/models | head -c 200

    # 2. Whisper answers on the classic path?
    curl -s -H "Authorization: Bearer $KEY" -F file=@test.wav \
      "https://<resource>.services.ai.azure.com/openai/deployments/whisper/audio/transcriptions?api-version=2024-06-01"

    # 3. Shim translates correctly? (from inside the compose network)
    docker exec riffado-app node -e "fetch('http://azure-shim/')
      .then(r=>r.text()).then(console.log)"

    # 4. o3-mini via shim, sending the params Riffado sends?
    #    (temperature + max_tokens:50; the shim must absorb both)

If you'd rather not run shims

Both shims exist because of the specific models chosen.
Pick models that live natively on the v1 route and Riffado connects directly, with base URL https://<resource>.services.ai.azure.com/openai/v1 and zero extra containers: Transcription: deploy gpt-4o-mini-transcribe (or gpt-4o-transcribe) instead of Whisper. Summaries: deploy a non-reasoning chat model such as gpt-4o-mini, which happily accepts temperature and max_tokens. The shim approach earns its keep when you're standardized on specific models (Whisper's transcription quality, o3-mini's reasoning), or when you want a control point to add logging, retries, or budget caps later. For reference, this is what the finished setup looks like on Riffado's side. Each shim is registered as a plain Custom provider. Here is the whisper provider pointing at azure-shim, with Use for transcription ticked: And once both are saved, they sit side by side in the providers list, whisper tagged for transcription and o3-mini tagged for enhancement: A quick look at the Foundry portal In the Microsoft Foundry portal, head over to Models > AI Services and you will find a pleasant surprise: fifteen AI service models already deployed and ready to use, covering the Azure Speech family (including Voice Live and Speech to Text), Azure Translator, Azure Language, and Content Understanding: You can of course deploy another model for this, but the pre-deployed ones are a handy cost-saving option. Click on the Azure Speech – Voice Live radio button and you will be shown the Base URL and API Key, which you can then paste into the provider settings on Riffado's Settings page. A quick note on cost: these services are not free. They are billed pay-as-you-go based on usage. Azure Speech transcription is charged per audio hour, and Voice Live pricing is tiered by the model you choose. The free tier does include a monthly allowance, though. Check the Azure Speech pricing page before committing. 
And if you would rather deploy a dedicated transcription model such as whisper, Foundry gives you the flexibility to do just that. Open the model page in the catalogue, click Deploy, and go with Default settings unless you need custom quotas or guardrails: Let's test the setup On your Plaud device, just tap to start recording. The little LED bars light up to show it is listening: Or skip the device entirely and upload an audio file straight into Riffado using the Upload Audio button. Either way, the recording lands on the Recordings page; hit Transcribe and let the spinner do its thing: As you can see below, whisper, the transcription model we deployed earlier, even managed to transcribe a recording in Malay without a hitch. My 3:32 test clip came back as 186 words of clean Malay, with the language correctly detected and tagged: I have also set o3-mini as the enhancement provider, and it enhanced the transcription with a proper summary, key points, and title as well! The Meeting Notes-style summary came straight out of o3-mini through the shim, with zero manual prompting. Wrapping up What started as a TikTok-fuelled impulse buy nearly killed off by subscription pricing ended up as a fully self-hosted pipeline: Plaud for recording, Riffado as the interface, and Microsoft Foundry serving whisper and o3-mini behind two tiny shims. The total extra infrastructure came to two containers and roughly sixty lines of code, and not a single monthly subscription in sight. If you try this setup and run into a failure mode I have not covered here, do share it in the comments. Half the fun is in the debugging.
suzarilshah
Jul 22, 2026 Place Educator Developer Blog
Microsoft Agent Framework Multi-Agent Workflow Architecture for Automated Kubernetes Assessments
Why this system generates tests (design rationale)

This project does not generate tests just to "check code." It generates tests because, in a Kubernetes learning game, the test suite is the grading contract.

The design goals are:

- Scale content creation: instructors should not hand-author every task and checker.
- Keep grading objective: student success is measured against Kubernetes API state, not subjective review.
- Avoid fragile tasks: generated tasks must survive empty/wrong cluster states without crashing.
- Make failures repairable: when checks break, the system should patch and re-validate automatically.

So the pipeline generates a full task package (setup, answer, check, cleanup) where tests define exactly what "correct" means.

Lifecycle: from concept to production-ready grader

Reader mental model: this is a content compiler with validation stages, not a single chat completion.

Phase 1: Pedagogical intent -> structured concept

The Idea Agent converts a topic into a constrained concept object (objective, progression, task IDs, difficulty variants). Memory rules block duplicates and previously failed concepts.

Phase 2: Concept -> executable grading package

The Generator Agent turns that concept into files:

- student instructions (instruction.md)
- learning material (concept.md)
- parameter source (session.json)
- setup and answer manifests (setup.template.yaml, answer.template.yaml)
- deterministic pytest flow (test_01 ... test_06)

At this point, output is still untrusted draft content.

Phase 3: Structural correctness gate

Deterministic validation checks file presence, syntax, JSON/YAML shape, and template correctness. This catches basic integrity issues before cluster execution.

Phase 4: Behavioral correctness gate (real cluster)

Pytest executes against Kubernetes and verifies runtime behavior using real kubectl-derived state. This proves that generated checks actually evaluate cluster resources as intended.
Phase 5: Anti-false-positive gate (skip-answer mode)

The same suite runs with SKIP_ANSWER_TESTS=True to verify grader integrity:

- answer deployment is skipped
- test_05_check.py must fail

If it still passes, the grader is invalid (it would accept wrong student submissions).

Phase 6: Self-healing repair loop

On any failure, deterministic error logs are fed into the Fixer Agent, which patches only broken files. The workflow then re-enters validation + test gates.

Phase 7: Finalization

- If all gates pass -> task is kept as production-ready content.
- If retries are exhausted -> task is moved to unsuccessful/ with FAILURE_REPORT.txt for human triage.

This lifecycle explains the core architecture decision: LLMs generate candidate graders, deterministic execution certifies them.

Concrete generated sample (what the pipeline actually produces)

Below is a representative generated task for topic: ConfigMap Environment Variable Injection.

Generated directory layout

    tests/game01/050_configmap_env_injection/
    ├── __init__.py
    ├── instruction.md
    ├── concept.md
    ├── session.json
    ├── setup.template.yaml
    ├── answer.template.yaml
    ├── test_01_setup.py
    ├── test_02_ready.py
    ├── test_03_answer.py
    ├── test_05_check.py
    └── test_06_cleanup.py

session.json (runtime variables)

    {
      "namespace": "{{random_name()}}{{random_number(100,999)}}{{student_id()}}",
      "configmap_name": "{{random_name()}}",
      "deployment_name": "{{random_name()}}",
      "container_name": "app",
      "env_key": "APP_MODE",
      "env_value": "production"
    }

Why this exists: task values are randomized per student/session, so tests verify behavior by variable contract instead of hardcoded names.

setup.template.yaml (baseline state only)

    apiVersion: v1
    kind: Namespace
    metadata:
      name: {{ namespace }}

Why this exists: setup should create prerequisites only. It must not accidentally include the final answer.
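To illustrate how the placeholders in session.json could turn into per-student values, here is a hedged Python sketch. The article does not show the real renderer, so everything here is an assumption: the `render_value` function, the eight-letter random name, and the meaning given to `random_name()`, `random_number(low, high)` and `student_id()` are guesses made only to show the variable-contract idea.

```python
import random
import re

# Matches calls such as {{random_name()}} or {{random_number(100,999)}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\((.*?)\)\s*\}\}")


def render_value(template: str, student_id: str, rng: random.Random) -> str:
    """Resolve placeholder calls in one session.json value (assumed semantics)."""

    def resolve(match):
        name = match.group(1)
        args = [a.strip() for a in match.group(2).split(",") if a.strip()]
        if name == "random_name":
            # Assumption: a short lowercase name, valid as a Kubernetes resource name.
            return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(8))
        if name == "random_number":
            return str(rng.randint(int(args[0]), int(args[1])))
        if name == "student_id":
            return student_id
        raise ValueError(f"unknown placeholder function: {name}")

    return PLACEHOLDER.sub(resolve, template)


rng = random.Random(0)  # seeded only so the sketch is repeatable
namespace = render_value(
    "{{random_name()}}{{random_number(100,999)}}{{student_id()}}", "s42", rng
)
print(namespace)
```

Values without placeholders, such as "app" or "production", pass through unchanged, which is what lets the tests read every name from the same rendered dictionary.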
answer.template.yaml (expected correct solution)

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: {{ configmap_name }}
      namespace: {{ namespace }}
    data:
      {{ env_key }}: "{{ env_value }}"
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: {{ deployment_name }}
      namespace: {{ namespace }}
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: env-demo
      template:
        metadata:
          labels:
            app: env-demo
        spec:
          containers:
            - name: {{ container_name }}
              image: nginx:latest
              env:
                - name: {{ env_key }}
                  valueFrom:
                    configMapKeyRef:
                      name: {{ configmap_name }}
                      key: {{ env_key }}

Why this exists: defines canonical "correct cluster state" that graders must detect.

test_02_ready.py (wait for setup resources)

    import json
    import time
    from tests.helper.kubectrl_helper import build_kube_config, run_kubectl_command


    class TestReady:
        def test_001_namespace_active(self, json_input):
            kube_config = build_kube_config(
                json_input["cert_file"], json_input["key_file"], json_input["host"]
            )
            time.sleep(2)
            result = run_kubectl_command(
                kube_config,
                f"kubectl get namespace {json_input['namespace']} -o json",
            )
            data = json.loads(result)
            assert data.get("status", {}).get("phase") == "Active"

Why this matters: validates setup-stage readiness only. It should not check answer resources yet.
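The sample above waits a fixed two seconds, while the generator prompt quoted further down asks for polling loops with a 60s timeout and 15s interval. A minimal sketch of such a loop follows; `poll_until` is a hypothetical helper, not part of the repository, and the injectable `sleep` and `clock` parameters exist only so the loop can be exercised without really waiting.

```python
import time
from typing import Callable


def poll_until(
    check: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() until it returns True or the timeout elapses.

    Exceptions raised by check() are treated as "not ready yet", so a
    missing resource or unparsable kubectl output does not crash the test.
    """
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # transient failure: keep polling until the deadline
        if clock() >= deadline:
            return False
        sleep(interval)
```

In a readiness test, `check` would wrap the kubectl call and the `phase == "Active"` comparison, and the test would assert on the boolean that `poll_until` returns.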
test_05_check.py (student grading contract)

    import json
    from tests.helper.kubectrl_helper import build_kube_config, run_kubectl_command


    class TestCheck:
        def test_001_configmap_key_exists(self, json_input):
            kube_config = build_kube_config(
                json_input["cert_file"], json_input["key_file"], json_input["host"]
            )
            result = run_kubectl_command(
                kube_config,
                f"kubectl get configmap {json_input['configmap_name']} -n {json_input['namespace']} -o json",
            )
            data = json.loads(result)
            assert data["data"][json_input["env_key"]] == json_input["env_value"]

        def test_002_deployment_uses_configmap_env(self, json_input):
            kube_config = build_kube_config(
                json_input["cert_file"], json_input["key_file"], json_input["host"]
            )
            result = run_kubectl_command(
                kube_config,
                f"kubectl get deployment {json_input['deployment_name']} -n {json_input['namespace']} -o json",
            )
            data = json.loads(result)
            env = data["spec"]["template"]["spec"]["containers"][0].get("env", [])
            matched = [
                e for e in env
                if e.get("name") == json_input["env_key"]
                and e.get("valueFrom", {}).get("configMapKeyRef", {}).get("name") == json_input["configmap_name"]
                and e.get("valueFrom", {}).get("configMapKeyRef", {}).get("key") == json_input["env_key"]
            ]
            assert matched, "Deployment container must consume env var from ConfigMap key"

Why this matters: this is the real grading logic. If a student deploys wrong resource wiring, this test fails with explicit reason.

Why skip-answer validation is essential for this sample

When SKIP_ANSWER_TESTS=True, answer deployment is skipped. In this mode:

- test_03_answer.py should be skipped
- test_05_check.py must fail (ConfigMap/Deployment wiring is absent)

If test_05_check.py still passes, the grader is broken (false positive), and the workflow routes to Fixer.
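One way to automate the skip-answer verdict is to read the JUnit XML report that pytest can write with `--junitxml`. The sketch below is an assumption-laden illustration, not the repository's executor: `grader_is_sound` is a hypothetical name, and it treats "must fail" as "at least one test case in test_05_check reports a failure".

```python
import xml.etree.ElementTree as ET


def grader_is_sound(junit_xml: str) -> bool:
    """Return True when a skip-answer run shows the grader rejecting an unsolved task.

    Expected shape of a healthy run:
      - at least one test_03_answer case is marked <skipped>
      - at least one test_05_check case carries a <failure>
    """
    root = ET.fromstring(junit_xml)
    answer_skipped = False
    check_failed = False

    for case in root.iter("testcase"):
        # pytest stores the dotted module path plus class name in "classname".
        module = case.get("classname", "")
        if "test_03_answer" in module and case.find("skipped") is not None:
            answer_skipped = True
        if "test_05_check" in module and case.find("failure") is not None:
            check_failed = True

    return answer_skipped and check_failed
```

A report where every test_05_check case passes returns False, which is the false-positive condition that should route the task to the Fixer.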
Rule Builder Workflow Flowchart

Multi-graph architecture views

1) Control-plane graph (orchestration DAG)
2) Runtime sequence (who calls what)
3) State machine (task lifecycle)
4) Prompt lifecycle graph (how prompts evolve)
5) MCP + Kubernetes execution boundary graph

Core Agent Framework Primitives Used

This repository is a practical example of Agent Framework as a graph orchestrator, not just an agent wrapper.

- WorkflowBuilder builds a typed DAG with explicit edges.
- @executor functions implement deterministic nodes (validation, pytest, decisions, routing prep).
- AgentExecutor wraps LLM agents so they behave as graph nodes.
- WorkflowContext shared state carries typed data and retry metadata between nodes.
- add_multi_selection_edge_group(...) + selector functions enforce conditional routing.
- MCPStdioTool connects filesystem MCP tools into agents for controlled file I/O.

Production graph construction (from workflow/builder.py) looks like this:

    workflow = (
        WorkflowBuilder(start_executor=initialize_retry)
        .add_edge(initialize_retry, generator_executor)
        .add_edge(generator_executor, parse_generated_task)
        .add_edge(parse_generated_task, run_validation)
        .add_edge(run_validation, run_pytest)
        .add_edge(run_pytest, make_decision)
        .add_multi_selection_edge_group(
            make_decision,
            [keep_task, remove_task],
            selection_func=select_action,
        )
        .add_edge(keep_task, run_pytest_skip_answer)
        .add_multi_selection_edge_group(
            run_pytest_skip_answer,
            [check_loop, complete_workflow],
            selection_func=select_skip_answer_action,
        )
        .add_multi_selection_edge_group(
            check_loop,
            [fix_task, complete_workflow],
            selection_func=select_loop_action,
        )
        .add_edge(fix_task, fixer_executor)
        .add_edge(fixer_executor, parse_generated_task)
        .build()
    )

This is the architectural heart of the system: agents and deterministic executors are first-class nodes in the same graph.

Detailed Node-by-Node Mechanics

1.
Idea Agent (🧠): Concept Synthesis with Memory Constraints The Idea Agent (agents/k8s_task_idea_agent.py) generates a structured concept with three difficulty variations (BEGINNER/INTERMEDIATE/ADVANCED). It is memory-aware: task_ideas_memory.json tracks successful concepts. task_ideas_failure_memory.json tracks concepts that failed downstream. Memory constraints are injected using AgentMiddleware (system-level prompt injection) to avoid duplicate or previously failed concepts. For Responses-only models, the agent switches to a tool-call contract (save_k8s_task_concept) instead of structured response formatting. 2. Generator Agent (⚙️): MCP-Backed File Authoring The Generator Agent receives a strict prompt and writes task files through MCP filesystem tools. Key framework details: Built through chat_client.as_agent(...). MCP tool attached via tools=mcp_tool. Function-call execution observability added with LoggingFunctionMiddleware. Uses absolute-path-only policy in instructions to prevent path drift. 3. Deterministic Validation + Test (✅): Non-LLM Gates After generation, the graph moves through deterministic executors: run_validation calls pure Python checks (k8s_task_validator). run_pytest executes pytest --import-mode=importlib --rootdir=. .... Raw pytest output is persisted in workflow state for later fixing. This is critical: no LLM is asked whether code is correct. 4. Skip-Answer Test (🧪): Grader Correctness Gate Even if standard tests pass, the workflow enforces a second tier: SKIP_ANSWER_TESTS=True pytest --import-mode=importlib --rootdir=. Implementation detail: the executor writes JUnit XML, parses it, and asserts that: test_03_answer.py is skipped test_05_check.py fails as expected If test_05_check.py does not fail, the task is treated as invalid and sent back to retry/fix. 5. 
Fixer Agent + Retry Loop (🔧): Bounded Self-Healing On failure, fix_task builds a targeted prompt containing: failure reasons from deterministic nodes full captured pytest output explicit rule to patch only broken files in place The Fixer Agent runs through AgentExecutor, writes patches via MCP, and the graph loops back to parse_generated_task. Retries are stateful (retry_count, max_retries) and hard-bounded. On exhaustion, complete_workflow moves the task to unsuccessful/<game>/ and writes FAILURE_REPORT.txt. Agent Prompt Design (The Part That Makes It Work) If you want to understand why this pipeline works, you need to inspect prompts as operational contracts, not generic instructions. Real Idea Agent Prompt (from code) IDEA_AGENT_INSTRUCTIONS = ( "You are a Kubernetes task idea generator that creates detailed task concepts with three difficulty variations. " "Read official K8s documentation and propose comprehensive learning concepts for a Kubernetes game. " "\n\nYour task:\n" "1. Choose ONE Kubernetes concept not yet covered (check context for existing concepts)\n" "2. Generate exactly 3 variations: BEGINNER, INTERMEDIATE, and ADVANCED\n" "3. Use 3-digit task IDs (001-999) in format: XXX_concept_name_level (e.g., 041_secrets_basic)\n" "4. Each variation should build on the previous one with increasing complexity\n" "5. Include practical, hands-on scenarios covering: Workloads, Services, Storage, Configuration, Security, Scheduling, Policies\n" "\nProvide the concept, tags, description, and 3 variations with task_id, difficulty, title, objective, key_skills, and estimated_time." ) Responses-only models use a stricter tool-call version: IDEA_AGENT_INSTRUCTIONS_TOOL_CALL = ( IDEA_AGENT_INSTRUCTIONS + "\n\n" "**CRITICAL**: You MUST call the save_k8s_task_concept tool to save your generated concept.\n" "...\n" "Always call save_k8s_task_concept with your generated concept." 
) Idea Agent Prompt Contract The Idea Agent prompt enforces: one concept per run exactly three difficulty variations strict task ID format (XXX_concept_name_level) practical skill progression Core pattern: You are a Kubernetes task idea generator... 1. Choose ONE Kubernetes concept not yet covered 2. Generate exactly 3 variations: BEGINNER, INTERMEDIATE, ADVANCED 3. Use 3-digit task IDs in format XXX_concept_name_level ... It is strengthened by runtime memory injection: previously generated concepts are blocked previously failed concepts are blocked For Responses-only models, the contract becomes tool-driven: CRITICAL: You MUST call the save_k8s_task_concept tool... This reduces ambiguity in output structure and makes downstream parsing deterministic. Real Generator Agent Prompt (from code) def _get_generator_instructions(): return ( "You are a Kubernetes task generator with filesystem tools.\n" f"The MCP filesystem is rooted at: {PATHS.tests_root.parent}\n" f"You MUST use ABSOLUTE paths for ALL file operations.\n" f"Task directory: {PATHS.game_root}/XXX_task_name/\n" "...\n" "CRITICAL: test_02_ready.py checks resources from setup.template.yaml, NOT answer.template.yaml.\n" "...\n" "MUST use polling loops (60s timeout, 15s interval)\n" "MUST use try/except and safe .get() JSON access\n" ) The generator prompt is long on purpose: it encodes path correctness, file schema, YAML/Jinja structure, and testing strategy in a single deterministic contract. Generator Agent Prompt Contract The Generator prompt is intentionally long and prescriptive because it defines filesystem safety and grading correctness requirements. 
Key constraints encoded in the prompt: Absolute path writes only (prevents writing to wrong workspace paths) No directory creation (directory is pre-created by executor) Required file set (instruction.md, concept.md, session.json, templates, tests) test-flow invariants: test_01_setup.py deploys setup test_02_ready.py checks setup resources only test_03_answer.py deploys answer test_05_check.py validates final solution robust test coding style: polling loops, try/except, .get()-based JSON parsing, explicit debug output Example contract fragment: CRITICAL PATH RULES: ✅ CORRECT: /abs/path/tests/gameXX/050_task/file.py ❌ WRONG: tests/gameXX/050_task/file.py (relative) CRITICAL: test_02_ready.py checks resources from setup.template.yaml, NOT answer.template.yaml. This is why generation quality is high before the Fixer loop even starts. Real Runtime Retry Prompt Builder (from code) def _build_retry_generation_prompt(combined: CombinedValidationResult) -> str: task_id = combined.test.task_id failure_reasons = _build_failure_reasons(combined) return ( f"Generate a complete Kubernetes learning task with ID '{task_id}' about '{combined.target_topic}'. " f"This is retry attempt {combined.retry_count + 1} of {combined.max_retries}. " f"\n\n⚠️ PREVIOUS ATTEMPT FAILED:" f"\n{chr(10).join([f' - {reason}' for reason in failure_reasons])}" f"\n\nIMPORTANT: You MUST use the exact task ID '{task_id}' - do not generate a new ID." f"\n\n✅ Directory already exists: {PATHS.game_root}/{task_id}/" f"\nWrite all files directly into this directory. Do NOT call create_directory." "..." ) This means retries are not generic retries; they are failure-conditioned retries with precise constraints. Fixer Agent Prompt Contract The Fixer prompt is a repair protocol, not a regeneration prompt. 
It includes: exact failure reasons from deterministic validators raw pytest output instruction to read current task files first strict directive to patch only broken files Core behavior constraints: DO NOT rewrite all files. Make TARGETED FIXES to ONLY the broken files. Use ABSOLUTE paths for all file operations. This keeps retries cheap, preserves working artifacts, and improves convergence speed. Real Runtime Fix Prompt Builder (from code) def _build_fix_prompt(combined: CombinedValidationResult, raw_test_output: str) -> str: task_id = combined.test.task_id failure_reasons = _build_failure_reasons(combined) prompt = ( f"Fix the failed Kubernetes task '{task_id}' located in '{PATHS.game_root}/{task_id}/'." f"\n\nThis is fix attempt {combined.retry_count + 1} of {combined.max_retries}." f"\n\n⚠️ TASK FAILED WITH THESE ERRORS:" f"\n{chr(10).join([f' - {reason}' for reason in failure_reasons])}" ) if raw_test_output: prompt += f"\n\n📋 FULL TEST OUTPUT:\n```\n{raw_test_output}\n```" prompt += ( f"\n\n🔍 YOUR TASK:" f"\n1. READ all files from '{PATHS.game_root}/{task_id}/'" f"\n6. Make TARGETED FIXES to ONLY the broken files" f"\n7. WRITE ONLY the fixed files back" f"\n\n⚠️ CRITICAL: DO NOT rewrite all files! Only fix the broken ones!" ) return prompt How Prompt Output Enters the Agent Framework Graph The prompt builders above are used by deterministic executors and sent to agent nodes through AgentExecutorRequest: await ctx.send_message( AgentExecutorRequest( messages=[Message(role="user", contents=[fix_prompt])], should_respond=True ) ) So prompt generation and graph routing are tightly coupled: each route transition emits a specific prompt payload into the next LLM node. Runtime-Constructed Prompts in Executors The most important prompts are built dynamically in workflow executors: _build_retry_generation_prompt(...) _build_fix_prompt(...) 
These functions inject live context: retry_count / max_retries concept + objective metadata validation/test failure reasons full captured test logs So each retry is context-rich and specific, not another blind generation attempt. Prompt + Middleware + Deterministic Gates = Reliability In this repository, reliability does not come from prompt text alone. It comes from three layers working together: Prompt contracts constrain agent behavior. Middleware injects memory and logs tool invocations. Deterministic executors enforce objective pass/fail gates. That combination is why the workflow remains auditable and predictable even when LLM outputs vary. Agent Framework Execution Model in This Repo Strongly-Typed Message Passing workflow/models.py defines transport models used between nodes: ValidationResult and TestResult (Pydantic) CombinedValidationResult (dataclass with should_keep and should_retry) InitialWorkflowState (seed payload for each run) This keeps node contracts explicit and simplifies selector logic. Fail-Fast Shared State Management Executors use ctx.get_state(...) with a sentinel (_MISSING) and raise explicit exceptions if required state is absent. This prevents hidden fallback behavior and catches graph/data wiring errors early. Conditional Routing with Selectors Selectors (workflow/selectors.py) encode graph decisions: select_action → keep vs remove select_skip_answer_action → complete vs loop select_loop_action → fix vs complete This separates decision policy from executor implementation. Streaming Workflow Runtime workflow.run(initial_state, stream=True) emits output events incrementally. The runner (workflow/runner.py) consumes these events to detect successful completions and update concept memory accordingly. 
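The fail-fast lookup described under "Fail-Fast Shared State Management" can be sketched roughly as below. This is a minimal illustration, not the repository's code: `FakeContext`, `MissingStateError`, and `require_state` are hypothetical names, and the real workflow context may expose a different (possibly async) `get_state` signature.

```python
from typing import Any

# Sentinel that distinguishes "key never written" from a stored value of None.
_MISSING = object()


class MissingStateError(RuntimeError):
    """Raised when an executor needs state that was never written."""


class FakeContext:
    """Hypothetical stand-in for the workflow context (state get/set only)."""

    def __init__(self) -> None:
        self._state: dict[str, Any] = {}

    def set_state(self, key: str, value: Any) -> None:
        self._state[key] = value

    def get_state(self, key: str, default: Any = None) -> Any:
        return self._state.get(key, default)


def require_state(ctx: FakeContext, key: str) -> Any:
    """Return the stored value, or raise instead of silently falling back."""
    value = ctx.get_state(key, _MISSING)
    if value is _MISSING:
        raise MissingStateError(f"required workflow state {key!r} is missing")
    return value
```

Using a dedicated sentinel rather than `None` means a legitimately stored `None` still counts as present, while a wiring mistake (an executor reading a key no upstream node wrote) surfaces immediately as an exception.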
Agent Construction and API Selection Strategy The repository uses Azure CLI auth (AzureCliCredential) and dynamically selects API mode by deployment name (agents/config.py): Chat Completions path: OpenAIChatCompletionClient Responses-only model path: OpenAIChatClient or custom ResponsesAgent Why this matters: some codex-class deployments are Responses-only, so the architecture supports both without changing workflow logic. How MCP Actually Controls Kubernetes (Important Distinction) In this repo, MCP is used for filesystem control; Kubernetes control is done through kubectl tools. 1) MCP server role: controlled file I/O The workflow starts MCP stdio servers (official filesystem server) and mounts them into agents: docs_mcp_tool = MCPStdioTool( name="filesystem_docs", command="npx", args=["-y", "@modelcontextprotocol/server-filesystem", str(PATHS.k8s_docs_root)], load_prompts=False, ) tests_mcp_tool = MCPStdioTool( name="filesystem_tests", command="npx", args=["-y", "@modelcontextprotocol/server-filesystem", str(PATHS.tests_root.parent)], load_prompts=False, ) Those MCP tools are passed into Generator/Fixer agents, which then call MCP file functions (read/write/list) inside allowed roots only. 
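To make "inside allowed roots only" concrete, here is a small sketch of the kind of containment check a filesystem tool can apply before touching a path. It is an illustration of the idea only; the function name and the example paths are assumptions, and this is not the implementation used by the MCP filesystem server.

```python
from pathlib import Path


def is_within_allowed_roots(candidate: str, allowed_roots: list[str]) -> bool:
    """Return True only if the resolved candidate lies under an allowed root."""
    # resolve() normalizes ".." segments and symlinks, so traversal tricks
    # such as "/root/../etc" are compared by their real location.
    resolved = Path(candidate).resolve()
    for root in allowed_roots:
        root_path = Path(root).resolve()
        if resolved == root_path or root_path in resolved.parents:
            return True
    return False
```

Comparing resolved path components (rather than raw string prefixes) avoids treating a sibling directory such as `/demo/tests2` as if it were inside `/demo/tests`.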
2) Kubernetes cluster control role: kubectl execution tool Cluster actions are not performed by MCP filesystem server; they are performed by a dedicated function tool: def run_kubectl_command(command: str) -> str: kubeconfig_path = os.environ.get("KUBECONFIG", "/home/developer/.kube/config") cmd_list = ["kubectl"] + command.split() result = subprocess.run( cmd_list, capture_output=True, text=True, check=True, env={**os.environ, "KUBECONFIG": kubeconfig_path}, ) return result.stdout And the Kubernetes agent forces tool usage: agent = responses_client.as_agent( name="KubernetesAgent", instructions="...You MUST use the run_kubectl_command tool...", tools=[run_kubectl_command], default_options={"tool_choice": "required"}, ) So the control plane is: MCP filesystem → manipulate generated task files. kubectl tool → query/mutate real cluster state. deterministic pytest/validator executors → accept or reject results. 3) End-to-end command flow in practice When generated tests run, they execute real kubectl get ... -o json checks in test code, and the deterministic runner captures raw output: pytest_command = f"pytest --import-mode=importlib --rootdir=. {task_with_val.task_directory}/" result = run_pytest_command(pytest_command) ctx.set_state(f"raw_output_{task_with_val.task_id}", raw_output) This means Kubernetes state verification is always grounded in live command output, not model speculation. Should MCP run Kubernetes tests? Short answer: not in this design. Current architecture keeps test execution deterministic and local: pytest is run by run_pytest_command(...) (pure Python subprocess runner) test results are parsed and stored in workflow state retry/fix routing uses those deterministic outputs This is intentional. If test execution were delegated to an LLM-facing MCP command tool, you would lose strict control over execution semantics and error handling. Recommended pattern: Use MCP for file/document access and controlled editing. 
Use deterministic executors for pytest and validation. Use LLM agents only for generation and repair. If you still want MCP-driven test execution, add a separate locked-down command MCP server (only whitelisted pytest/kubectl commands), but keep pass/fail decision logic in deterministic executors. How tests are run in this workflow (with code) The workflow executes tests in deterministic executors, not inside LLM agents. 1) Workflow node calls pytest runner run_pytest executor builds the command and calls the pure Python runner: @executor(id="run_pytest") async def run_pytest(task_with_val: TaskWithValidation, ctx: WorkflowContext[TestResult]) -> None: from agents.pytest_runner import run_pytest_command pytest_command = f"pytest --import-mode=importlib --rootdir=. {task_with_val.task_directory}/" result = run_pytest_command(pytest_command) raw_output = result["details"][0] if result.get("details") else "" ctx.set_state(f"raw_output_{task_with_val.task_id}", raw_output) ... 2) Deterministic subprocess execution The runner normalizes command flags and executes pytest via subprocess: def run_pytest_command(command: str) -> dict[str, Any]: normalized_command = _normalize_pytest_command(command) # adds -s if needed cmd_list = shlex.split(normalized_command) result = subprocess.run( cmd_list, capture_output=True, text=True, check=False, cwd=str(PATHS.pytest_rootdir), ) combined_output = result.stdout + "\n" + result.stderr _save_test_output(normalized_command, combined_output, skip_answer) ... Exit codes are interpreted deterministically: 0 → pass 5 → no tests collected (fail) others → fail with exit code reason 3) Skip-answer validation tier After normal pass, the workflow runs pytest again with SKIP_ANSWER_TESTS=True and parses JUnit XML: os.environ["SKIP_ANSWER_TESTS"] = "True" pytest_command = f"pytest --import-mode=importlib --rootdir=. 
--junitxml={junit_path} {task_dir}/" result = run_pytest_command(pytest_command) test_05_failed, test_03_skipped = _parse_skip_answer_junit(junit_path) The parser checks per-testcase outcomes: if "test_05_check.py" in context and has_failure_or_error: test_05_failed = True if "test_03_answer.py" in context and has_skipped: test_03_skipped = True 4) How failures trigger fix loop If pytest fails (or skip-answer logic fails), failure reasons and raw output are pushed into state, then the Fixer Agent receives a generated fix prompt containing that output: ctx.set_state(f"failure_reasons_{task_id}", reasons) ctx.set_state(f"raw_output_{task_id}", raw_output) fix_prompt = _build_fix_prompt(combined, raw_test_output) await ctx.send_message( AgentExecutorRequest( messages=[Message(role="user", contents=[fix_prompt])], should_respond=True ) ) That is the key loop: deterministic test output drives LLM repair, then deterministic tests re-run. ResponsesAgent Internals (Advanced Agent Framework Pattern) The custom ResponsesAgent (agents/responses_agent.py) demonstrates a lower-level integration pattern: Connect MCP tool lazily. Call Responses API. Parse ResponseFunctionToolCall items. Execute tools (MCP + custom callables). Feed function_call_output back to model. Repeat until final text response. It also runs a middleware chain around tool invocations, preserving observability and consistency with standard agent paths. Why This Architecture Is Robust This design works because Agent Framework is used as a deterministic orchestration layer around probabilistic generation: LLM creativity is constrained by typed state and strict prompts. deterministic executors act as objective quality gates. retries are targeted, bounded, and auditable. failures produce durable forensic artifacts (FAILURE_REPORT.txt + test logs). For Kubernetes education pipelines, this yields high throughput without sacrificing grader reliability. 
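For readers who want to see the skip-answer tier in isolation, the per-testcase JUnit check described above can be approximated as follows. This is a hedged re-creation, not the repository's `_parse_skip_answer_junit`: the function name, the choice to accept XML text instead of a file path, the sample report, and matching on `test_05_check` / `test_03_answer` without the `.py` suffix are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical JUnit report, shaped like pytest's --junitxml output.
SAMPLE_JUNIT = """
<testsuites>
  <testsuite name="pytest" tests="2">
    <testcase classname="050_task.test_03_answer" name="test_deploy_answer">
      <skipped message="SKIP_ANSWER_TESTS is set"/>
    </testcase>
    <testcase classname="050_task.test_05_check" name="test_final_state">
      <failure message="deployment not found">assert False</failure>
    </testcase>
  </testsuite>
</testsuites>
"""


def parse_skip_answer_results(xml_text: str) -> tuple[bool, bool]:
    """Return (test_05_failed, test_03_skipped) from a JUnit XML document."""
    root = ET.fromstring(xml_text)
    test_05_failed = False
    test_03_skipped = False
    for case in root.iter("testcase"):
        # Combine the attributes that may carry the test module name.
        context = " ".join(
            case.get(attr, "") for attr in ("classname", "file", "name")
        )
        has_failure_or_error = (
            case.find("failure") is not None or case.find("error") is not None
        )
        has_skipped = case.find("skipped") is not None
        if "test_05_check" in context and has_failure_or_error:
            test_05_failed = True
        if "test_03_answer" in context and has_skipped:
            test_03_skipped = True
    return test_05_failed, test_03_skipped
```

Reading outcomes from structured XML instead of scraping console text keeps this gate deterministic: the two booleans depend only on which elements are present under each `testcase`.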
GitHub Repo - https://github.com/wongcyrus/k8s-game-rule-builder

About the Author

Cyrus Wong is a senior lecturer at the Hong Kong Institute of Information Technology (HKIIT) @ IVE (Lee Wai Lee), where he focuses on teaching public cloud technologies. He is a passionate advocate for the adoption of cloud technology across various media and events. With his extensive knowledge and expertise, he has earned prestigious recognitions such as AWS AI Hero, Microsoft MVP - Microsoft Foundry, and Google Developer Expert - Cloud (AI).
cyruswong
Jul 21, 2026 Educator Developer Blog
Fine-Tuning and Deploying Phi-3.5 Model with Azure and AI Toolkit
What is Phi-3.5?

Phi-3.5 is a state-of-the-art language model with strong multilingual capabilities. It is designed to handle multiple languages with high proficiency, making it a versatile tool for Natural Language Processing (NLP) tasks across different linguistic backgrounds.

Key Features of Phi-3.5

Multilingual Capabilities: The model supports a wide variety of languages, including major world languages such as English, Spanish, Chinese, and French. For example, it can translate a sentence or document from one language to another while preserving context and meaning.

Fine-Tuning Ability: The model can be fine-tuned for specific use cases. For instance, in a customer support setting, the Phi-3.5 model can be fine-tuned to understand the nuances of different languages used by customers across the globe, improving response accuracy.

High Performance in NLP Tasks: Phi-3.5 is optimized for tasks like text classification, machine translation, and summarization. It is built to handle large-scale datasets and to produce coherent, contextually correct language outputs.

Applications in Real-World Scenarios

Here are a few real-world applications where the Phi-3.5 model can be used:

Customer Support Chatbots: For companies with global customer bases, the model’s multilingual support can enhance chatbot capabilities, allowing for real-time responses in a customer’s native language, no matter where they are located.

Content Creation for Global Markets: Businesses can use Phi-3.5 to automatically generate or translate content for different regions. For example, marketing copy can be adapted to fit cultural and linguistic nuances in multiple languages.
Document Summarization Across Languages: The model can summarize long documents or articles written in one language and then translate the summary into another language, improving access to information for non-native speakers.

Why Choose Phi-3.5 for Your Project?

Versatility: It’s not limited to just one or two languages but performs well across many.

Customization: The ability to fine-tune it for particular use cases or industries makes it highly adaptable.

Ease of Deployment: With tools like Azure ML and Ollama, deploying Phi-3.5 in the cloud or locally is accessible even for smaller teams.

Objective Of Blog

Small Language Models (SLMs) are at the forefront of advancements in Natural Language Processing, offering fine-tuned, high-performance solutions for specific tasks and languages. Among these, the Phi-3.5 model has emerged as a powerful tool, excelling in its multilingual capabilities. Whether you're working with English, Spanish, Mandarin, or any other major world language, Phi-3.5 offers robust, reliable language processing that adapts to various real-world applications. This makes it an ideal choice for businesses looking to deploy multilingual chatbots, automate content generation, or translate customer interactions in real time. Moreover, its fine-tuning ability allows for customization, making Phi-3.5 versatile across industries and tasks.

Customization and Fine-Tuning for Different Applications

The Phi-3.5 model is not limited to general language understanding tasks. It can be fine-tuned for specific applications, industries, and languages, allowing users to tailor its performance to meet their needs.

Customizable for Industry-Specific Use Cases: With fine-tuning, the model can be trained further on domain-specific data to handle particular use cases like legal document translation, medical records analysis, or technical support.
Example: A healthcare company can fine-tune Phi-3.5 to understand medical terminology in multiple languages, enabling it to assist in processing patient records or generating multilingual health reports. Adapting for Specialized Tasks: You can train Phi-3.5 to perform specialized tasks like sentiment analysis, text summarization, or named entity recognition in specific languages. Fine-tuning helps enhance the model's ability to handle unique text formats or requirements. Example: A marketing team can fine-tune the model to analyse customer feedback in different languages to identify trends or sentiment across various regions. The model can quickly classify feedback as positive, negative, or neutral, even in less widely spoken languages like Arabic or Korean. Applications in Real-World Scenarios To illustrate the versatility of Phi-3.5, here are some real-world applications where this model excels, demonstrating its multilingual capabilities and customization potential: Case Study 1: Multilingual Customer Support Chatbots Many global companies rely on chatbots to handle customer queries in real-time. With Phi-3.5’s multilingual abilities, businesses can deploy a single model that understands and responds in multiple languages, cutting down on the need to create language-specific chatbots. Example: A global airline can use Phi-3.5 to power its customer service bot. Passengers from different countries can inquire about their flight status or baggage policies in their native languages—whether it's Japanese, Hindi, or Portuguese—and the model responds accurately in the appropriate language. Case Study 2: Multilingual Content Generation Phi-3.5 is also useful for businesses that need to generate content in different languages. For example, marketing campaigns often require creating region-specific ads or blog posts in multiple languages. 
Phi-3.5 can help automate this process by generating localized content that is not just translated but adapted to fit the cultural context of the target audience. Example: An international cosmetics brand can use Phi-3.5 to automatically generate product descriptions for different regions. Instead of merely translating a product description from English to Spanish, the model can tailor the description to fit cultural expectations, using language that resonates with Spanish-speaking audiences. Case Study 3: Document Translation and Summarization Phi-3.5 can be used to translate or summarize complex documents across languages. Its ability to preserve meaning and context across languages makes it ideal for industries where accuracy is crucial, such as legal or academic fields. Example: A legal firm working on cross-border cases can use Phi-3.5 to translate contracts or legal briefs from German to English, ensuring the context and legal terminology are accurately preserved. It can also summarize lengthy documents in multiple languages, saving time for legal teams. Fine-Tuning Phi-3.5 Model Fine-tuning a language model like Phi-3.5 is a crucial step in adapting it to perform specific tasks or cater to specific domains. This section will walk through what fine-tuning is, its importance in NLP, and how to fine-tune the Phi-3.5 model using Azure Model Catalog for different languages and tasks. We'll also explore a code example and best practices for evaluating and validating the fine-tuned model. What is Fine-Tuning? Fine-tuning refers to the process of taking a pre-trained model and adapting it to a specific task or dataset by training it further on domain-specific data. In the context of NLP, fine-tuning is often required to ensure that the language model understands the nuances of a particular language, industry-specific terminology, or a specific use case. 
Why Fine-Tuning is Necessary Pre-trained Large Language Models (LLMs) are trained on diverse datasets and can handle various tasks like text summarization, generation, and question answering. However, they may not perform optimally in specialized domains without fine-tuning. The goal of fine-tuning is to enhance the model's performance on specific tasks by leveraging its prior knowledge while adapting it to new contexts. Challenges of Fine-Tuning Resource Intensiveness: Fine-tuning large models can be computationally expensive, requiring significant hardware resources. Storage Costs: Each fine-tuned model can be large, leading to increased storage needs when deploying multiple models for different tasks. LoRA and QLoRA To address these challenges, techniques like LoRA (Low-rank Adaptation) and QLoRA (Quantized Low-rank Adaptation) have emerged. Both methods aim to make the fine-tuning process more efficient: LoRA: This technique reduces the number of trainable parameters by introducing low-rank matrices into the model while keeping the original model weights frozen. This approach minimizes memory usage and speeds up the fine-tuning process. QLoRA: An enhancement of LoRA, QLoRA incorporates quantization techniques to further reduce memory requirements and increase the efficiency of the fine-tuning process. It allows for the deployment of large models on consumer hardware without the extensive resource demands typically associated with full fine-tuning. 
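To make the savings concrete: LoRA leaves a weight matrix of shape d_out × d_in frozen and trains two small matrices, B (d_out × r) and A (r × d_in), whose product is added to it. The sketch below simply counts parameters for that setup. The 3072 × 3072 layer size is an illustrative assumption, not a figure taken from this article.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for the LoRA factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


# Illustrative layer size (assumed) with rank r = 16.
d_out, d_in, rank = 3072, 3072, 16
full = full_finetune_params(d_out, d_in)   # 9,437,184
lora = lora_params(d_out, d_in, rank)      # 98,304
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For a square layer the ratio reduces to 2r/d, so at rank 16 on a 3072-wide layer only about 1% of that layer's parameters are trained. QLoRA keeps the same trainable factors and additionally stores the frozen base weights in a quantized format.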
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from peft import LoraConfig, TaskType, get_peft_model

# Load a pre-trained model (BERT is used here as a small stand-in for illustration)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Sequence classification; keeps the classifier head trainable
    r=16,                        # Rank of the low-rank update matrices
    lora_alpha=32,               # Scaling factor applied to the update
    lora_dropout=0.1,
)

# Wrap the model with LoRA; the original weights stay frozen
model = get_peft_model(model, lora_config)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
)

# Create a Trainer
# train_dataset and eval_dataset are assumed to be tokenized datasets prepared beforehand
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Start fine-tuning
trainer.train()

This code outlines how to set up a model for fine-tuning using LoRA, which can significantly reduce the resource requirements while still adapting the model effectively to specific tasks. In summary, fine-tuning with methods like LoRA and QLoRA makes it feasible to adapt pre-trained models to specific NLP applications and deploy them efficiently across domains.

Why is Fine-Tuning Important in NLP?

Task-Specific Performance: Fine-tuning helps improve performance for tasks like text classification, machine translation, or sentiment analysis in specific domains (e.g., legal, healthcare).

Language-Specific Adaptation: Since models like Phi-3.5 are trained on general datasets, fine-tuning helps them handle industry-specific jargon or linguistic quirks.

Efficient Resource Utilization: Instead of training a model from scratch, fine-tuning leverages pre-trained knowledge, saving computational resources and time.

Steps to Fine-Tune Phi-3.5 in Azure AI Foundry

Fine-tuning the Phi-3.5 model in Azure AI Foundry involves several key steps.
Azure provides a user-friendly interface to streamline model customization, allowing you to quickly configure, train, and deploy models.

Step 1: Setting Up the Environment in Azure AI Foundry

1) Access Azure AI Foundry: Log in to Azure AI Foundry. If you don’t have an account, you can create one and set up a workspace.
2) Create a New Experiment: Once in Azure AI Foundry, create a new training experiment. Choose the Phi-3.5 model from the pre-trained models provided in the model catalog.
3) Set Up the Data for Fine-Tuning: Upload your custom dataset for fine-tuning. Ensure the dataset is in a compatible format (e.g., CSV, JSON). For instance, if you are fine-tuning the model for a customer service chatbot, you could upload customer queries in different languages.

Step 2: Configure Fine-Tuning Settings

1) Select the Training Dataset: Select the dataset you uploaded and link it to the Phi-3.5 model.
2) Configure the Hyperparameters: Set up training hyperparameters like the number of epochs, learning rate, and batch size. You may need to experiment with these settings to achieve optimal performance.
3) Choose the Task Type: Specify the task you are fine-tuning for, such as text classification, translation, or summarization. This helps Azure AI Foundry understand how to optimize the model during fine-tuning.
4) Fine-Tuning for Specific Languages: If you are fine-tuning for a specific language or multilingual tasks, ensure that the dataset is labeled appropriately and contains enough examples in the target language(s). This will allow Phi-3.5 to learn language-specific features effectively.

Step 3: Train the Model

1) Launch the Training Process: Once the configuration is complete, launch the training process in Azure AI Foundry. Depending on the size of your dataset and the complexity of the model, this could take some time.
2) Monitor Training Progress: Use Azure AI Foundry’s built-in monitoring tools to track performance metrics such as loss, accuracy, and F1 score.
You can view the model’s progress during training to ensure that it is learning effectively.

Code Example: Fine-Tuning Phi-3.5 for a Specific Use Case

Here's an illustrative Python-style outline of the fine-tuning flow for a customer support chatbot in multiple languages. Treat it as pseudocode: the module, class, and method names below are placeholders that mirror the steps above, not a published Azure SDK surface, so consult the Azure AI Foundry documentation for the actual client libraries.

# NOTE: illustrative pseudocode; Foundry, Model.load, load_dataset,
# fine_tune, and save are placeholder names, not a real SDK API.
from azure.ai import Foundry
from azure.ai.model import Model

# Initialize Azure AI Foundry
foundry = Foundry()

# Load the Phi-3.5 model
model = Model.load("phi-3.5")

# Set up the training dataset
training_data = foundry.load_dataset("customer_queries_dataset")

# Fine-tune the model
model.fine_tune(training_data, epochs=5, learning_rate=0.001)

# Save the fine-tuned model
model.save("fine_tuned_phi_3.5")

Best Practices for Evaluating and Validating Fine-Tuned Models

Once the model is fine-tuned, it's essential to evaluate and validate its performance before deploying it in production.

Split Data for Validation: Always split your dataset into training and validation sets. Evaluating on data the model has not seen during training is how overfitting is detected.

Evaluate Key Metrics: Measure performance using key metrics such as:
Accuracy: The proportion of correct predictions.
F1 Score: The harmonic mean of precision and recall.
Confusion Matrix: Helps visualize true vs. false predictions for classification tasks.

Cross-Language Validation: If the model is fine-tuned for multiple languages, test its performance across all supported languages to ensure consistency and accuracy.

Test in Production-Like Environments: Before full deployment, test the fine-tuned model in a production-like environment to catch any potential issues.

Continuous Monitoring and Re-Fine-Tuning: Once deployed, continuously monitor the model’s performance and re-fine-tune it periodically as new data becomes available.

Deploying Phi-3.5 Model

After fine-tuning the Phi-3.5 model, the next crucial step is deploying it to make it accessible for real-world applications.
This section will cover two key deployment strategies: deploying in Azure for cloud-based scaling and reliability, and deploying locally with AI Toolkit for simpler offline usage. Each deployment strategy offers its own advantages depending on the use case. Deploying in Azure Azure provides a powerful environment for deploying machine learning models at scale, enabling organizations to deploy models like Phi-3.5 with high availability, scalability, and robust security features. Azure AI Foundry simplifies the entire deployment pipeline. Set Up Azure AI Foundry Workspace: Log in to Azure AI Foundry and navigate to the workspace where the Phi-3.5 model was fine-tuned. Go to the Deployments section and create a new deployment environment for the model. Choose Compute Resources: Compute Target: Select a compute target suitable for your deployment. For large-scale usage, it’s advisable to choose a GPU-based compute instance. Example: Choose an Azure Kubernetes Service (AKS) cluster for handling large-scale requests efficiently. Configure Scaling Options: Azure allows you to set up auto-scaling based on traffic. This ensures that the model can handle surges in demand without affecting performance. Model Deployment Configuration: Create an Inference Pipeline: In Azure AI Foundry, set up an inference pipeline for your model. Specify the Model: Link the fine-tuned Phi-3.5 model to the deployment pipeline. Deploy the Model: Select the option to deploy the model to the chosen compute resource. Test the Deployment: Once the model is deployed, test the endpoint by sending sample requests to verify the predictions. Configuration Steps (Compute, Resources, Scaling) During deployment, Azure AI Foundry allows you to configure essential aspects like compute type, resource allocation, and scaling options. Compute Type: Choose between CPU or GPU clusters depending on the computational intensity of the model. 
Resource Allocation: Define the minimum and maximum resources to be allocated for the deployment. For real-time applications, use Azure Kubernetes Service (AKS) for high availability. For batch inference, Azure Container Instances (ACI) is suitable. Auto-Scaling: Set up automatic scaling of the compute instances based on the number of requests. For example, configure the deployment to start with 1 node and scale to 10 nodes during peak usage. Cost Comparison: Phi-3.5 vs. Larger Language Models When comparing the costs of using Phi-3.5 with larger language models (LLMs), several factors come into play, including computational resources, pricing structures, and performance efficiency. Here’s a breakdown: Cost Efficiency Phi-3.5: Designed as a Small Language Model (SLM), Phi-3.5 is optimized for lower computational costs. It offers competitive performance at a fraction of the cost of larger models, making it suitable for budget-conscious projects. The smaller size (3.8 billion parameters) allows for reduced resource consumption during both training and inference. Larger Language Models (e.g., GPT-3.5): Typically require more computational resources, leading to higher operational costs. Larger models may incur additional costs for storage and processing power, especially in cloud environments. Performance vs. Cost Performance Parity: Phi-3.5 has been shown to achieve performance parity with larger models on various benchmarks, including language comprehension and reasoning tasks. This means that for many applications, Phi-3.5 can deliver similar results to larger models without the associated costs. Use Case Suitability: For simpler tasks or applications that do not require extensive factual knowledge, Phi-3.5 is often the more cost-effective choice. Larger models may still be preferred for complex tasks requiring deep contextual understanding or extensive factual recall. 
Pricing Structure Azure Pricing: Phi-3.5 is available through Azure with a pay-as-you-go billing model, allowing users to scale costs based on usage. Pricing details for Phi-3.5 can be found on the Azure pricing page, where users can customize options based on their needs. Code Example: API Setup and Endpoints for Live Interaction Below is a Python code snippet demonstrating how to interact with a deployed Phi-3.5 model via an API in Azure: import requests # Define the API endpoint and your API key api_url = "https://<your-azure-endpoint>/predict" api_key = "YOUR_API_KEY" # Prepare the input data input_data = { "text": "What are the benefits of renewable energy?" } # Make the API request response = requests.post(api_url, json=input_data, headers={"Authorization": f"Bearer {api_key}"}) # Print the model's response if response.status_code == 200: print("Model Response:", response.json()) else: print("Error:", response.status_code, response.text) Deploying Locally with AI Toolkit For developers who prefer to run models on their local machines, the AI Toolkit provides a convenient solution. The AI Toolkit is a lightweight platform that simplifies local deployment of AI models, allowing for offline usage, experimentation, and rapid prototyping. Deploying the Phi-3.5 model locally using the AI Toolkit is straightforward and can be used for personal projects, testing, or scenarios where cloud access is limited. Introduction to AI Toolkit The AI Toolkit is an easy-to-use platform for deploying language models locally without relying on cloud infrastructure. It supports a range of AI models and enables developers to work in a low-latency environment. Advantages of deploying locally with AI Toolkit: Offline Capability: No need for continuous internet access. Quick Experimentation: Rapid prototyping and testing without the delays of cloud deployments. 
Setup Guide: Installing and Running Phi-3.5 Locally Using AI Toolkit Install AI Toolkit: Go to the AI Toolkit website and download the platform for your operating system (Linux, macOS, or Windows). Install AI Toolkit by running the appropriate installation command in your terminal. Download the Phi-3.5 Model: Once AI Toolkit is installed, you can download the Phi-3.5 model locally by running: 3. Run the Model Locally: After downloading the model, start a local session by running: This will launch a local server on your machine where the model will be available for interaction. Code Example: Using Phi-3.5 Locally in a Project Below is a Python code example demonstrating how to send a query to the locally deployed Phi-3.5 model running on the AI Toolkit. import requests # Define the local endpoint local_url = "http://localhost:8000/predict" # Prepare the input data input_data = { "text": "What are the benefits of renewable energy?" } # Make the API request response = requests.post(local_url, json=input_data) # Print the model's response if response.status_code == 200: print("Model Response:", response.json()) else: print("Error:", response.status_code, response.text) Comparing Language Capabilities Test Results: How Phi-3.5 Handles Different Languages The Phi-3.5 model demonstrates robust multilingual capabilities, effectively processing and generating text in various languages. Below are comparative examples showcasing its performance in English, Spanish, and Mandarin: English Example: Input: "What are the benefits of renewable energy?" Output: "Renewable energy sources, such as solar and wind, reduce greenhouse gas emissions and promote sustainability." Spanish Example: Input: "¿Cuáles son los beneficios de la energía renovable?" Output: "Las fuentes de energía renovable, como la solar y la eólica, reducen las emisiones de gases de efecto invernadero y promueven la sostenibilidad." 
Mandarin Example: Input: "可再生能源的好处是什么？" Output: "可再生能源，如太阳能和风能，减少温室气体排放，促进可持续发展。" Performance Benchmarking and Evaluation Across Different Languages Benchmarking Phi-3.5 across different languages involves evaluating its accuracy, fluency, and contextual understanding. For instance, using BLEU scores and human evaluations, the model can be assessed on its translation quality and coherence in various languages. Real-World Use Case: Multilingual Customer Service Chatbot A practical application of Phi-3.5's multilingual capabilities is in developing a customer service chatbot that can interact with users in their preferred language. For instance, the chatbot could provide support in English, Spanish, and Mandarin, ensuring a wider reach and better user experience. Optimizing and Validating Phi-3.5 Model Model Performance Metrics To validate the model's performance in different scenarios, consider the following metrics: Accuracy: Measure how often the model's outputs are correct or align with expected results. Fluency: Assess the naturalness and readability of the generated text. Contextual Understanding: Evaluate how well the model understands and responds to context-specific queries. Tools to Use in Azure and Ollama for Evaluation Azure Cognitive Services: Utilize tools like Text Analytics and Translator to evaluate performance. Ollama: Use local testing environments to quickly iterate and validate model outputs. Conclusion In summary, Phi-3.5 exhibits impressive multilingual capabilities, effective deployment options, and robust performance metrics. Its ability to handle various languages makes it a versatile tool for natural language processing applications. Phi-3.5 stands out for its adaptability and performance in multilingual contexts, making it an excellent choice for future NLP projects, especially those requiring diverse language support. 
We encourage readers to experiment with the Phi-3.5 model using Azure AI Foundry or the AI Toolkit, explore fine-tuning techniques for their specific use cases, and share their findings with the community. For more information on optimized fine-tuning techniques, check out the Ignite Fine-Tuning Workshop. References Customize the Phi-3.5 family of models with LoRA fine-tuning in Azure Fine-tune Phi-3.5 models in Azure Fine Tuning with Azure AI Foundry and Microsoft Olive Hands on Labs and Workshop Customize a model with fine-tuning https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/fine-tuning?tabs=azure-openai%2Cturbo%2Cpython-new&pivots=programming-language-studio Microsoft AI Toolkit - AI Toolkit for VSCode
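The benchmarking section mentions BLEU scores. As a rough illustration of the idea behind that metric, the sketch below computes clipped n-gram precision, the building block of BLEU, using only the standard library. The function names and the whitespace tokenizer are assumptions made for this example and are not part of any Phi-3.5 tooling. Whitespace splitting does not segment Mandarin text, so a real evaluation would need a proper tokenizer and a full BLEU implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference.

    Each candidate n-gram is counted at most as often as it occurs in the
    reference ("clipping"), which is how BLEU avoids rewarding repetition.
    """
    cand = ngrams(candidate.split(), n)  # assumption: whitespace tokens
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


if __name__ == "__main__":
    # Hypothetical reference answer and model output for illustration.
    reference = "renewable energy reduces greenhouse gas emissions"
    output = "renewable energy reduces emissions"
    print(clipped_ngram_precision(output, reference, n=1))
    print(clipped_ngram_precision(output, reference, n=2))
```

Comparing the unigram and bigram scores for the same pair shows how word choice and word order are rewarded separately, which is why BLEU combines several n-gram sizes.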
Sharda_Kaur
May 25, 2026 Place Educator Developer Blog
1.9KViews
1like
2Comments
Unleashing the Power of Model Context Protocol (MCP): A Game-Changer in AI Integration
Artificial Intelligence is evolving rapidly, and one of the most pressing challenges is enabling AI models to interact effectively with external tools, data sources, and APIs. The Model Context Protocol (MCP) solves this problem by acting as a bridge between AI models and external services, creating a standardized communication framework that enhances tool integration, accessibility, and AI reasoning capabilities. What is Model Context Protocol (MCP)? MCP is a protocol designed to enable AI models, such as Azure OpenAI models, to interact seamlessly with external tools and services. Think of MCP as a universal USB-C connector for AI, allowing language models to fetch information, interact with APIs, and execute tasks beyond their built-in knowledge. Key Features of MCP Standardized Communication – MCP provides a structured way for AI models to interact with various tools. Tool Access & Expansion – AI assistants can now utilize external tools for real-time insights. Secure & Scalable – Enables safe and scalable integration with enterprise applications. Multi-Modal Integration – Supports STDIO, SSE (Server-Sent Events), and WebSocket communication methods. MCP Architecture & How It Works MCP follows a client-server architecture that allows AI models to interact with external tools efficiently. Here’s how it works: Components of MCP MCP Host – The AI model (e.g., Azure OpenAI GPT) requesting data or actions. MCP Client – An intermediary service that forwards the AI model's requests to MCP servers. MCP Server – Lightweight applications that expose specific capabilities (APIs, databases, files, etc.). Data Sources – Various backend systems, including local storage, cloud databases, and external APIs. Data Flow in MCP The AI model sends a request (e.g., "fetch user profile data"). The MCP client forwards the request to the appropriate MCP server. The MCP server retrieves the required data from a database or API. 
The response is sent back to the AI model via the MCP client. Integrating MCP with Azure OpenAI Services Microsoft has integrated MCP with Azure OpenAI Services, allowing GPT models to interact with external services and fetch live data. This means AI models are no longer limited to static knowledge but can access real-time information. Benefits of Azure OpenAI Services + MCP Integration ✔ Real-time Data Fetching – AI assistants can retrieve fresh information from APIs, databases, and internal systems. ✔ Contextual AI Responses – Enhances AI responses by providing accurate, up-to-date information. ✔ Enterprise-Ready – Secure and scalable for business applications, including finance, healthcare, and retail. Hands-On Tools for MCP Implementation To implement MCP effectively, Microsoft provides two powerful tools: Semantic Workbench and AI Gateway. Microsoft Semantic Workbench A development environment for prototyping AI-powered assistants and integrating MCP-based functionalities. Features: Build and test multi-agent AI assistants. Configure settings and interactions between AI models and external tools. Supports GitHub Codespaces for cloud-based development. Explore Semantic Workbench Workbench interface examples Microsoft AI Gateway A plug-and-play interface that allows developers to experiment with MCP using Azure API Management. Features: Credential Manager – Securely handle API credentials. Live Experimentation – Test AI model interactions with external tools. Pre-built Labs – Hands-on learning for developers. 
Explore AI Gateway

Setting Up MCP with Azure OpenAI Services

Step 1: Create a Virtual Environment

First, create a virtual environment using Python:

    python -m venv .venv

Activate the environment:

    # Windows
    .venv\Scripts\activate
    # macOS/Linux
    source .venv/bin/activate

Step 2: Install Required Libraries

Create a requirements.txt file and add the following dependencies:

    langchain-mcp-adapters
    langgraph
    langchain-openai

Then, install the required libraries:

    pip install -r requirements.txt

Step 3: Set Up OpenAI API Key

Ensure you have your OpenAI API key set up:

    # Windows
    setx OPENAI_API_KEY "<your_api_key>"
    # macOS/Linux
    export OPENAI_API_KEY="<your_api_key>"

Building an MCP Server

This server performs basic mathematical operations like addition and multiplication.

Create the Server File

First, create a new Python file:

    touch math_server.py

Then, implement the server:

    from mcp.server.fastmcp import FastMCP

    # Initialize the server
    mcp = FastMCP("Math")

    # Register each function as an MCP tool with the decorator
    @mcp.tool()
    def add(a: int, b: int) -> int:
        """Add two numbers."""
        return a + b

    @mcp.tool()
    def multiply(a: int, b: int) -> int:
        """Multiply two numbers."""
        return a * b

    if __name__ == "__main__":
        # Serve over standard input/output
        mcp.run(transport="stdio")

Your MCP server is now ready to run.

Building an MCP Client

This client connects to the MCP server and interacts with it.
Create the Client File

First, create a new file:

    touch client.py

Then, implement the client:

    import asyncio

    from langchain_mcp_adapters.tools import load_mcp_tools
    from langchain_openai import ChatOpenAI
    from langgraph.prebuilt import create_react_agent
    from mcp import ClientSession, StdioServerParameters
    from mcp.client.stdio import stdio_client

    # Define server parameters: the client launches math_server.py itself
    server_params = StdioServerParameters(
        command="python",
        args=["math_server.py"],
    )

    # Define the model
    model = ChatOpenAI(model="gpt-4o")

    async def run_agent():
        async with stdio_client(server_params) as (read, write):
            async with ClientSession(read, write) as session:
                await session.initialize()
                # Expose the server's MCP tools as LangChain tools
                tools = await load_mcp_tools(session)
                agent = create_react_agent(model, tools)
                agent_response = await agent.ainvoke({"messages": "what's (4 + 6) x 14?"})
                # The last message holds the agent's final answer
                return agent_response["messages"][-1].content

    if __name__ == "__main__":
        result = asyncio.run(run_agent())
        print(result)

Your client is now set up and ready to interact with the MCP server.

Running the MCP Server and Client

Step 1: Start the MCP Server

Open a terminal and run:

    python math_server.py

This starts the MCP server and confirms that it launches without errors. Because the client uses the stdio transport, it starts its own copy of math_server.py as a subprocess, so this step is a check and not a requirement.

Step 2: Run the MCP Client

In another terminal, run:

    python client.py

Expected Output

The final message should report 140.

This means the AI agent correctly computed (4 + 6) x 14 using both the MCP server and GPT-4o.

Conclusion

Integrating MCP with Azure OpenAI Services enables AI applications to securely interact with external tools, enhancing functionality beyond text-based responses. With standardized communication and improved AI capabilities, developers can build smarter and more interactive AI-powered solutions. By following this guide, you can set up an MCP server and client, unlocking the full potential of AI with structured external interactions.

Next Steps: Explore more MCP tools and integrations. Extend your MCP setup to work with additional APIs. Deploy your solution in a cloud environment for broader accessibility. For further details, visit the GitHub repository for MCP integration examples and best practices.
MCP GitHub Repository MCP Documentation Semantic Workbench AI Gateway MCP Video Walkthrough MCP Blog MCP Github End to End Demo
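For readers curious about what travels between client and server in the data flow described earlier: MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method. The sketch below mimics that exchange with plain dictionaries and a toy dispatcher. It is a simplified illustration and not the MCP SDK; `make_tool_call`, `handle_request`, and the `TOOLS` table are hypothetical names, and a real server also handles initialization, tool listing, and richer error reporting.

```python
import json

# Hypothetical tool table mirroring the math server in this post.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}


def make_tool_call(request_id, name, arguments):
    """Build a JSON-RPC 2.0 request in the shape MCP uses for tool calls."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def error(request_id, code, message):
    """Build a JSON-RPC 2.0 error response."""
    return {"jsonrpc": "2.0", "id": request_id,
            "error": {"code": code, "message": message}}


def handle_request(request):
    """Toy server side: run the named tool and wrap the result as text."""
    if request.get("method") != "tools/call":
        return error(request.get("id"), -32601, "Method not found")
    params = request.get("params", {})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        return error(request.get("id"), -32602, "Unknown tool")
    value = tool(**params.get("arguments", {}))
    return {
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"content": [{"type": "text", "text": str(value)}]},
    }


if __name__ == "__main__":
    request = make_tool_call(1, "add", {"a": 4, "b": 6})
    print(json.dumps(request))
    print(json.dumps(handle_request(request)))
```

The SDK and the LangChain adapter hide this layer entirely, but seeing the raw shape can make debugging a misbehaving server easier.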
Sharda_Kaur
May 25, 2026 Place Educator Developer Blog
63KViews
11likes
6Comments
Understanding Azure OpenAI Service Quotas and Limits: A Beginner-Friendly Guide
Azure OpenAI Service allows developers, researchers, and students to integrate powerful AI models like GPT-4, GPT-3.5, and DALL·E into their applications. But with great power comes great responsibility and limits. Before you dive into building your next AI-powered solution, it's crucial to understand how quotas and limits work in the Azure OpenAI ecosystem. This guide is designed to help students and beginners easily understand the concept of quotas, limits, and how to manage them effectively.

What Are Quotas and Limits?

Think of Azure's quotas as your "AI data pack." It defines how much you can use the service. Meanwhile, limits are hard boundaries set by Azure to ensure fair use and system stability.

Quota: The maximum number of resources (e.g., tokens, requests) allocated to your Azure subscription.
Limit: The technical cap imposed by Azure on specific resources (e.g., number of files, deployments).

Key Metrics: TPM & RPM

Tokens Per Minute (TPM)

TPM refers to how many tokens you can use per minute across all your requests in each region. A token is a chunk of text. For example, the word "Hello" is 1 token, but "Understanding" might be 2 tokens. Each model has its own default TPM. Example: GPT-4 might allow 240,000 tokens per minute. You can split this quota across multiple deployments.

Requests Per Minute (RPM)

RPM defines how many API requests you can make every minute. For instance, GPT-3.5-turbo might allow 350 RPM. DALL·E image generation models might allow 6 RPM.

Deployment, File, and Training Limits

Here are some standard limits imposed on your OpenAI resource:

Resource Type | Limit
Standard model deployments | 32
Fine-tuned model deployments | 5
Training jobs | 100 total per resource (1 active at a time)
Fine-tuning files | 50 files (total size: 1 GB)
Max prompt tokens per request | Varies by model (e.g., 4096 tokens for GPT-3.5)

How to View and Manage Your Quota

Step-by-Step: Go to the Azure Portal. Navigate to your Azure OpenAI resource.
Click on "Usage + quotas" in the left-hand menu. You will see TPM, RPM, and current usage status. To Request More Quota: In the same "Usage + quotas" panel, click on "Request quota increase". Fill in the form: Select the region. Choose the model family (e.g., GPT-4, GPT-3.5). Enter the desired TPM and RPM values. Submit and wait for Azure to review and approve. What is Dynamic Quota? Sometimes, Azure gives you extra quota based on demand and availability. “Dynamic quota” is not guaranteed and may increase or decrease. Useful for short-term spikes but should not be relied on for production apps. Example: During weekends, your GPT-3.5 TPM may temporarily increase if there's less traffic in your region. Best Practices for Students Monitor Regularly: Use the Azure Portal to keep an eye on your usage. Batch Requests: Combine multiple tasks in one API call to save tokens. Start Small: Begin with GPT-3.5 before requesting GPT-4 access. Plan Ahead: If you're preparing a demo or a project, request quota in advance. Handle Limits Gracefully: Code should manage 429 Too Many Requests errors. Quick Resources Azure OpenAI Quotas and Limits How to Request Quota in Azure Join the Conversation on Azure AI Foundry Discussions! Have ideas, questions, or insights about AI? Don't keep them to yourself! Share your thoughts, engage with experts, and connect with a community that’s shaping the future of artificial intelligence. 🧠✨ 👉 Click here to join the discussion!
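The "Handle Limits Gracefully" advice above can be sketched as a retry loop with exponential backoff. This is a minimal sketch, assuming the caller supplies a `send` function that returns a status code, a headers dictionary, and a body; `call_with_backoff` and its parameters are hypothetical names and not part of any Azure SDK. It honors a `Retry-After` header when the service sends one as a number of seconds, and otherwise doubles the wait on each attempt.

```python
import time


def call_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops returning HTTP 429 or retries run out.

    send: hypothetical callable returning (status_code, headers, body).
    sleep: injectable so the function can be exercised without real waiting.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Prefer the server's hint; fall back to exponential backoff.
        try:
            delay = float(headers.get("Retry-After"))
        except (TypeError, ValueError):
            delay = base_delay * (2 ** attempt)
        sleep(delay)
    raise RuntimeError("Rate limit still exceeded after retries")


if __name__ == "__main__":
    # Simulated service: two 429 responses, then success.
    responses = iter([
        (429, {"Retry-After": "2"}, ""),
        (429, {}, ""),
        (200, {}, "ok"),
    ])
    waits = []
    print(call_with_backoff(lambda: next(responses), sleep=waits.append))
    print(waits)
```

Passing `sleep` as a parameter keeps the function testable; in an application the default `time.sleep` would be used and `send` would wrap the actual HTTP call.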
Sharda_Kaur
May 25, 2026 Place Educator Developer Blog
3KViews
0likes
0Comments
Getting Started with the AI Toolkit: A Beginner’s Guide with Demos and Resources
If you're curious about building AI solutions but don’t know where to start, Microsoft’s AI Toolkit is a great place to begin. Whether you’re a student, developer, or just someone exploring AI for the first time, this toolkit helps you build real-world solutions using Microsoft’s powerful AI services. In this blog, I’ll walk you through what the AI Toolkit is, how you can get started, and where you can find helpful demos and ready-to-use code samples.

What is the AI Toolkit?

The AI Toolkit is a collection of tools, templates, and sample apps that make it easier to build AI-powered applications and copilots using Microsoft Azure. With the AI Toolkit, you can:

- Build intelligent apps without needing deep AI expertise.
- Use templates and guides that show you how everything works.
- Quickly prototype and deploy apps with natural language, speech, search, and more.

Watch the AI Toolkit in Action

Microsoft has created a video playlist that covers the AI Toolkit and shows you how to build apps step-by-step. It is especially useful for developers who want to bring AI into their projects, but also for beginners who want to learn by doing. You can watch the full playlist here:

AI Toolkit Playlist – https://aka.ms/AIToolkit/videos

These videos help you understand the flow of building AI agents, using Azure OpenAI, and other cognitive services in a hands-on way.

Explore Sample Projects on GitHub

Microsoft also provides a public GitHub repository where you can find real code examples built using the AI Toolkit. Here’s the GitHub repo:

AI Toolkit Samples – https://github.com/Azure-Samples/AI_Toolkit_Samples

This repository includes:

- Sample apps using Azure AI services like OpenAI, Cognitive Search, and Speech.
- Instructions to deploy apps using Azure.
- Code that you can clone, test, and build on top of.

You don’t have to start from scratch: just open the code, understand the structure, and make small edits to experiment.
How to Get Started

Here’s a simple path if you’re just starting:

1. Watch 2 or 3 videos from the AI Toolkit Playlist.
2. Go to the GitHub repository and try running one of the examples.
3. Make small changes to the code (like updating the prompt or output).
4. Try deploying the solution on Azure by following the guide in the repo.
5. Keep building and learning.

Why This Toolkit is Worth Exploring

As someone who is also learning and experimenting, I found this toolkit to be:

- Easy to understand, even for beginners.
- Focused on real-world applications, not just theory.
- Helpful for building responsible AI solutions with good documentation.

It gives a complete picture, from writing code to deploying apps.

Final Thoughts

The AI Toolkit helps you start your journey in AI without feeling overwhelmed. It provides real code, real use cases, and practical demos. With the support of Microsoft Learn and Azure samples, you can go from learning to building in no time. If you’re serious about building with AI, this is a resource worth exploring.

Continue the discussion in the Azure AI Foundry Discord community at https://aka.ms/AI/discord

References

- AI Toolkit Playlist (YouTube): https://aka.ms/AIToolkit/videos
- AI Toolkit GitHub Repository: https://github.com/Azure-Samples/AI_Toolkit_Samples
- Microsoft Learn, AI Toolkit Documentation: https://learn.microsoft.com/en-us/azure/ai-services/toolkit/
- Azure AI Services: https://azure.microsoft.com/en-us/products/ai-services/
Sharda_Kaur
May 25, 2026 Place Educator Developer Blog
1.9KViews
0likes
0Comments
Signing in to Microsoft Foundry from OpenClaw using Azure AD: a smoother way to bring your models in
This post is a quick update to walk through the new flow. If you read the previous one, think of this as the easier path I wish I had the first time round. If you have not seen the original, you can find it here: Integrating Microsoft Foundry with OpenClaw: Step by Step Model Configuration | Microsoft Community Hub Pre-requisite: You will need the Azure CLI (azure-cli) installed on your machine. The official install guide for Linux is here: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-linux?view=azure-cli-latest I am on Linux so I went the Homebrew route, which keeps things simple. The formula is here: https://formulae.brew.sh/formula/azure-cli Microsoft also has official docs covering the Homebrew/Linuxbrew install: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-macos?view=azure-cli-latest#install-with-homebrew Once Homebrew is ready, run this in your terminal: brew install azure-cli Why this matters: Before this update, every Foundry model you wanted to use in OpenClaw needed its own API key and endpoint pasted into the config. It worked, but it was tedious, and keys are easy to leak if you are copying them around. The Azure AD path solves both problems. You authenticate as yourself (or a service principal), OpenClaw asks Azure for the list of Foundry resources you have access to, and it brings the models in automatically. Signing in to Microsoft Foundry from OpenClaw via Azure AD A device-code OAuth handshake replaces the old static-API-key flow. OpenClaw delegates auth to the local Azure CLI; the CLI handles the browser-side sign-in, holds the resulting tokens, and refreshes them silently. OpenClaw then walks the Azure resource graph, subscriptions → Foundry resources → model deployments and registers each model into its own config. No API keys move through OpenClaw at any point. Sequence diagram of the OAuth 2.0 device-authorization flow as orchestrated by OpenClaw. 
Phases 1–3 establish identity (the developer authenticates once, in a real browser, against Azure AD). Phases 4–5 perform service discovery (OpenClaw walks the ARM resource hierarchy, subscriptions → Foundry accounts → model deployments and persists the result to a local provider config). After registration, every model call OpenClaw makes against Foundry reuses the same Azure-CLI-managed token cache: tokens refresh transparently, and access is gated by the Foundry resource's RBAC assignments rather than a static API key. Dashed lines denote return values; the teal line in step 7 marks the single token-issuance event the rest of the system pivots on. Walking through the new flow: Start with the command to onboard openclaw as if you were setting up OpenClaw for the first time: openclaw onboard Kick things off with the OpenClaw onboard command, the same one you would use when setting up OpenClaw for the first time. When it prompts you, choose update values. Next, you will be asked to configure your models. Scroll down a little and you will see Microsoft Foundry listed as a supported provider. Pick it. From here, you have two options. You can sign in with an API key, which is what I covered in the previous blog post, or you can sign in through Azure AD. The Azure AD path is easier and more secure, so that is the one we will use. OpenClaw will give you a URL and a device code. Copy the URL into your browser and use the code to complete the sign in. (This is where the az CLI from the pre-requisite section earns its keep.) If everything worked, you should see a success prompt similar to this: Once you are signed in, OpenClaw will ask you to pick the Azure subscription that your Microsoft Foundry resource lives in. Pick the subscription, then pick the Foundry resource where your models are deployed. And that is pretty much it. All the models you have deployed to that Foundry resource get pulled into OpenClaw automatically. 
Compared to the old way of pasting API keys and endpoints one by one, this is a huge time saver, and you do not have to babysit any keys. From here you can start using your Foundry-deployed models inside OpenClaw straight away: Wrapping up The Azure AD sign-in option in OpenClaw is one of those small updates that quietly removes a real pain point. If you have ever juggled multiple Foundry endpoints and rotated keys across them, you already know why. With this flow, you sign in once, your models show up, and you can get back to actually building. If you have not tried OpenClaw with Microsoft Foundry yet, this is a good time to give it a go. And if you were holding off because of the key management overhead, that excuse is gone now. References Previous post on integrating Microsoft Foundry with OpenClaw using API keys: Integrating Microsoft Foundry with OpenClaw: Step by Step Model Configuration | Microsoft Community Hub Install the Azure CLI on Linux: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-linux?view=azure-cli-latest Install the Azure CLI on macOS: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-macos?view=azure-cli-latest#install-with-homebrew Homebrew formula for azure-cli: https://formulae.brew.sh/formula/azure-cli
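To make the discovery phase more concrete, the hierarchy walk (subscriptions → Foundry resources → model deployments) can be approximated with the Azure CLI itself. The sketch below is an illustration of that idea under stated assumptions and is not OpenClaw's actual implementation: `run_az` and `discover_deployments` are hypothetical helper names, and the code assumes `az login` has already been completed and that Foundry resources are visible through `az cognitiveservices account list`.

```python
import json
import subprocess


def run_az(args):
    """Run an Azure CLI command and parse its JSON output."""
    completed = subprocess.run(
        ["az", *args, "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)


def discover_deployments(run=run_az):
    """Walk subscriptions -> AI resources -> model deployments.

    Returns a list of (subscription_id, resource_name, deployment_name).
    The `run` parameter is injectable so the walk can be tried offline.
    """
    found = []
    for sub in run(["account", "list"]):
        accounts = run(["cognitiveservices", "account", "list",
                        "--subscription", sub["id"]])
        for account in accounts:
            deployments = run([
                "cognitiveservices", "account", "deployment", "list",
                "--name", account["name"],
                "--resource-group", account["resourceGroup"],
                "--subscription", sub["id"],
            ])
            for deployment in deployments:
                found.append((sub["id"], account["name"], deployment["name"]))
    return found
```

Calling `discover_deployments()` with no arguments uses the real CLI and returns whatever the signed-in identity is allowed to see, which is also a quick way to check RBAC assignments before running the OpenClaw onboarding.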
suzarilshah
May 20, 2026 Place Educator Developer Blog
293Views
0likes
0Comments
Power Up Your Open WebUI with Azure AI Speech: Quick STT & TTS Integration
Introduction Ever found yourself wishing your web interface could really talk and listen back to you? With a few clicks (and a bit of code), you can turn your plain Open WebUI into a full-on voice assistant. In this post, you’ll see how to spin up an Azure Speech resource, hook it into your frontend, and watch as user speech transforms into text and your app’s responses leap off the screen in a human-like voice. By the end of this guide, you’ll have a voice-enabled web UI that actually converses with users, opening the door to hands-free controls, better accessibility, and a genuinely richer user experience. Ready to make your web app speak? Let’s dive in. Why Azure AI Speech? We use Azure AI Speech service in Open Web UI to enable voice interactions directly within web applications. This allows users to: Speak commands or input instead of typing, making the interface more accessible and user-friendly. Hear responses or information read aloud, which improves usability for people with visual impairments or those who prefer audio. Provide a more natural and hands-free experience especially on devices like smartphones or tablets. In short, integrating Azure AI Speech service into Open Web UI helps make web apps smarter, more interactive, and easier to use by adding speech recognition and voice output features. If you haven’t hosted Open WebUI already, follow my other step-by-step guide to host Ollama WebUI on Azure. Proceed to the next step if you have Open WebUI deployed already. Learn More about OpenWeb UI here. Deploy Azure AI Speech service in Azure. Navigate to the Azure Portal and search for Azure AI Speech on the Azure portal search bar. Create a new Speech Service by filling up the fields in the resource creation page. Click on “Create” to finalize the setup. After the resource has been deployed, click on “View resource” button and you should be redirected to the Azure AI Speech service page. 
The page should display the API keys and endpoints for the Azure AI Speech service, which you can use in Open WebUI.

Setting things up in Open WebUI

Speech to Text settings (STT)

Head to the Open WebUI Admin page > Settings > Audio. Paste the API key obtained from the Azure AI Speech service page into the API key field. Unless you use a different Azure region, or want to change the default configuration for the STT settings, leave all other settings blank.

Text to Speech settings (TTS)

Now, let's proceed with configuring the TTS settings in Open WebUI by switching the TTS Engine to the Azure AI Speech option. Again, paste the API key obtained from the Azure AI Speech service page and leave all other settings blank. You can change the TTS voice from the dropdown selection in the TTS settings. Click Save to apply the change.

Expected Result

Now, let’s test if everything works well. Open a new chat / temporary chat in Open WebUI and click on the Call / Record button. The STT engine (Azure AI Speech) should recognize your voice and provide a response based on the voice input. To test the TTS feature, click on Read Aloud (the speaker icon) under any response from Open WebUI. The response should be read out using the Azure AI Speech service!

Conclusion

And that’s a wrap! You’ve just given your Open WebUI the gift of capturing user speech, turning it into text, and then talking right back with Azure’s neural voices. Along the way you saw how easy it is to spin up a Speech resource in the Azure portal, wire up real-time transcription in the browser, and pipe responses through the TTS engine. From here, it’s all about experimentation. Try swapping in different neural voices or dialing in new languages. Tweak how you start and stop listening, play with silence detection, or add custom pronunciation tweaks for those tricky product names. Before you know it, your interface will feel less like a web page and more like a conversation partner.
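If the voice features do not respond, it can help to check the key and region outside Open WebUI. The sketch below builds a text-to-speech request against the Azure AI Speech REST endpoint using only the standard library. It is a minimal sketch under stated assumptions: `build_tts_request` is a hypothetical helper, and the region, the `en-US-JennyNeural` voice, and the output format are example values to replace with your own.

```python
import urllib.request
from xml.sax.saxutils import escape


def build_tts_request(region, key, text, voice="en-US-JennyNeural"):
    """Build (but do not send) a text-to-speech request for Azure AI Speech."""
    # The REST endpoint expects SSML; escape the text so that characters
    # such as & and < do not break the XML.
    ssml = (
        "<speak version='1.0' xml:lang='en-US'>"
        f"<voice name='{voice}'>{escape(text)}</voice>"
        "</speak>"
    )
    return urllib.request.Request(
        url=f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
        data=ssml.encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "audio-16khz-32kbitrate-mono-mp3",
            "User-Agent": "open-webui-speech-check",  # example value
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(build_tts_request("eastus", "<your_key>", "Hello"))` and writing the response bytes to an `.mp3` file should produce audio when the key and region match; an HTTP 401 usually points to a wrong key or region.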
suzarilshah
Apr 13, 2026 Place Educator Developer Blog
2.5KViews
3likes
2Comments
ProvePresent: Ending Proxy Attendance with Azure Serverless & Azure OpenAI
Problem Most schools use a smart‑card‑based attendance system where students tap their cards on a reader. However, this method is unreliable because students can give their cards to friends or simply tap and leave immediately. Teachers cannot accurately assess real student performance—whether high‑performing students are genuinely attending class or whether poor performance is due to actual absence. Another issue is that even if students are physically present in a lecture, teachers still cannot tell whether they are paying attention to the projector or actually learning. The current workaround is for teachers to override the attendance record by calling each student one by one, which is time‑consuming in large lectures and adds little educational value. It is also only a one‑time check, meaning students can still leave the lecture room immediately afterwards. Another issue is that we have many out‑of‑school activities such as site visit, and the school needs to ensure everyone’s presence promptly in each check point. This kind of problem isn’t unique to schools. It’s a common challenge for event organizers, where verifying attendee presence is essential but often slow, causing long queues. Organizers usually rely on a few mobile scanners to check in attendees one by one. Solution ProvePresent is an AI tool designed to verify attendance and create real‑time challenges for participants, ensuring that attendance records are authentic and that attendees remain focused on the presentation. It uses OTP login with school email. Check-in and Check-out With a Real‑time QR Code The code refreshes every 25 seconds, and the presenter can display it on the projector for everyone to scan when checking in at the beginning and checking out at the end of the session. 
However, this alone cannot prevent someone from capturing the code and sending it to others who are not in the room, or from using two devices to help someone else scan for attendance—even if geolocation checks are enabled. We will explain this next. This check‑in and check‑out process is highly scalable, and no one needs to queue while waiting for someone to scan their QR code! Organizers can set geolocation restrictions to prevent anyone from checking in remotely in a simple manner. Keep Attendee Alive with Signalr The SignalR live connection allows the presenter to create real‑time challenges for attendees, helping to verify their presence and ensure they are genuinely focused on the presentation. AI Powered Live Quiz The presenter shares their presentation screen, and two Microsoft Foundry agents with Azure OpenAI Chatgpt 5.3 —ImageAnalysisAgent, which extracts key information from the shared screen, and QuizQuestionGenerator, which generates simple questions based on the current slide—work together to create challenges. The question is broadcast to all online attendees, who must answer within 20 seconds. This feature keeps attendees on the webpage and prevents them from doing anything unrelated to the presentation. Detailed report can be downloaded for further analysis. Attendee Photo Capture Request all online students to capture and upload photos of their venue view. The system will analyze the images to estimate seating positions using Microsoft Foundry agents with Azure OpenAI ChatGPT 5.3 PositionEstimationAgent and complete an image challenge. When the presenter clicks Capture Attendee Photos, all online attendees are prompted to take a photo and upload it to blob storage. The PositionEstimationAgent then analyzes the image to estimate their seating location, which can provide insights into student performance. Analysis Notes: Analyzed 13 students in 2 overlapping batches. 
Batch 1: The venue is a computer lab with the projector screen at the front center, whiteboards on the left, and cabinets on the right. Relative depth was estimated mainly from screen size and number of monitor rows visible ahead. Column estimates were inferred from screen angle and side-room features, with lower confidence for the rotated side-view image. Batch 2: These six photos appear to come from the same computer lab with the projector at the front center. Relative depth was estimated mainly from projector size and number of visible desk/monitor rows ahead. Left-right placement was inferred from projector skew and side-wall visibility. Within this batch, 240124734 and 240167285 seem closest to the front, 240286514 and 240158424 are slightly farther back, 240293498 is farther back again, and 240160364 appears furthest. Pass around the QR code attendance sheet Traditionally, the attendance sheet is circulated for attendees to sign, but this method is unreliable because no one monitors the signing process, allowing one attendee to sign for someone who is absent. It is also slow and not scalable for large groups. The QR Code attendance sheet functions as a chain. The presenter randomly distributes a short‑lived, one‑time QR code—representing a virtual attendance sheet—to any number of attendees, just like handing out multiple physical sheets. Each attendee must find another participant to scan their code to record attendance, continuing the chain until the final group of attendees. The presenter then verifies the last group’s presence. The first chain is a dead chain because that student left the venue and cannot find another student to scan his QR code. The second chain contains 20 student attendance records. It also provides useful insights into their friendship and seating patterns. Architecture This project is built using Vibe Coding, so we will not share highly technical details in this post. 
If you'd like to learn more, leave a comment, and we will write another blog to cover the specifics.

GitHub Repo

https://github.com/wongcyrus/ProvePresent

Conclusion

ProvePresent demonstrates how Azure serverless technology and Azure OpenAI can work together to solve a long-standing problem in education: verifying genuine student presence and engagement. By combining real-time QR code verification, SignalR-powered live interactions, AI-generated quizzes, and intelligent photo-based seating analysis, we created a system where “being present” is no longer just a checkbox; it becomes a verifiable, interactive, and meaningful part of the learning experience. Instead of relying on outdated smart-card systems or manual roll calls, educators gain a dynamic tool that keeps students attentive, provides insight into classroom behavior, and produces useful analytics for improving teaching outcomes. Students, in turn, benefit from an engaging, modern attendance experience that aligns with how digital-native learners expect classes to operate.

This is only the beginning. With Microsoft Foundry agents and the flexibility of Azure Functions, there are many opportunities to extend ProvePresent further: richer analytics, smarter engagement models, and seamless integration with LMS platforms. If there’s interest, we’re happy to share more technical details, architectural deep dives, and future roadmap ideas in a follow-up post.

Thank you to the Microsoft Student Ambassadors from the Hong Kong Institute of Information Technology (HKIIT) for their contributions: Wong Wing Ho, CHAN Sham Jayson, Pang Ho Shum, and Chan Ka Chun. They are majoring in the Higher Diploma in Cloud and Data Centre Administration.

About the Author

Cyrus Wong is a senior lecturer at the Hong Kong Institute of Information Technology (HKIIT) @ IVE (Lee Wai Lee), and he focuses on teaching public cloud technologies. He is a passionate advocate for the adoption of cloud technology across various media and events.
With his extensive knowledge and expertise, he has earned prestigious recognitions such as AWS Builder Center, Microsoft MVP (Microsoft Foundry), and Google Developer Expert for Google Cloud Platform & AI.
cyruswong
Mar 19, 2026 Place Educator Developer Blog
240 Views
0 likes
0 Comments
Integrating Microsoft Foundry with OpenClaw: Step by Step Model Configuration
Step 1: Deploying Models on Microsoft Foundry

Let us kick things off in the Azure portal. To get our OpenClaw agent thinking like a genius, we need to deploy our models in Microsoft Foundry. For this guide, we are going to focus on deploying gpt-5.2-codex on Microsoft Foundry for use with OpenClaw.

Navigate to your AI Hub, head over to the model catalog, choose the model you wish to use with OpenClaw, and hit deploy. Once your deployment is successful, head to the endpoints section.

Important: Grab your Endpoint URL and your API Keys right now and save them in a secure note. We will need these exact values to connect OpenClaw in a few minutes.

Step 2: Installing and Initializing OpenClaw

Next up, we need to get OpenClaw running on your machine. Open up your terminal and run the official installation script:

    curl -fsSL https://openclaw.ai/install.sh | bash

The wizard will walk you through a few prompts. Here is exactly how to answer them to link up with our Azure setup:

- First Page (Model Selection): Choose "Skip for now".
- Second Page (Provider): Select azure-openai-responses.
- Model Selection: Select gpt-5.2-codex. For now, only the models hosted on Microsoft Foundry and listed in the picture below are available to be used with OpenClaw.

Follow the rest of the standard prompts to finish the initial setup.

Step 3: Editing the OpenClaw Configuration File

Now for the fun part. We need to manually configure OpenClaw to talk to Microsoft Foundry. Open your configuration file located at ~/.openclaw/openclaw.json in your favorite text editor.
Replace the contents of the models and agents sections with the following code block:

    {
      "models": {
        "providers": {
          "azure-openai-responses": {
            "baseUrl": "https://<YOUR_RESOURCE_NAME>.openai.azure.com/openai/v1",
            "apiKey": "<YOUR_AZURE_OPENAI_API_KEY>",
            "api": "openai-responses",
            "authHeader": false,
            "headers": {
              "api-key": "<YOUR_AZURE_OPENAI_API_KEY>"
            },
            "models": [
              {
                "id": "gpt-5.2-codex",
                "name": "GPT-5.2-Codex (Azure)",
                "reasoning": true,
                "input": ["text", "image"],
                "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
                "contextWindow": 400000,
                "maxTokens": 16384,
                "compat": { "supportsStore": false }
              },
              {
                "id": "gpt-5.2",
                "name": "GPT-5.2 (Azure)",
                "reasoning": false,
                "input": ["text", "image"],
                "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
                "contextWindow": 272000,
                "maxTokens": 16384,
                "compat": { "supportsStore": false }
              }
            ]
          }
        }
      },
      "agents": {
        "defaults": {
          "model": {
            "primary": "azure-openai-responses/gpt-5.2-codex"
          },
          "models": {
            "azure-openai-responses/gpt-5.2-codex": {}
          },
          "workspace": "/home/<USERNAME>/.openclaw/workspace",
          "compaction": { "mode": "safeguard" },
          "maxConcurrent": 4,
          "subagents": { "maxConcurrent": 8 }
        }
      }
    }

You will notice a few placeholders in that JSON. Here is exactly what you need to swap out:

| Placeholder Variable | What It Is | Where to Find It |
| --- | --- | --- |
| <YOUR_RESOURCE_NAME> | The unique name of your Azure OpenAI resource. | Found in your Azure Portal under the Azure OpenAI resource overview. |
| <YOUR_AZURE_OPENAI_API_KEY> | The secret key required to authenticate your requests. | Found in Microsoft Foundry under your project endpoints or Azure Portal keys section. |
| <USERNAME> | Your local computer's user profile name. | Open your terminal and type whoami to find this. |

Step 4: Restart the Gateway

After saving the configuration file, you must restart the OpenClaw gateway for the new Foundry settings to take effect.
Run this simple command:

    openclaw gateway restart

Configuration Notes & Deep Dive

If you are curious about why we configured the JSON that way, here is a quick breakdown of the technical details.

Authentication Differences

Azure OpenAI uses the api-key HTTP header for authentication. This is entirely different from the standard OpenAI Authorization: Bearer header. Our configuration file addresses this in two ways:

- Setting "authHeader": false completely disables the default Bearer header.
- Adding "headers": { "api-key": "<key>" } forces OpenClaw to send the API key via Azure's native header format.

Important Note: Your API key must appear in both the apiKey field AND the headers.api-key field within the JSON for this to work correctly.

The Base URL

Azure OpenAI's v1-compatible endpoint follows this specific format:

    https://<your_resource_name>.openai.azure.com/openai/v1

The beautiful thing about this v1 endpoint is that it is largely compatible with the standard OpenAI API and does not require you to manually pass an api-version query parameter.

Model Compatibility Settings

- "compat": { "supportsStore": false } disables the store parameter since Azure OpenAI does not currently support it.
- "reasoning": true enables the thinking mode for GPT-5.2-Codex. This supports low, medium, high, and xhigh levels.
- "reasoning": false is set for GPT-5.2 because it is a standard, non-reasoning model.

Model Specifications & Cost Tracking

If you want OpenClaw to accurately track your token usage costs, you can update the cost fields from 0 to the current Azure pricing.
Here are the specs and costs for the models we just deployed:

Model Specifications

| Model | Context Window | Max Output Tokens | Image Input | Reasoning |
| --- | --- | --- | --- | --- |
| gpt-5.2-codex | 400,000 tokens | 16,384 tokens | Yes | Yes |
| gpt-5.2 | 272,000 tokens | 16,384 tokens | Yes | No |

Current Cost (Adjust in JSON)

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cached Input (per 1M tokens) |
| --- | --- | --- | --- |
| gpt-5.2-codex | $1.75 | $14.00 | $0.175 |
| gpt-5.2 | $2.00 | $8.00 | $0.50 |

Conclusion

And there you have it! You have successfully bridged the gap between the enterprise-grade infrastructure of Microsoft Foundry and the local autonomy of OpenClaw. By following these steps, you are not just running a chatbot; you are running a sophisticated agent capable of reasoning, coding, and executing tasks with the full power of GPT-5.2-Codex behind it.

The combination of Azure's reliability and OpenClaw's flexibility opens up a world of possibilities. Whether you are building an automated DevOps assistant, a research agent, or just exploring the bleeding edge of AI, you now have a robust foundation to build upon.

Now it is time to let your agent loose on some real tasks. Go forth, experiment with different system prompts, and see what you can build. If you run into any interesting edge cases or come up with a unique configuration, let me know in the comments below. Happy coding!
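As a rough illustration of the cost figures in this post, the per-million-token prices can be turned into a quick estimate for a single request. This is only a sketch: the prices are copied from the cost table in this post and may be out of date, the function and field names (`estimate_cost`, `Pricing`) are hypothetical and not part of OpenClaw, and it assumes cached input tokens are billed separately from regular input tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD prices per 1M tokens (values taken from the post's cost table)."""
    input_per_m: float
    output_per_m: float
    cached_input_per_m: float


# Assumption: these mirror the table in the post; check current Azure pricing.
PRICING = {
    "gpt-5.2-codex": Pricing(1.75, 14.00, 0.175),
    "gpt-5.2": Pricing(2.00, 8.00, 0.50),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0) -> float:
    """Return an estimated USD cost for one request.

    Assumes cached input tokens are counted separately from input_tokens.
    Raises KeyError if the model is not in PRICING.
    """
    p = PRICING[model]
    total_micro = (
        input_tokens * p.input_per_m
        + output_tokens * p.output_per_m
        + cached_input_tokens * p.cached_input_per_m
    )
    return total_micro / 1_000_000


if __name__ == "__main__":
    cost = estimate_cost("gpt-5.2-codex", input_tokens=200_000,
                         output_tokens=10_000, cached_input_tokens=50_000)
    print(f"Estimated cost: ${cost:.5f}")
```

The same per-million figures are what you would enter in the cost fields of the JSON if you want OpenClaw to do this tracking for you.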
suzarilshah
Mar 06, 2026 Place Educator Developer Blog
11K Views
2 likes
2 Comments