<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Blog articles</title>
    <link>https://techcommunity.microsoft.com/t5/blog/bg-p/bff5f527-af54-4e01-be5c-609acdd7f285</link>
    <description>Blog articles</description>
    <pubDate>Wed, 13 May 2026 18:28:49 GMT</pubDate>
    <dc:creator>bff5f527-af54-4e01-be5c-609acdd7f285</dc:creator>
    <dc:date>2026-05-13T18:28:49Z</dc:date>
    <item>
      <title>The Rise of vLLM in Modern Cloud Development: Revolutionizing AI Inference</title>
      <link>https://techcommunity.microsoft.com/t5/blog/the-rise-of-vllm-in-modern-cloud-development-revolutionizing-ai/ba-p/4499179</link>
      <description>&lt;H3&gt;Introduction to the AI Cloud Bottleneck&lt;/H3&gt;&lt;P&gt;The emergence of Large Language Models (LLMs) has revolutionized cloud applications, from universal chatbots to automated programming assistants. However, hosting these models in the cloud is notoriously expensive due to the massive computational and memory requirements of hardware accelerators like GPUs. During text generation, models autoregressively produce tokens one at a time, relying on a dynamically growing Key-Value (KV) cache that acts as the model's short-term memory.&lt;/P&gt;&lt;P&gt;Traditional inference systems store this KV cache in contiguous memory blocks, pre-allocating space for the maximum potential request length. This approach results in severe memory fragmentation and reservation waste, leaving up to 80% of GPU memory unused and crippling the system's ability to handle large batches of concurrent users.&lt;/P&gt;&lt;H3&gt;The Core Innovation: PagedAttention&lt;/H3&gt;&lt;P&gt;To solve this memory bottleneck, researchers developed vLLM, an open-source inference engine built around a breakthrough algorithm called PagedAttention. Inspired by how operating systems manage virtual memory via paging, PagedAttention divides the KV cache into small, fixed-size blocks (pages) that do not need to be stored contiguously in physical memory.&lt;/P&gt;&lt;P&gt;By allocating memory blocks on demand as tokens are generated, vLLM practically eliminates external fragmentation and minimizes internal fragmentation. This highly efficient memory management limits memory waste to under 4%, allowing the system to batch significantly more requests concurrently. 
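&lt;/P&gt;&lt;P&gt;A toy sketch of the accounting (illustrative only, not vLLM's internals): contiguous allocation reserves the maximum sequence length for every request up front, while paged allocation grabs fixed-size blocks only as tokens actually arrive.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Toy KV-cache accounting (illustrative, not vLLM internals).
BLOCK_SIZE = 16  # tokens per block (page)

def contiguous_slots(seq_lens, max_len):
    # Traditional approach: pre-reserve max_len slots per request.
    return len(seq_lens) * max_len

def paged_slots(seq_lens):
    # PagedAttention-style: allocate whole fixed-size blocks on demand.
    blocks = sum(-(-n // BLOCK_SIZE) for n in seq_lens)  # ceil division
    return blocks * BLOCK_SIZE

lens = [37, 120, 9, 301]
print(contiguous_slots(lens, max_len=2048))  # 8192 slots reserved
print(paged_slots(lens))                     # 496 slots actually allocated&lt;/LI-CODE&gt;&lt;P&gt;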
As a result, vLLM delivers up to 24x higher throughput than standard Hugging Face Transformers and up to 3.5x higher throughput than Hugging Face's Text Generation Inference (TGI), all without requiring any changes to the underlying model architecture.&lt;/P&gt;&lt;H3&gt;Advanced Features Powering the Cloud&lt;/H3&gt;&lt;P&gt;Modern cloud development requires speed, scalability, and hardware flexibility. vLLM accelerates enterprise AI pipelines through several specialized optimizations:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Continuous Batching: Instead of waiting for a static batch of requests to completely finish, vLLM dynamically injects new requests the moment an existing sequence completes, keeping GPU utilization consistently high.&lt;/LI&gt;&lt;LI&gt;Speculative Decoding: vLLM integrates state-of-the-art speculative decoding techniques, such as Eagle 3, which uses a smaller, faster "draft" model to predict tokens before the main model verifies them. This can boost inference speeds by up to 2.5x.&lt;/LI&gt;&lt;LI&gt;Automatic Prefix Caching &amp;amp; Memory Sharing: For applications with shared system prompts or multi-step reasoning (like beam search), vLLM allows different sequences to share the same KV cache blocks. This is highly beneficial for Retrieval-Augmented Generation (RAG) and multi-round chat workloads.&lt;/LI&gt;&lt;LI&gt;Quantization Support: Cloud developers can leverage 8-bit or 4-bit quantization (like GPTQ or AWQ) to shrink massive models, allowing them to fit onto smaller, more cost-effective cloud GPUs.&lt;/LI&gt;&lt;/UL&gt;&lt;H3&gt;Enterprise Deployment and Cloud Orchestration&lt;/H3&gt;&lt;P&gt;From an infrastructure perspective, vLLM is built for modern cloud-native deployment. 
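&lt;/P&gt;&lt;P&gt;Because the server speaks the OpenAI wire protocol, a standard openai client can talk to it unchanged. A minimal sketch, assuming a vLLM server is already listening on localhost:8000 and serving the Llama-3.1-8B model from the quickstart below:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Minimal client for a local vLLM server (assumes one is already running on
# localhost:8000; vLLM ignores the API key, but the client requires a value).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask(prompt, max_tokens=64):
    completion = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return completion.choices[0].message.content

if __name__ == "__main__":
    print(ask("Explain PagedAttention in one sentence."))&lt;/LI-CODE&gt;&lt;P&gt;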
It provides a production-ready server that mimics the OpenAI API protocol, allowing developers to use it as a drop-in replacement in existing applications, including those built on frameworks like LangChain.&lt;/P&gt;&lt;P&gt;For large-scale, cluster-wide deployments, vLLM integrates seamlessly with Kubernetes. The vLLM production stack offers Helm charts, Prometheus and Grafana for observability metrics (such as Time-to-First-Token and GPU KV usage), and smart request routing to distribute workloads effectively across backend GPUs.&lt;/P&gt;&lt;P&gt;Furthermore, vLLM is hardware agnostic, meaning cloud engineers can deploy it across NVIDIA, AMD, Google TPUs, or AWS Neuron chips depending on their cloud provider. Serverless platforms like Modal and Runpod also natively support vLLM, allowing teams to instantly spin up autoscaling endpoints without managing idle GPU costs.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Deploying vLLM with Docker is the standard way to ensure your environment has the correct CUDA drivers and dependencies without manual configuration.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Single-GPU Deployment (The Quickstart)&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Use this command to spin up an OpenAI-compatible server. This example uses the Llama-3.1-8B model.&lt;/P&gt;&lt;LI-CODE lang="bash"&gt;docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct &lt;/LI-CODE&gt;&lt;P&gt;What these flags do:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;--runtime nvidia: Enables GPU access (ensure the NVIDIA Container Toolkit is installed).&lt;/LI&gt;&lt;LI&gt;-v ...: Mounts your local Hugging Face cache so you don't re-download the model every time the container restarts.&lt;/LI&gt;&lt;LI&gt;--env "HF_TOKEN=$HF_TOKEN": Passes your Hugging Face token into the container, required for gated models like Llama.&lt;/LI&gt;&lt;LI&gt;-p 8000:8000: Publishes the OpenAI-compatible API port.&lt;/LI&gt;&lt;LI&gt;--ipc=host: Shares the host's IPC namespace so the container has enough shared memory for PyTorch's multi-process workers.&lt;/LI&gt;&lt;LI&gt;--model: The Hugging Face model ID to serve.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Multi-GPU Deployment (Docker Compose)&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;For production environments or massive models (like a 70B-parameter model) that require multiple GPUs, use docker-compose.yml.&lt;/P&gt;&lt;LI-CODE lang="yaml"&gt;services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm-server
    environment:
      - HF_TOKEN=${HF_TOKEN}
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all # Uses all available GPUs
              capabilities: [gpu]
    command: &amp;gt;
      --model meta-llama/Llama-3.3-70B-Instruct
      --tensor-parallel-size 4
      --max-model-len 4096&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;Real-World Enterprise Impact&lt;/H3&gt;&lt;P&gt;Major tech companies are actively leveraging vLLM to scale their cloud AI features:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Roblox deployed vLLM to serve over 4 billion tokens a week for their AI assistant, achieving a 50% reduction in latency.&lt;/LI&gt;&lt;LI&gt;LinkedIn uses vLLM’s continuous batching and shared prefix caching to power its Hiring Assistant, handling thousands of candidate profiles with heavy prompt overlaps and improving token generation times by 7%.&lt;/LI&gt;&lt;LI&gt;Amazon integrated vLLM into a multi-node architecture to support its Rufus shopping assistant, dynamically distributing inference across the cloud to handle millions of customer queries without performance drops.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In conclusion, vLLM is redefining modern cloud development by turning memory-bound, resource-heavy LLMs into scalable, cost-efficient microservices.&lt;/P&gt;&lt;H3&gt;Which one to choose?&lt;/H3&gt;&lt;P&gt;To help you choose the right tool for your specific cloud environment, here is a comparison of vLLM against the other two industry heavyweights: Hugging Face Text Generation Inference (TGI) and NVIDIA TensorRT-LLM.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Choose vLLM if&lt;/STRONG&gt;: You need to serve a large number of concurrent users on a budget. It is the most flexible option if you want to avoid vendor lock-in and deploy across different cloud providers (e.g., switching between AWS G5 instances and Google Cloud TPUs).&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Choose TGI if&lt;/STRONG&gt;: Your infrastructure is already built around the Hugging Face ecosystem and you prioritize production stability. 
It is particularly strong for long-context RAG applications where you need to cache massive system prompts (like internal legal databases) across multiple requests.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Choose TensorRT-LLM if&lt;/STRONG&gt;: You are chasing the absolute lowest possible latency (e.g., real-time voice AI) and you have committed entirely to high-end NVIDIA hardware like H100s or B200s. It requires more engineering effort to compile models, but it squeezes every drop of power out of the GPU.&lt;/P&gt;&lt;H3&gt;Conclusion&lt;/H3&gt;&lt;P&gt;In short, &lt;STRONG&gt;vLLM is the bridge between AI research and cloud-scale reality&lt;/STRONG&gt;. By treating GPU memory with the same logic as a modern operating system, it has effectively solved the "fragmentation crisis" that once made high-performance inference prohibitively expensive. For developers and enterprises, this means the ability to serve more users, on more diverse hardware, at a fraction of the previous cost—all without sacrificing the flexibility of open-source models.&lt;/P&gt;</description>
      <pubDate>Tue, 03 Mar 2026 23:34:52 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/blog/the-rise-of-vllm-in-modern-cloud-development-revolutionizing-ai/ba-p/4499179</guid>
      <dc:creator>KonstantinosPassadis</dc:creator>
      <dc:date>2026-03-03T23:34:52Z</dc:date>
    </item>
    <item>
      <title>How We Built an AI Operations Agent Using MCP Servers and Dynamic Tool Routing</title>
      <link>https://techcommunity.microsoft.com/t5/blog/how-we-built-an-ai-operations-agent-using-mcp-servers-and/ba-p/4491644</link>
      <description>&lt;P&gt;In this post, we’re going to tackle a massive challenge in the agent space:&amp;nbsp;&lt;STRONG&gt;Safety and Visibility&lt;/STRONG&gt;. We are going to build a practical demo that connects two distinct MCP servers to a single agent service using the&amp;nbsp;&lt;STRONG&gt;Microsoft Agents SDK&lt;/STRONG&gt;&amp;nbsp;and&amp;nbsp;&lt;STRONG&gt;Azure OpenAI&lt;/STRONG&gt;. To top it off, we’ll wrap it all in a lightweight web UI (&lt;STRONG&gt;AG-UI&lt;/STRONG&gt;) that streams text, traces tool calls, and—crucially—gates state-changing actions behind&amp;nbsp;&lt;STRONG&gt;human approval&lt;/STRONG&gt;.&lt;/P&gt;&lt;H2&gt;The Problem: Why Do We Need This?&amp;nbsp;&amp;nbsp;&lt;/H2&gt;&lt;P&gt;As agent-based applications get more complex, we start hitting the same headaches over and over. We want agents to work with real backends, but we keep running into familiar pitfalls.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The “Black Box” Issue: Tool calls happen out of sight, so users have no idea what the agent is doing—instantly killing trust.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Tangled Logic: Backend logic gets crammed into prompts, turning into messy spaghetti that’s hard to test, deploy, and improve.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Unsafe Writes: An agent might update a database or delete a file without any human in the loop.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Our Goal: Keep backends modular with MCP, centralize the agent’s “brain,” and give users a UI that makes every tool action clear and trustworthy.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The Pitch: Backends stay as MCP tools, the agent brain lives in one service, and the UI makes tool activity fully transparent.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;H2&gt;The Architecture&lt;/H2&gt;&lt;P&gt;To solve this, we are using a&amp;nbsp;&lt;STRONG&gt;microservices 
approach&lt;/STRONG&gt;&amp;nbsp;with&amp;nbsp;&lt;STRONG&gt;Azure&lt;/STRONG&gt; at its core.&lt;/P&gt;&lt;img /&gt;&lt;H3&gt;High-Level Components&lt;/H3&gt;&lt;P&gt;&lt;STRONG&gt;Policy MCP Server:&lt;/STRONG&gt; Connects to Azure Blob Storage and serves as the source of truth for policy documents.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Order MCP Server:&lt;/STRONG&gt; Connects to Azure SQL, managing structured order data.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Agent + AG-UI Service (FastAPI):&lt;/STRONG&gt; The core of the system, linking to the MCP servers, running the agent through the Microsoft Agents SDK, and streaming events directly to the browser.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Web UI:&lt;/STRONG&gt; A straightforward HTML/CSS/JavaScript frontend that displays the chat experience, tool traces, images, and human-approval cards.&lt;/P&gt;&lt;H3&gt;The Data Flow (Mental Model)&lt;/H3&gt;&lt;P&gt;Understanding the flow is key for effective debugging and observability. 
Here’s how a single request moves through the system:&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Prompt&lt;/STRONG&gt;: The browser sends a user prompt to the Agent Service.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Stream&lt;/STRONG&gt;: The Agent Service instantly streams events back to the UI, including assistant text (token-by-token), tool-call traces (arguments and results), and custom UI elements like image cards or approval requests.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Execution&lt;/STRONG&gt;: The Agent Service calls the appropriate MCP tools (Policy or Orders) via SSE and JSON-RPC.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Guardrails&lt;/STRONG&gt;: For tools that change state (like updating an order), the agent pauses and explicitly requests human approval before proceeding.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;img /&gt;&lt;H5&gt;Sample Client (AGUI) Code&lt;/H5&gt;&lt;LI-CODE lang="python"&gt;    # Convenience: if a tool returns an image URL (or JSON containing one), emit an AG-UI Custom event
    # so clients can render it as a rich card.
    def _looks_like_image_url(value: str) -&amp;gt; bool:
        v = value.lower().split("?")[0].split("#")[0]
        if not (v.startswith("http://") or v.startswith("https://")):
            return False
        return v.endswith((".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"))

    image_url: str | None = None
    if isinstance(result, str) and _looks_like_image_url(result.strip()):
        image_url = result.strip()
    else:
        try:
            parsed = json.loads(result)
            if isinstance(parsed, dict):
                for k in ("imageUrl", "image_url", "url", "image"):
                    v = parsed.get(k)
                    if isinstance(v, str) and _looks_like_image_url(v.strip()):
                        image_url = v.strip()
                        break
        except Exception:
            pass

    if image_url:
        emit({"type": "Custom", "name": "image", "value": {"url": image_url, "alt": tool_name}})

    emit({"type": "StepFinished", "stepName": step_name})&lt;/LI-CODE&gt;&lt;H5&gt;Sample Server (Policy Documents MCP Server)&lt;/H5&gt;&lt;LI-CODE lang="python"&gt;@_server.call_tool()
async def call_tool(
    name: str, arguments: dict
) -&amp;gt; list[types.TextContent | types.ImageContent | types.EmbeddedResource]:
    if name == "list_policy_documents":
        try:
            client = _blob_service_client()
            storage = _safe_storage_info(client)
            available = _list_blobs(limit=200)
            return [
                types.TextContent(
                    type="text",
                    text=json.dumps({"storage": storage, "available": available}, ensure_ascii=False),
                )
            ]
        except Exception as e:
            return [types.TextContent(type="text", text=f"Error listing policies: {type(e).__name__}: {e}")]

    if name == "read_policy_document":
        requested = arguments.get("doc_name")
        if not requested:
            return [types.TextContent(type="text", text="Error: doc_name is required.")]

        doc_name = _name_map(requested)

        try:
            client = _blob_service_client()
            container = client.get_container_client(CONTAINER_NAME)
            if not container.exists():
                storage = _safe_storage_info(client)
                return [
                    types.TextContent(
                        type="text",
                        text=(
                            "Policy container not found. "
                            + json.dumps({"storage": storage}, ensure_ascii=False)
                        ),
                    )
                ]

            blob_client = container.get_blob_client(doc_name)
            if not blob_client.exists():
                storage = _safe_storage_info(client)
                available = _list_blobs(limit=50)
                return [
                    types.TextContent(
                        type="text",
                        text=(
                            f"Document '{doc_name}' not found in policy library. "
                            + json.dumps({"storage": storage, "available": available}, ensure_ascii=False)
                        ),
                    )
                ]

            content = blob_client.download_blob().readall().decode("utf-8")
            return [types.TextContent(type="text", text=content)]

        except Exception as e:
            return [types.TextContent(type="text", text=f"Error accessing policy library: {type(e).__name__}: {e}")]

raise ValueError(f"Unknown tool: {name}")&lt;/LI-CODE&gt;&lt;H5&gt;Sample Server (Orders MCP Server)&lt;/H5&gt;&lt;LI-CODE lang="python"&gt;from mcp.server import Server
from mcp.server.sse import SseServerTransport
import mcp.types as types
import os
import pyodbc
import json

# Initialize MCP Server
mcp_server = Server("SQLOrderAgent")

# SQL Configuration
SQL_CONNECTION_STRING = os.getenv("SQL_CONNECTION_STRING")

def get_db_connection():
    if not SQL_CONNECTION_STRING:
        raise ValueError("SQL_CONNECTION_STRING environment variable is not set.")
    return pyodbc.connect(SQL_CONNECTION_STRING)

def dict_from_row(cursor):
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def _column_exists(conn: pyodbc.Connection, table_name: str, column_name: str) -&amp;gt; bool:
    cursor = conn.cursor()
    cursor.execute(
        """
        SELECT 1
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_NAME = ? AND COLUMN_NAME = ?
        """,
        (table_name, column_name),
    )
    return cursor.fetchone() is not None

@mcp_server.list_tools()
async def list_tools() -&amp;gt; list[types.Tool]:
    return [
        types.Tool(
            name="get_order_details",
            description="Queries the SQL database for order details (status, priority, category, and optional fields like photo/address/remarks if present).",
            inputSchema={
                "type": "object",
                "properties": {
                    "order_id": {"type": "string", "description": "The ID of the order."}
                },
                "required": ["order_id"]
            }
        ),
        types.Tool(
            name="get_order_address",
            description="Returns an order's delivery address (and remarks if available) from the SQL database.",
            inputSchema={
                "type": "object",
                "properties": {
                    "order_id": {"type": "string", "description": "The ID of the order."}
                },
                "required": ["order_id"]
            },
        ),&lt;/LI-CODE&gt;&lt;H3&gt;Scenarios&lt;/H3&gt;&lt;H4&gt;Scenario 1: Image Rendering (Read-Only)&lt;/H4&gt;&lt;P&gt;&lt;STRONG&gt;Prompt:&lt;/STRONG&gt;&amp;nbsp;“Show me the photo for order 5390”&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What happens:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The agent calls&amp;nbsp;get_order_photo.&lt;/LI&gt;&lt;LI&gt;The UI receives a&amp;nbsp;&lt;STRONG&gt;Custom:image&lt;/STRONG&gt;&amp;nbsp;event.&lt;/LI&gt;&lt;LI&gt;An image card is rendered directly inside the chat.&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;Scenario 2: Approval Gating (Human-in-the-Loop)&lt;/H4&gt;&lt;P&gt;&lt;STRONG&gt;Prompt:&lt;/STRONG&gt;&amp;nbsp;“Set the photo for order 5390 to https://example.com/new_photo.jpg”&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What happens:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The agent detects a&amp;nbsp;&lt;STRONG&gt;write operation&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;An&amp;nbsp;&lt;STRONG&gt;Approval Card&lt;/STRONG&gt;&amp;nbsp;appears in the UI.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Approve:&lt;/STRONG&gt;&amp;nbsp;Executes&amp;nbsp;set_order_photo.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Reject:&lt;/STRONG&gt;&amp;nbsp;Cancels the action entirely.&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;Scenario 3: Policy Lookup&lt;/H4&gt;&lt;P&gt;&lt;STRONG&gt;Prompt:&lt;/STRONG&gt;&amp;nbsp;“List available policy docs, then read the hazardous policy.”&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What happens:&lt;/STRONG&gt;&amp;nbsp;The agent queries the&amp;nbsp;&lt;STRONG&gt;Policy MCP Server&lt;/STRONG&gt; (Azure Blob Storage), lists the available files, and then reads the specific document you requested.&lt;/P&gt;&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;img /&gt;&lt;H2&gt;Lessons Learned &amp;amp; Design Patterns&amp;nbsp;&amp;nbsp;&lt;/H2&gt;&lt;P&gt;Building this demo revealed some key insights for taking agentic systems to 
production.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;MCP as a Boundary: &lt;/STRONG&gt;MCP servers help keep domain tools cleanly separated, with clear ownership—Policy can run their own MCP server, and Orders can manage theirs independently.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Trust through Visibility: &lt;/STRONG&gt;Streaming tool traces, including arguments and results, is crucial for smooth debugging and building genuine user trust.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;First-Class Approval: &lt;/STRONG&gt;Human approval works best when it’s a dedicated UI event that the frontend understands and enforces.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;Operational Tips&amp;nbsp;&amp;nbsp;&lt;/H3&gt;&lt;P&gt;&lt;STRONG&gt;Reuse Clients: &lt;/STRONG&gt;Don’t recreate Azure SDK clients on every request—initialize them once at startup and reuse them.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Log Diagnostics&lt;/STRONG&gt;: Always log tool latency and 429 errors to catch bottlenecks early.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Stable Schemas:&lt;/STRONG&gt; Keep inputs and outputs small, explicit, and well-defined to cut down on hallucinations and unpredictable behavior.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This post explored a practical approach to building safe, observable agentic systems with MCP, Microsoft Agents SDK, and Azure-native services. By splitting domain logic into MCP servers, centralizing the agent’s “brain,” and streaming every tool action through a human-aware UI, we showed how to replace opaque, risky behavior with trust, control, and visibility—treating human approval as a true first-class interaction for powerful, responsible automation.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 10 Feb 2026 20:49:22 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/blog/how-we-built-an-ai-operations-agent-using-mcp-servers-and/ba-p/4491644</guid>
      <dc:creator>KonstantinosPassadis</dc:creator>
      <dc:date>2026-02-10T20:49:22Z</dc:date>
    </item>
    <item>
      <title>Building an Agentic, AI-Powered Helpdesk with Agents Framework, Azure, and Microsoft 365</title>
      <link>https://techcommunity.microsoft.com/t5/blog/building-an-agentic-ai-powered-helpdesk-with-agents-framework/ba-p/4474756</link>
      <description>&lt;H2&gt;The High-Level Architecture&lt;/H2&gt;&lt;P&gt;The core idea is to decouple the system. Instead of one large application doing everything, we split the process into distinct, scalable components.&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Ingestion:&lt;/STRONG&gt;&amp;nbsp;A lightweight API endpoint simply to capture the request.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Decoupling:&lt;/STRONG&gt;&amp;nbsp;A message queue to hold the request for background processing.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Processing:&lt;/STRONG&gt;&amp;nbsp;An asynchronous worker that handles all the heavy lifting: AI enrichment, notifications, and decision-making.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Action:&lt;/STRONG&gt;&amp;nbsp;A set of automated actions that connect directly to our M365 tools.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Here’s the entire flow visualized as a flowchart:&lt;/P&gt;&lt;img /&gt;&lt;P class="lia-clear-both"&gt;&amp;nbsp;&lt;/P&gt;&lt;H2&gt;Step-by-Step Workflow Breakdown&lt;/H2&gt;&lt;P&gt;Let's dive into the details of each step.&lt;/P&gt;&lt;H3&gt;Ingestion: The FastAPI Endpoint&lt;/H3&gt;&lt;P&gt;The user's journey begins at a simple web form (built with&amp;nbsp;&lt;STRONG&gt;FastAPI&lt;/STRONG&gt;&amp;nbsp;and&amp;nbsp;&lt;STRONG&gt;Jinja2&lt;/STRONG&gt;). The form captures the essential details: Title, Description, Category, Priority, and the user's email.&lt;/P&gt;&lt;P&gt;When the user hits "Submit," the request hits our&amp;nbsp;POST /submit&amp;nbsp;endpoint. 
This endpoint does two things immediately:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Full Storage:&lt;/STRONG&gt;&amp;nbsp;It saves the complete entity (using&amp;nbsp;Category&amp;nbsp;as the&amp;nbsp;PartitionKey&amp;nbsp;and a&amp;nbsp;GUID&amp;nbsp;as the&amp;nbsp;RowKey) into&amp;nbsp;&lt;STRONG&gt;Azure Table Storage&lt;/STRONG&gt;&amp;nbsp;for a permanent record.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Compact Message:&lt;/STRONG&gt;&amp;nbsp;It sends a&amp;nbsp;&lt;EM&gt;compact&lt;/EM&gt;&amp;nbsp;JSON message (containing just the key info like&amp;nbsp;tableRow,&amp;nbsp;category,&amp;nbsp;priority, etc.) to an&amp;nbsp;&lt;STRONG&gt;Azure Service Bus&lt;/STRONG&gt;&amp;nbsp;queue named&amp;nbsp;'m365'.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;This "split" is crucial. The API responds instantly to the user ("Submitted!") without waiting for any complex processing. The entire "heavy" part of the job is now in the queue.&lt;/P&gt;&lt;H3&gt;The Asynchronous Worker &amp;amp; AI Enrichment&lt;/H3&gt;&lt;P&gt;A separate Python process is constantly listening to the&amp;nbsp;'m365'&amp;nbsp;Service Bus queue. When our new message arrives, the worker wakes up and:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Parses the compact message.&lt;/LI&gt;&lt;LI&gt;Uses the&amp;nbsp;partition&amp;nbsp;and&amp;nbsp;row&amp;nbsp;keys to fetch the full entity from&amp;nbsp;&lt;STRONG&gt;Azure Table Storage&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Calls our&amp;nbsp;enrich_helpdesk_entity&amp;nbsp;function, which is a wrapper for&amp;nbsp;&lt;STRONG&gt;Azure OpenAI&lt;/STRONG&gt;.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;This AI step is where the magic begins. We send a prompt with the user's raw data and ask the AI to return a clean JSON object with an improved&amp;nbsp;title, a concise&amp;nbsp;summary, and a calculated&amp;nbsp;urgency. 
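&lt;/P&gt;&lt;P&gt;A minimal sketch of that enrich-with-fallback pattern (hypothetical names; the real enrich_helpdesk_entity wraps Azure OpenAI). Injecting the model call as a plain callable keeps the fallback path easy to unit-test:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import json

def enrich_entity(entity, call_model):
    """Ask the model for an improved title/summary/urgency; fall back to raw input.

    call_model is any callable taking a prompt string and returning the model's
    raw text (in production, a thin wrapper around Azure OpenAI chat completions).
    """
    prompt = (
        "Return a JSON object with keys title, summary, urgency for this request:\n"
        + json.dumps({"Title": entity["Title"], "Description": entity["Description"]})
    )
    try:
        enriched = json.loads(call_model(prompt))
        return {
            "title": enriched["title"],
            "summary": enriched["summary"],
            "urgency": enriched["urgency"],
        }
    except Exception:
        # Graceful fallback: reuse the user's original input untouched.
        return {
            "title": entity["Title"],
            "summary": entity["Description"],
            "urgency": entity.get("Priority", "Medium"),
        }&lt;/LI-CODE&gt;&lt;P&gt;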
If the AI fails, it gracefully falls back to using the original user input.&lt;/P&gt;&lt;H3&gt;Human-in-the-Loop: Teams Notification&lt;/H3&gt;&lt;P&gt;Now that we have a clean, enriched summary, we need to let the support team know. The worker calls&amp;nbsp;send_to_teams, which formats the enriched data into a nice&amp;nbsp;&lt;STRONG&gt;MessageCard&lt;/STRONG&gt;&amp;nbsp;and posts it to a designated&amp;nbsp;&lt;STRONG&gt;Teams channel&lt;/STRONG&gt;&amp;nbsp;via a webhook.&lt;/P&gt;&lt;P&gt;The support team now sees a clean, AI-summarized notification, giving them instant visibility.&lt;/P&gt;&lt;H3&gt;The 'Agent' Decides: AI-Driven Action&lt;/H3&gt;&lt;P&gt;This is the "agentic" part of the workflow. Just notifying a channel is good, but&amp;nbsp;&lt;EM&gt;true&lt;/EM&gt;&amp;nbsp;automation means taking the next step.&lt;/P&gt;&lt;P&gt;The worker calls&amp;nbsp;decide_action, which uses the&amp;nbsp;&lt;STRONG&gt;Microsoft Agent Framework&lt;/STRONG&gt;&amp;nbsp;(powered by an&amp;nbsp;&lt;STRONG&gt;AzureOpenAIChatClient&lt;/STRONG&gt;). We prompt the agent with the key data (category, priority, and the user's original&amp;nbsp;ActionHint).&lt;/P&gt;&lt;P&gt;The agent's job is to intelligently decide the best action. It returns a simple JSON response like&amp;nbsp;{ "action": "create-task" }. This is far more powerful than a simple&amp;nbsp;if/else&amp;nbsp;block, as it can be trained to handle nuanced requests. 
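&lt;/P&gt;&lt;P&gt;That guard rail can be a few lines (a sketch with hypothetical names; the action strings match the four handlers described next): validate the agent's JSON against the known action set, and fall back to the user's ActionHint.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import json

ALLOWED_ACTIONS = {"notify-team", "create-task", "create-ticket", "store-only"}

def resolve_action(agent_reply, action_hint):
    """Parse the agent's JSON reply; fall back to the user's hint on failure."""
    try:
        action = json.loads(agent_reply).get("action")
        if action in ALLOWED_ACTIONS:
            return action
    except Exception:
        pass
    # Agent failed or invented an unknown action: trust the hint, else store-only.
    return action_hint if action_hint in ALLOWED_ACTIONS else "store-only"&lt;/LI-CODE&gt;&lt;P&gt;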
The system defaults to the user's hint if the agent fails.&lt;/P&gt;&lt;H3&gt;Execution: Closing the Loop in M365&lt;/H3&gt;&lt;P&gt;Based on the agent's decision, the worker executes one of four actions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;notify-team:&lt;/STRONG&gt;&amp;nbsp;Uses&amp;nbsp;&lt;STRONG&gt;Azure Communication Services (ACS)&lt;/STRONG&gt;&amp;nbsp;to send a formatted email to a distribution list.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;create-task:&lt;/STRONG&gt;&amp;nbsp;Uses&amp;nbsp;&lt;STRONG&gt;MSAL&lt;/STRONG&gt;&amp;nbsp;to get a&amp;nbsp;&lt;STRONG&gt;Microsoft Graph&lt;/STRONG&gt;&amp;nbsp;token and directly creates a new task in a specific&amp;nbsp;&lt;STRONG&gt;Planner&lt;/STRONG&gt;&amp;nbsp;plan/bucket.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;create-ticket:&lt;/STRONG&gt;&amp;nbsp;Makes an HTTP POST to a&amp;nbsp;&lt;STRONG&gt;Power Automate&lt;/STRONG&gt;&amp;nbsp;flow, which can then connect to any system (like ServiceNow, JIRA, etc.) to create a formal ticket.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;store-only:&lt;/STRONG&gt;&amp;nbsp;Does nothing further. The request is stored and visible in Teams, but no other action is taken.&lt;/LI&gt;&lt;/UL&gt;&lt;H3&gt;Visualizing the Interactions (Sequence Diagram)&lt;/H3&gt;&lt;img /&gt;&lt;H2&gt;Conclusion&lt;/H2&gt;&lt;P&gt;This architecture provides a powerful, scalable, and intelligent solution for a common business problem. By leveraging a decoupled, event-driven design with serverless components, the system is both cost-effective and resilient.&lt;/P&gt;&lt;P&gt;The real power, however, comes from the two-stage AI: first, for&amp;nbsp;&lt;STRONG&gt;enrichment&lt;/STRONG&gt;&amp;nbsp;(making data human-readable) and second, for&amp;nbsp;&lt;STRONG&gt;decision-making&lt;/STRONG&gt;&amp;nbsp;(making the system autonomous). 
This "agentic" pattern, deeply integrated with the Microsoft 365 ecosystem, is a clear look at the future of business process automation.&lt;/P&gt;&lt;H2&gt;Bonus Round: An Analytics Agent for Process Insights&lt;/H2&gt;&lt;P&gt;We can easily extend this project by adding a&amp;nbsp;&lt;STRONG&gt;Chat Interface Agent&lt;/STRONG&gt;. Imagine a simple chat UI (in Teams, or its own web page) where a support manager can ask, in plain English:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;"How many total tickets did we receive today?"&lt;/LI&gt;&lt;LI&gt;"Show me all 'High' priority requests for the 'IT' category."&lt;/LI&gt;&lt;LI&gt;"Which team had the most 'create-task' actions assigned?"&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Technically, this is another "agent" (powered by&amp;nbsp;&lt;STRONG&gt;Azure OpenAI&lt;/STRONG&gt;) that translates the user's natural language question into a valid&amp;nbsp;&lt;STRONG&gt;OData query&lt;/STRONG&gt; for our&amp;nbsp;HelpdeskRequests&amp;nbsp;table. It then fetches the data, summarizes it, and presents the answer in the chat. 
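&lt;/P&gt;&lt;P&gt;The query-building half of that translation can be sketched as a plain helper (hypothetical; in practice the model would produce the filter, which you should validate before executing). Category is the table's PartitionKey, per the ingestion step:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def build_odata_filter(category=None, priority=None):
    """Build an OData $filter string for the HelpdeskRequests table."""
    clauses = []
    if category:
        clauses.append(f"PartitionKey eq '{category}'")
    if priority:
        clauses.append(f"Priority eq '{priority}'")
    return " and ".join(clauses)

# build_odata_filter(category="IT", priority="High")
# returns "PartitionKey eq 'IT' and Priority eq 'High'"&lt;/LI-CODE&gt;&lt;P&gt;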
This creates a powerful, conversational "Copilot" for our new helpdesk process, giving us instant, natural language access to our operational data.&lt;/P&gt;&lt;H3&gt;Git Repo&lt;/H3&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://github.com/passadis/agents-helpdesk" target="_blank" rel="noopener"&gt;Agentic AI Helpdesk&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;References&lt;/H2&gt;&lt;P&gt;For more in-depth information on the services and frameworks used in this post, check out the official Microsoft Learn documentation:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/service-bus-messaging/service-bus-messaging-overview?wt.mc_id=MVP_365598" target="_blank" rel="noopener"&gt;Azure Service Bus messaging overview&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://learn.microsoft.com/agent-framework/overview/agent-framework-overview?wt.mc_id=MVP_365598" target="_blank" rel="noopener"&gt;Microsoft Agent Framework overview&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;&lt;A href="https://learn.microsoft.com/agent-framework/user-guide/agents/agent-tools?pivots=programming-language-python&amp;amp;wt.mc_id=MVP_365598" target="_blank" rel="noopener"&gt;Agent tools (Python) - Microsoft Agent Framework&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Wed, 03 Dec 2025 04:52:29 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/blog/building-an-agentic-ai-powered-helpdesk-with-agents-framework/ba-p/4474756</guid>
      <dc:creator>KonstantinosPassadis</dc:creator>
      <dc:date>2025-12-03T04:52:29Z</dc:date>
    </item>
    <item>
      <title>Azure Live Voice API and Avatar Creation Demo</title>
      <link>https://techcommunity.microsoft.com/t5/blog/azure-live-voice-api-and-avatar-creation-demo/ba-p/4463292</link>
      <description>&lt;H3&gt;&lt;STRONG&gt;How It Works&lt;/STRONG&gt;&lt;/H3&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Azure AI Live Voice API&lt;/STRONG&gt; delivers natural, emotionally nuanced speech with low latency, enabling the trainer avatar to speak fluidly and adaptively.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Avatar SDK&lt;/STRONG&gt; animates facial expressions, lip sync, and gestures based on the synthesized voice, creating a cohesive and human-like presentation.&lt;/LI&gt;&lt;LI&gt;The integration ensures &lt;STRONG&gt;consistent flow&lt;/STRONG&gt;, where voice and visuals are tightly coupled—no awkward pauses, no robotic delivery.&lt;/LI&gt;&lt;/UL&gt;&lt;H3&gt;&lt;STRONG&gt;Use Case: Trainer-to-Student Demo App&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;Imagine a virtual classroom where:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;A digital trainer introduces concepts, explains diagrams, and answers questions—all with realistic voice and avatar presence.&lt;/LI&gt;&lt;LI&gt;Students engage with content more deeply thanks to the avatar’s expressive delivery and conversational tone.&lt;/LI&gt;&lt;LI&gt;The system can scale across languages, topics, and formats—perfect for onboarding, education, or enterprise training.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This demo isn’t just a showcase—it’s a blueprint for the future of interactive learning and communication. 
By combining Azure’s cutting-edge voice synthesis with avatar animation, we’re redefining how knowledge is delivered in digital environments.&lt;/P&gt;&lt;LI-SPOILER label="Video Link - Workshop Session"&gt;&lt;P&gt;&lt;A class="lia-external-url" href="https://youtube.com/live/8Vhgqg1J56E" data-lia-auto-title-active="0" data-lia-auto-title="- YouTube" target="_blank"&gt;- YouTube&lt;/A&gt;&lt;/P&gt;&lt;/LI-SPOILER&gt;&lt;P&gt;Drop your questions or ideas in the comments—I would love to hear how you’re using AI to shape the future of communication.&lt;/P&gt;&lt;P&gt;Until next time, keep building, keep learning, and keep pushing the boundaries of what’s possible.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;PS: The code will be shared soon!&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 22 Oct 2025 01:04:29 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/blog/azure-live-voice-api-and-avatar-creation-demo/ba-p/4463292</guid>
      <dc:creator>KonstantinosPassadis</dc:creator>
      <dc:date>2025-10-22T01:04:29Z</dc:date>
    </item>
    <item>
      <title>Building an AI Assistant for Microsoft Learn Docs MCP</title>
      <link>https://techcommunity.microsoft.com/t5/blog/building-an-ai-assistant-for-microsoft-learn-docs-mcp/ba-p/4431611</link>
      <description>&lt;H3&gt;What is the Microsoft Learn Docs MCP Server?&lt;/H3&gt;&lt;P&gt;The Microsoft Docs MCP Server is a cloud-hosted service that enables MCP hosts like GitHub Copilot and Cursor to search and retrieve accurate information directly from Microsoft’s official documentation. By implementing the standardized Model Context Protocol (MCP), this service allows any compatible AI system to ground its responses in authoritative Microsoft content.&lt;/P&gt;&lt;H3&gt;The “Why”: The Initial Challenge of the AI Assistant&lt;/H3&gt;&lt;P&gt;Our initial goal was simple: build a chat interface where a user could ask a question, and we would query this MCP server to get an answer. However, we quickly discovered a challenge. The MCP server is designed for AI agents; it doesn’t just return a single answer. Instead, it returns a rich payload of up to &lt;STRONG&gt;10 high-quality content chunks&lt;/STRONG&gt; from the documentation.&lt;/P&gt;&lt;P&gt;While this is fantastic for providing comprehensive context, it’s overwhelming for a direct chat response. Our app was successfully retrieving information, but it was just a firehose of data. The user was left with the difficult task of sifting through thousands of words to find their answer. 
This wasn’t a chatbot; it was just a complicated search bar.&lt;/P&gt;&lt;P&gt;We realized we needed to transform this wealth of data into a single, helpful response.&lt;/P&gt;&lt;H3&gt;The “How”: A Two-Step Solution (Retrieve and Synthesize)&lt;/H3&gt;&lt;P&gt;The solution was to architect our application around a two-step process, a pattern often referred to as Retrieval-Augmented Generation (RAG).&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Retrieve:&lt;/STRONG&gt; First, connect to the specialized data source (the MCP Server) to fetch relevant, factual context.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Synthesize:&lt;/STRONG&gt; Then, provide that context to a general-purpose large language model (LLM) to generate a concise, human-readable answer.&lt;/LI&gt;&lt;/OL&gt;&lt;H3&gt;Step 1: The Retrieval Saga – Taming the MCP Server&lt;/H3&gt;&lt;P&gt;This was the most challenging part of our journey. Connecting to the MCP server wasn’t straightforward, and it took several iterations of trial and error to get it right. We knew the endpoint was https://learn.microsoft.com/api/mcp, but the exact request format was a mystery we had to solve by carefully analyzing the server’s error messages.&lt;/P&gt;&lt;P&gt;Our attempts ranged from simple, direct MCP messages to various JSON-RPC structures. After a lot of debugging, we landed on the precise payload the server expected. The key was to format the request as a JSON-RPC call with a specific method (tools/call) and parameters that named the tool (microsoft_docs_search) and its arguments (question). 
Our AI Assistant is ready to work alongside Microsoft Learn MCP!&lt;/P&gt;&lt;H3&gt;Working Examples&lt;/H3&gt;&lt;P&gt;Here is the final, working code snippet from our Next.js API route (pages/api/mcp.js) that successfully retrieves the context:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;// --- STEP 1: RETRIEVE CONTEXT FROM MCP SERVER (Using the final working logic) ---
const MCP_SERVER_URL = 'https://learn.microsoft.com/api/mcp';

// This is the successful payload structure we discovered.
const mcpPayload = {
  "jsonrpc": "2.0",
  "id": `chat-${Date.now()}`,
  "method": "tools/call",
  "params": {
    "name": "microsoft_docs_search",
    "arguments": {
      "question": userQuery // Using 'question' as the parameter name.
    }
  }
};

const mcpResponse = await fetch(MCP_SERVER_URL, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Accept': 'application/json, text/event-stream',
    'User-Agent': 'mcp-remote-client', // Adding the User-Agent.
  },
  body: JSON.stringify(mcpPayload),
});

// ... code to parse the streaming response ...&lt;/LI-CODE&gt;&lt;P&gt;With this, we had successfully completed the “Retrieve” step. Our app was now a robust data-fetcher.&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Step 2: The Synthesis Engine – Making Sense of the Data with our AI Assistant&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;Now that we had our 10 chunks of documentation, we needed to add the “brain” to our operation. 
We chose an Azure OpenAI model for this task due to its powerful synthesis capabilities.&lt;/P&gt;&lt;P&gt;The process was as follows:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Extract all the text from the search results returned by the MCP server.&lt;/LI&gt;&lt;LI&gt;Combine this text into a single, large block of context.&lt;/LI&gt;&lt;LI&gt;Create a carefully designed prompt that instructs the AI Assistant model to act as a Microsoft expert.&lt;/LI&gt;&lt;LI&gt;Send the user’s original question along with the retrieved context to the Azure OpenAI API.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;The prompt is the most critical piece of this step, as it guides the AI Assistant to produce the desired output. Here’s what our prompt looked like:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;const azureUrl = `${AZURE_OPENAI_ENDPOINT}/openai/deployments/${AZURE_OPENAI_DEPLOYMENT_NAME}/chat/completions?api-version=2025-01-01-preview`;

const aiResponse = await fetch(azureUrl, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'api-key': AZURE_OPENAI_KEY
  },
  body: JSON.stringify({
    messages: [
      { role: 'system', content: 'You are an expert assistant. Generate answers based on the provided context.' },
      { role: 'user', content: `Context:\n${retrievedText}\n\nQuestion:\n${message}` }
    ],
    max_tokens: 500,
    temperature: 0.7
  }),
});

if (!aiResponse.ok) {
  const errorText = await aiResponse.text();
  console.error('Azure OpenAI Error:', errorText);
  throw new Error('Failed to get a response from Azure OpenAI.');
}&lt;/LI-CODE&gt;&lt;P&gt;This prompt constrains the model to use &lt;STRONG&gt;&lt;EM&gt;only &lt;/EM&gt;&lt;/STRONG&gt;the official documentation we provided, ensuring the answers are factual and grounded in a reliable source.&lt;/P&gt;&lt;H3&gt;The “What”: The Final Result of our AI Assistant&lt;/H3&gt;&lt;P&gt;After implementing both steps, our application was complete. The user interacts with a clean, simple chat interface built with Next.js and React. 
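Behind that interface, the extract-and-combine step (items 1 and 2 in the list above) is a small pure function. A sketch, assuming the MCP tools/call result carries an array of `{ type: "text", text }` content chunks:

```javascript
// Pull the text out of each MCP content chunk and merge everything into one
// context block; the separator keeps chunk boundaries visible to the model.
function combineMcpChunks(mcpResult) {
  const chunks = (mcpResult.content || [])
    .filter((c) => c.type === "text")
    .map((c) => c.text);
  return chunks.join("\n\n---\n\n");
}
```

Keeping this step pure makes it trivial to unit-test, independently of both the MCP server and Azure OpenAI.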
When they ask a question:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;The Next.js backend silently queries the MS Learn MCP Server.&lt;/LI&gt;&lt;LI&gt;It receives up to 10 articles of context.&lt;/LI&gt;&lt;LI&gt;The backend passes that context and the original question to the Azure OpenAI API.&lt;/LI&gt;&lt;LI&gt;Now we receive a concise, summarized answer.&lt;/LI&gt;&lt;LI&gt;This final answer is displayed to the user in the chat window.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;What started as a data firehose was now an intelligent, conversational, and genuinely helpful AI assistant.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;img /&gt;&lt;H4&gt;&amp;nbsp;Conclusion&lt;/H4&gt;&lt;P&gt;This project was a fantastic lesson in modern AI application development. It highlights a powerful pattern: using specialized, data-retrieval tools in tandem with large, general-purpose language models. The journey underscored the importance of persistence in debugging APIs and the art of crafting the perfect prompt. By combining the strengths of different services, we were able to build an application that is far more capable than the sum of its parts.&lt;/P&gt;&lt;H4&gt;Git Repo&lt;/H4&gt;&lt;P&gt;&lt;STRONG&gt;&lt;A class="lia-external-url" href="https://github.com/passadis/mslearn-mcp-chat" target="_blank"&gt;https://github.com/passadis/mslearn-mcp-chat&lt;/A&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;H4&gt;Microsoft Learn Docs MCP&lt;/H4&gt;&lt;P&gt;&lt;STRONG&gt;&lt;A class="lia-external-url" href="https://github.com/MicrosoftDocs/mcp/" target="_blank"&gt;https://github.com/MicrosoftDocs/mcp/&lt;/A&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 09 Jul 2025 11:41:57 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/blog/building-an-ai-assistant-for-microsoft-learn-docs-mcp/ba-p/4431611</guid>
      <dc:creator>KonstantinosPassadis</dc:creator>
      <dc:date>2025-07-09T11:41:57Z</dc:date>
    </item>
    <item>
      <title>Responsible AI and the Evolution of AI Security</title>
      <link>https://techcommunity.microsoft.com/t5/blog/responsible-ai-and-the-evolution-of-ai-security/ba-p/4429549</link>
      <description>&lt;H2&gt;Why Responsible AI Matters&lt;/H2&gt;&lt;P&gt;&lt;STRONG&gt;Responsible AI&lt;/STRONG&gt; means designing, developing, and deploying AI systems that are ethical, transparent, and accountable. It's not just about compliance—it's about building trust, protecting users, and ensuring AI benefits everyone.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Key Principles of Responsible AI:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Fairness:&lt;/STRONG&gt; Avoiding biases and discrimination by using diverse datasets and regular audits.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Reliability &amp;amp; Safety:&lt;/STRONG&gt; Rigorous testing to ensure AI performs as intended, even in unexpected scenarios.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Privacy &amp;amp; Security:&lt;/STRONG&gt; Protecting user data with robust safeguards.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Transparency:&lt;/STRONG&gt; Making AI decisions explainable and understandable.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Accountability:&lt;/STRONG&gt; Establishing governance to address negative impacts.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Inclusiveness:&lt;/STRONG&gt; Considering diverse user needs and perspectives.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Responsible AI reduces bias, increases transparency, and builds user trust—critical as AI systems increasingly impact finance, healthcare, public services, and more.&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;Implementing Responsible AI isn't just about ethical ideals—it's a foundation that demands technical safeguards. For developers, this means translating principles like fairness and transparency into secure code, robust data handling, and model hardening strategies that preempt real-world AI threats.&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;H2&gt;The Evolution of AI Security: From Afterthought to Essential&lt;/H2&gt;&lt;P&gt;AI security has come a long way—from an afterthought to a central pillar of modern digital defense. 
In the early days, security was reactive, with threats addressed only after damage occurred. The integration of AI shifted this paradigm, enabling &lt;STRONG&gt;proactive threat detection&lt;/STRONG&gt; and &lt;STRONG&gt;behavioral analytics&lt;/STRONG&gt; that spot anomalies before they escalate.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Key Milestones in AI Security:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Pattern Recognition:&lt;/STRONG&gt; Early AI focused on detecting unusual patterns, laying the groundwork for threat detection.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Expert Systems:&lt;/STRONG&gt; Rule-based systems in the 1970s-80s emulated human decision-making for security assessments.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Machine Learning:&lt;/STRONG&gt; The late 1990s saw the rise of ML algorithms that could analyze vast data and predict threats.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Deep Learning:&lt;/STRONG&gt; Neural networks now recognize complex threats and adapt to evolving attack methods.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Real-Time Defense:&lt;/STRONG&gt; Modern AI-driven platforms (like Darktrace) create adaptive, self-learning security environments that anticipate and neutralize threats proactively.&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;Why AI Security Is Now Mandatory&lt;/H2&gt;&lt;P&gt;With the explosion of AI-powered applications and cloud services, security risks have multiplied. 
AI attacks are a new frontier in cybersecurity.&lt;/P&gt;&lt;img&gt;[Image: AI Shield and various AI attack titles]&lt;/img&gt;&lt;H3&gt;What Are AI Attacks?&lt;/H3&gt;&lt;P&gt;AI attacks are malicious activities that target AI systems and models.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Data Poisoning:&lt;/STRONG&gt; Attackers manipulate training data to corrupt AI outputs.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Model Theft:&lt;/STRONG&gt; Sensitive models and datasets can be stolen or reverse-engineered.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Adversarial Attacks:&lt;/STRONG&gt; Malicious inputs can trick AI systems into making wrong decisions.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Privacy Breaches:&lt;/STRONG&gt; Sensitive user data can leak if not properly protected.&lt;/LI&gt;&lt;/UL&gt;&lt;img&gt;[Image: AI Shield in front of a server rack]&lt;/img&gt;&lt;P&gt;Regulatory frameworks and industry standards now require organizations to adopt robust AI security practices to protect users, data, and critical infrastructure.&lt;/P&gt;&lt;H2&gt;Tools and Techniques for Secure AI Infrastructure and Applications&lt;/H2&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Zero Trust Architecture&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Adopt a "never trust, always verify" approach.&lt;/LI&gt;&lt;LI&gt;Enforce strict authentication and authorization for every user and device&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Data Security Protocols&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Encrypt data at rest, in transit, and during processing.&lt;/LI&gt;&lt;LI&gt;Use tools like Microsoft Purview for data classification, cataloging, and access control&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Harden AI Models&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Train models with adversarial examples.&lt;/LI&gt;&lt;LI&gt;Implement input validation, anomaly detection, and regular security assessments&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Secure API and Endpoint 
Management&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Use API gateways, OAuth 2.0, and TLS to secure endpoints.&lt;/LI&gt;&lt;LI&gt;Monitor and rate-limit API access to prevent abuse.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Continuous Monitoring and Incident Response&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Deploy AI-powered Security Information and Event Management (SIEM) systems for real-time threat detection and response&lt;/LI&gt;&lt;LI&gt;Regularly audit logs and security events across your infrastructure.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;DevSecOps Integration&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Embed security into every phase of the AI development lifecycle.&lt;/LI&gt;&lt;LI&gt;Automate security testing in CI/CD pipelines.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Employee Training and Governance&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Train teams on AI-specific risks and responsible data handling.&lt;/LI&gt;&lt;LI&gt;Establish clear governance frameworks for AI ethics and compliance&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Azure-Specific Security Tools&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Microsoft Defender for Cloud:&lt;/STRONG&gt; Monitors and protects Azure resources.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Azure Resource Graph Explorer:&lt;/STRONG&gt; Maintains inventory of models, data, and assets.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Microsoft Purview:&lt;/STRONG&gt; Manages data security, privacy, and compliance across Azure services.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;Microsoft Purview provides a centralized platform for data governance, security, and compliance across your entire data estate.&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;H2&gt;Why Microsoft Purview Matters for Responsible AI&lt;/H2&gt;&lt;P&gt;&lt;STRONG&gt;Microsoft Purview&lt;/STRONG&gt; offers a unified, cloud-native solution 
for:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Data discovery and classification&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Access management and policy enforcement&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Compliance monitoring and risk mitigation&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Data quality and observability&lt;/STRONG&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Purview's integrated approach ensures that AI systems are built on &lt;STRONG&gt;trusted, well-governed, and secure data&lt;/STRONG&gt;, addressing the core principles of responsible AI: fairness, transparency, privacy, and accountability.&lt;/P&gt;&lt;H2&gt;Conclusion&lt;/H2&gt;&lt;P&gt;Responsible AI and strong AI security measures are no longer optional; they are essential pillars of modern application development and integration on Azure. By adhering to ethical principles and utilizing cutting-edge security tools and strategies, organizations can drive innovation with confidence while safeguarding users, data, and the broader society.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Jul 2025 22:12:40 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/blog/responsible-ai-and-the-evolution-of-ai-security/ba-p/4429549</guid>
      <dc:creator>KonstantinosPassadis</dc:creator>
      <dc:date>2025-07-02T22:12:40Z</dc:date>
    </item>
    <item>
      <title>Navigating the New AI Landscape: A Developer’s Journey Through the Noise</title>
      <link>https://techcommunity.microsoft.com/t5/blog/navigating-the-new-ai-landscape-a-developer-s-journey-through/ba-p/4424088</link>
      <description>&lt;P&gt;If you’ve opened your dev environment lately and felt like you were being chased by a parade of new frameworks, copilots, and orchestration platforms—you’re not alone. The pace of innovation is exhilarating, but even the most seasoned developer can feel like they're sprinting just to stay in place.&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;"But what if this overwhelming surge isn't chaos—but opportunity?"&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;In this post, I’ll share how .NET became my compass in the AI wilderness, how Microsoft’s ecosystem—from Semantic Kernel to GitHub Copilot and Azure’s Low Code tools—helped me stop reacting to change and start shaping it.&lt;/P&gt;&lt;H4&gt;The Tool Rush: Friend or Foe?&lt;/H4&gt;&lt;P&gt;Just a few years ago, developers wrestled with a handful of choices—framework A or B, cloud service X or Y. Now? We’re staring at a buffet of copilots, orchestration engines, agent frameworks, model registries, vector databases… you name it.&lt;/P&gt;&lt;P&gt;It’s a lot.&lt;/P&gt;&lt;P&gt;But here’s the thing: the surge in tools isn’t a crisis—it’s &lt;EM&gt;a sign of acceleration&lt;/EM&gt;. Each new SDK or framework that launches is another attempt to close the gap between &lt;EM&gt;intention&lt;/EM&gt; and &lt;EM&gt;execution&lt;/EM&gt;. The goal isn’t more complexity—it’s more &lt;STRONG&gt;flexibility&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;Take Microsoft’s ecosystem: it doesn’t ask developers to start from scratch. 
It asks, &lt;EM&gt;“&lt;STRONG&gt;What do you already know, and how can we build from there?”&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;.NET developers?&lt;/STRONG&gt; Enter Semantic Kernel—use your C# skills to orchestrate intelligent workflows.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Prefer low-code?&lt;/STRONG&gt; Power Platform and Azure AI Studio are building serious agentic capabilities without the learning curve.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Visual learners?&lt;/STRONG&gt; GitHub Copilot and VS Code now surface suggestions that reflect your habits and project context, not just boilerplate code.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The key is to pick a &lt;STRONG&gt;starting point&lt;/STRONG&gt;, not all the points. Let the tools orbit around your needs—not the other way around.&lt;/P&gt;&lt;H4&gt;.NET + Semantic Kernel: A Familiar Face in New Territory&lt;/H4&gt;&lt;P&gt;As developers, we find comfort in the familiar. And for many of us, that means .NET. So when AI began racing ahead—introducing concepts like “planners,” “memories,” and “skills”—the question was: &lt;EM&gt;How do I even start integrating this into my existing stack?&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;That’s where &lt;STRONG&gt;Semantic Kernel (SK)&lt;/STRONG&gt; steps in—not as a replacement for what we know, but as an &lt;STRONG&gt;extension of it&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;Semantic Kernel brings orchestration to the .NET world. You can build intelligent agents that combine:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Native C# functions&lt;/STRONG&gt;, side-by-side with&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;AI-powered plugins&lt;/STRONG&gt;, like OpenAI or Azure OpenAI, wrapped in clean abstractions.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Let’s make that real:&lt;/P&gt;&lt;P&gt;Imagine a customer support scenario where GitHub issues are piling up. 
With SK, you can build an agent that:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Reads incoming issues,&lt;/LI&gt;&lt;LI&gt;Classifies urgency and topic,&lt;/LI&gt;&lt;LI&gt;Routes them to the correct dev team,&lt;/LI&gt;&lt;LI&gt;Or even drafts suggested responses for review.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;All while leveraging the APIs and data sources you already use in enterprise apps.&lt;/P&gt;&lt;P&gt;Plus, when paired with &lt;STRONG&gt;GitHub Copilot’s new Agent Mode&lt;/STRONG&gt;, you’re not just writing orchestration logic—you’re collaborating with an AI that understands your .NET patterns, your repo context, and your intentions.&lt;/P&gt;&lt;P&gt;It’s like gaining a teammate that’s fluent in your language—code and otherwise.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;SPAN class="lia-text-color-14"&gt;&lt;EM&gt;A simple plugin for routing GitHub issues based on urgency:&lt;/EM&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="csharp"&gt;[SKFunction("Classifies urgency of GitHub issue")]
[SKFunctionName("IssueUrgencyClassifier")]
public async Task&amp;lt;string&amp;gt; ClassifyUrgencyAsync(string issueTitle)
{
    if (issueTitle.Contains("crash") || issueTitle.Contains("urgent"))
        return "High";
    if (issueTitle.Contains("delay") || issueTitle.Contains("slow"))
        return "Medium";
    return "Low";
}&lt;/LI-CODE&gt;&lt;P&gt;Pair this with an OpenAI plugin in SK, and now you’ve got a hybrid agent that can reason and respond.&lt;/P&gt;&lt;H4&gt;Live Context, Real Impact: MCP Server &amp;amp; Microsoft Fabric&lt;/H4&gt;&lt;P&gt;If Semantic Kernel helps you &lt;EM&gt;think&lt;/EM&gt; like an orchestrator, &lt;STRONG&gt;MCP Server&lt;/STRONG&gt; helps you &lt;EM&gt;act&lt;/EM&gt; like one—at scale.&lt;/P&gt;&lt;P&gt;At Build 2025, Microsoft showcased MCP as the brain behind &lt;EM&gt;adaptive AI systems&lt;/EM&gt;. It handles:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Real-time signal processing&lt;/STRONG&gt;, so your apps respond to new context on the fly,&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Agent lifecycle management&lt;/STRONG&gt;, ensuring copilots behave as expected across long-running tasks,&lt;/LI&gt;&lt;LI&gt;And &lt;STRONG&gt;cross-tool integration&lt;/STRONG&gt;, making even complex AI systems easier to orchestrate.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN class="lia-text-color-15"&gt;&lt;STRONG&gt;&lt;EM&gt;A real-time data listener reacting to GitHub issue changes:&lt;/EM&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="json"&gt;{
  "eventType": "GitHub.Issue.Created",
  "filter": {
    "repository": "myorg/project-ai"
  },
  "action": "Invoke-SemanticAgent",
  "context": ["issue.title", "issue.body"]
}&lt;/LI-CODE&gt;&lt;P&gt;MCP Server ensures every new issue becomes a trigger—not just a ticket in a queue.&lt;/P&gt;&lt;P&gt;Add &lt;STRONG&gt;Microsoft Fabric&lt;/STRONG&gt; to the mix, and now you’ve got:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;A unified data layer to store, query, and stream insights,&lt;/LI&gt;&lt;LI&gt;Built-in governance and security,&lt;/LI&gt;&lt;LI&gt;And seamless pipelines into tools like Power BI, Azure Synapse, and—yes—Copilot experiences.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;SPAN class="lia-text-color-11"&gt;&lt;EM&gt;&lt;STRONG&gt;A quick aggregation of issue volume by urgency over time:&lt;/STRONG&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import window
df = spark.read.load("fabric://github/issues")
df.groupBy(window("created_at", "1 day"), "urgency").count().show()&lt;/LI-CODE&gt;&lt;P&gt;This gives your low-code Power BI dashboard real, traceable, live context.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Scenario in action:&lt;/STRONG&gt; A developer builds a low-code issue triage dashboard using &lt;STRONG&gt;Power Apps&lt;/STRONG&gt;, connected to real-time GitHub data via &lt;STRONG&gt;MCP Server&lt;/STRONG&gt;. When a spike in customer complaints arises:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Fabric surfaces insights immediately,&lt;/LI&gt;&lt;LI&gt;An SK-powered agent prioritizes tickets using embeddings + OpenAI,&lt;/LI&gt;&lt;LI&gt;And GitHub Copilot proposes hotfix code directly in VS Code.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;All of it managed across services—with context flowing like water between them.&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;This is no longer about individual tools. It’s about the ecosystem working in harmony.&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;H4&gt;Wrapping It Up: Tools Change, Curiosity Endures&lt;/H4&gt;&lt;P&gt;In the whirlwind of new frameworks, copilots, SDKs, and orchestration layers, it’s easy to feel like you’re chasing shadows. But this is the truth I’ve come to believe:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;It’s not about learning everything—it’s about learning what moves you forward.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Microsoft’s ecosystem doesn’t demand mastery overnight. It offers on-ramps, whether you’re a &lt;STRONG&gt;.NET&lt;/STRONG&gt; dev, a low-code builder, or someone just beginning their AI journey. 
From &lt;STRONG&gt;Semantic Kernel&lt;/STRONG&gt; to &lt;STRONG&gt;GitHub Copilot&lt;/STRONG&gt;, from &lt;STRONG&gt;MCP to Power Platform&lt;/STRONG&gt;, each piece is a tile in a larger mosaic—one that’s built to meet &lt;STRONG&gt;you&lt;/STRONG&gt;, not overwhelm you.&lt;/P&gt;&lt;P&gt;What &lt;STRONG&gt;Build 2025&lt;/STRONG&gt; proved is that we’re not witnessing the future of development—we’re &lt;EM&gt;shaping it&lt;/EM&gt;, right now. One skill. One tool. One experiment at a time.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;So let’s not ask “Which tool is best?” Let’s ask, “&lt;STRONG&gt;Which tool helps me create what matters?”&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Because in the end, that’s what being a developer has always been about. And in this new era, the possibilities are finally catching up to our imaginations.&lt;/P&gt;</description>
      <pubDate>Mon, 16 Jun 2025 00:05:01 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/blog/navigating-the-new-ai-landscape-a-developer-s-journey-through/ba-p/4424088</guid>
      <dc:creator>KonstantinosPassadis</dc:creator>
      <dc:date>2025-06-16T00:05:01Z</dc:date>
    </item>
    <item>
      <title>Learn and Integrate: Azure and AI Web Development</title>
      <link>https://techcommunity.microsoft.com/t5/blog/learn-and-integrate-azure-and-ai-web-development/ba-p/4423986</link>
      <description>&lt;H2&gt;Why Combine Azure and AI?&lt;/H2&gt;&lt;P&gt;Microsoft Azure provides a robust ecosystem for deploying web apps, but its real power lies in its integrated AI services. By combining JavaScript-based frontends or Node.js backends with Azure’s AI capabilities, developers can:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Add natural language understanding with &lt;STRONG&gt;Azure OpenAI Service&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Use &lt;STRONG&gt;Cognitive Services&lt;/STRONG&gt; for vision, speech, and language features&lt;/LI&gt;&lt;LI&gt;Build chatbots via &lt;STRONG&gt;Azure Bot Services&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Enable intelligent search with &lt;STRONG&gt;Azure Cognitive Search&lt;/STRONG&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;These tools not only enhance user experience but also automate tasks, generate insights, and drive efficiency.&lt;/P&gt;&lt;H2&gt;Getting Started: The Developer's Toolkit&lt;/H2&gt;&lt;P&gt;If you're working with JavaScript, TypeScript, or popular frameworks like React, Angular, or Vue.js, Azure has you covered. 
Your journey might include:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Azure Static Web Apps&lt;/STRONG&gt;: Host static sites with dynamic behavior using serverless APIs.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Azure Functions&lt;/STRONG&gt;: Run backend logic or trigger AI processing without managing servers.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Azure AI Studio&lt;/STRONG&gt;: A unified interface for creating and managing AI models and prompts.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;OpenAI JavaScript SDK&lt;/STRONG&gt;: Easily integrate models like GPT-4 or GPT-4o into your apps.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;You can even use &lt;STRONG&gt;LangChain.js&lt;/STRONG&gt; for advanced orchestration or build Retrieval-Augmented Generation (RAG) systems for context-aware AI.&lt;/P&gt;&lt;H4&gt;Sample Code: AI-Powered Text Summarizer&lt;/H4&gt;&lt;P&gt;Here’s a lightweight Azure Function using the Azure OpenAI client library for JavaScript (@azure/openai) to summarize text via GPT-4:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;const { OpenAIClient, AzureKeyCredential } = require("@azure/openai");

module.exports = async function (context, req) {
  // Endpoint and key come from the Function App's application settings.
  const endpoint = process.env["AZURE_OPENAI_ENDPOINT"];
  const apiKey = process.env["AZURE_OPENAI_API_KEY"];
  const client = new OpenAIClient(endpoint, new AzureKeyCredential(apiKey));

  // Fall back to a placeholder if the POST body has no "text" field.
  const input = (req.body &amp;&amp; req.body.text) || "Summarize this placeholder text";
  const deployment = "gpt-4"; // the name of your Azure OpenAI model deployment

  const response = await client.getChatCompletions(deployment, [
    { role: "system", content: "Summarize the following text:" },
    { role: "user", content: input },
  ]);

  // Return the summary as the plain-text response body.
  context.res = {
    body: response.choices[0].message.content
  };
};&lt;/LI-CODE&gt;&lt;P&gt;This function runs serverlessly on Azure, processes a POST request, and returns a summary—perfect for blogs, support tickets, or academic research tools.&lt;/P&gt;&lt;H4&gt;Integration Diagram: AI-Infused Web App on Azure&lt;/H4&gt;&lt;LI-CODE lang="bash"&gt;[Client App (React/Vue)] 
     │
     ▼
[Azure Static Web App]
     │
     ▼
[Azure Function API Layer]
     │         │
     │         ├──&amp;gt; [Azure OpenAI (Text/Chat Completions)]
     │         └──&amp;gt; [Cognitive Services (Vision/Speech)]
     ▼
[Cosmos DB / Blob Storage (optional)]&lt;/LI-CODE&gt;&lt;P&gt;This setup handles client interaction on the frontend, routes requests to serverless APIs, and taps into AI services for intelligent functionality—all hosted and scaled on Azure.&lt;/P&gt;&lt;H2&gt;The Purpose&lt;/H2&gt;&lt;P&gt;At the heart of this Learning Room lies a bold yet practical mission: &lt;STRONG&gt;to empower developers, learners, and innovators to build the modern digital experience using the power of Azure and AI.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;This initiative is more than just a series of tutorials or tech showcases—it’s a space for discovery, experimentation, and collaboration. Whether you're an experienced developer or just starting to explore cloud and AI technologies, this Learning Room provides an inclusive environment to:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Understand the building blocks&lt;/STRONG&gt; of cloud-native web development&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Explore real-world integrations&lt;/STRONG&gt; of AI into applications&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Learn how to architect modern solutions&lt;/STRONG&gt; using Azure’s scalable tools&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Bridge the gap between theory and practice&lt;/STRONG&gt; with hands-on projects and code examples&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;By focusing on the synergy between modern web development (JavaScript, APIs, serverless) and intelligent features (like natural language processing and computer vision), this Learning Room becomes your launchpad for building apps that are not only functional—but truly intelligent.&lt;/P&gt;&lt;H2&gt;Real-World Inspiration&lt;/H2&gt;&lt;P&gt;From travel assistants powered by GPT to AI-supported blogging platforms that generate content drafts, developers across industries are turning concepts into reality with Azure and JavaScript. 
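As one concrete illustration, the summarizer Function shown earlier can be called from any JavaScript frontend with a plain fetch request. This is a minimal sketch, assuming the Function is exposed at a hypothetical /api/summarize route:

```javascript
// Minimal client for the summarizer Azure Function above.
// The "/api/summarize" route is an assumed example; adjust it to
// whatever route your Function app actually exposes.
async function summarize(text) {
  const res = await fetch("/api/summarize", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) {
    throw new Error(`Summarization request failed with status ${res.status}`);
  }
  // The Function returns the summary as plain text in the response body.
  return res.text();
}
```

On a Static Web App, requests under the /api prefix are proxied to the linked Functions backend, so a relative path like this typically needs no extra CORS or endpoint configuration.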
These integrations are not only cutting-edge—they’re practical and scalable.&lt;/P&gt;&lt;P&gt;This is your space to learn, create, break things (in a good way), and rebuild with deeper understanding. In a world that's rapidly transforming, &lt;STRONG&gt;this Learning Room is your home base for staying ahead of the curve.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Enjoy the experience!&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 14 Jun 2025 23:01:39 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/blog/learn-and-integrate-azure-and-ai-web-development/ba-p/4423986</guid>
      <dc:creator>KonstantinosPassadis</dc:creator>
      <dc:date>2025-06-14T23:01:39Z</dc:date>
    </item>
  </channel>
</rss>

