ollama

23 Topics

Engineering a Local-First Agentic Podcast Studio: A Deep Dive into Multi-Agent Orchestration
The transition from standalone Large Language Models (LLMs) to Agentic Orchestration marks the next frontier in AI development. We are moving away from simple "prompt-and-response" cycles toward a paradigm where specialized, autonomous units—AI Agents—collaborate to solve complex, multi-step problems. As a Technology Evangelist, my focus is on building these production-grade systems entirely on the edge, ensuring privacy, speed, and cost-efficiency. This technical guide explores the architecture and implementation of The AI Podcast Studio. This project demonstrates the seamless integration of the Microsoft Agent Framework, Local Small Language Models (SLMs), and VibeVoice to automate a complete tech podcast pipeline. I. The Strategic Intelligence Layer: Why Local-First? At the core of our studio is a Local-First philosophy. While cloud-based LLMs are powerful, they introduce friction in high-frequency, creative pipelines. By using Ollama as a model manager, we run SLMs like Qwen-3-8B directly on user hardware. 1. Architectural Comparison: Local vs. Cloud Choosing the deployment environment is a fundamental architectural decision. For an agentic podcasting workflow, the edge offers distinct advantages: Dimension Local Models (e.g., Qwen-3-8B) Cloud Models (e.g., GPT-4o) Latency Zero/Ultra-low: Instant token generation without network "jitter". Variable: Dependent on network stability and API traffic. Privacy Total Sovereignty: Creative data and drafts never leave the local device. Shared Risk: Data is processed on third-party servers. Cost Zero API Fees: One-time hardware investment; free to run infinite tokens. Pay-as-you-go: Costs scale with token count and frequency of calls. Availability Offline: The studio remains functional without an internet connection. Online Only: Requires a stable, high-speed connection. 2. Reasoning and Tool-Calling on the Edge To move beyond simple chat, we implement Reasoning Mode, utilizing Chain-of-Thought (CoT) prompting. This allows our local agents to "think" through the podcast structure before writing. Furthermore, we grant them "superpowers" through Tool-Calling, allowing them to execute Python functions for real-time web searches to gather the latest news. II. The Orchestration Engine: Microsoft Agent Framework The true complexity of this project lies in Agent Orchestration—the coordination of specialized agents to work as a cohesive team. We distinguish between Agents, who act as "Jazz Musicians" making flexible decisions, and Workflows, which act as the "Orchestra" following a predefined score. 1. Advanced Orchestration Patterns Drawing from the WorkshopForAgentic architecture, the studio utilizes several sophisticated patterns: Sequential: A strict pipeline where the output of the Researcher flows into the Scriptwriter. Concurrent (Parallel): Multiple agents search different news sources simultaneously to speed up data gathering. Handoff: An agent dynamically "transfers" control to another specialist based on the context of the task. Magentic-One: A high-level "Manager" agent decides which specialist should handle the next task in real-time. III. Implementation: Code Analysis (Workshop Patterns) To maintain a production-grade codebase, we follow the modular structure found in the WorkshopForAgentic/code directory. This ensures that agents, clients, and workflows are decoupled and maintainable. 1. Configuration: Connecting to Local SLMs The first step is initializing the local model client using the framework's Ollama integration. # Based on WorkshopForAgentic/code/config.py from agent_framework.ollama import OllamaChatClient # Initialize the local client for Qwen-3-8B # Standard Ollama endpoint on localhost chat_client = OllamaChatClient( model_id="qwen3:8b", endpoint="http://localhost:11434" ) 2. Agent Definition: Specialized Roles Each agent is a ChatAgent instance defined by its persona and instructions. # Based on WorkshopForAgentic/code/agents.py from agent_framework import ChatAgent # The Researcher Agent: Responsible for web discovery researcher_agent = client.create_agent( name="SearchAgent", instructions="You are my assistant. Answer the questions based on the search engine.", tools=[web_search], ) # The Scriptwriter Agent: Responsible for conversational narrative generate_script_agent = client.create_agent( name="GenerateScriptAgent", instructions=""" You are my podcast script generation assistant. Please generate a 10-minute Chinese podcast script based on the provided content. The podcast script should be co-hosted by Lucy (the host) and Ken (the expert). The script content should be generated based on the input, and the final output format should be as follows: Speaker 1: …… Speaker 2: …… Speaker 1: …… Speaker 2: …… Speaker 1: …… Speaker 2: …… """ ) 3. Workflow Setup: The Sequential Pipeline For a deterministic production line, we use the WorkflowBuilder to connect our agents. # Based on WorkshopForAgentic/code/workflow_setup.py from agent_framework import WorkflowBuilder # Building the podcast pipeline search_executor = AgentExecutor(agent=search_agent, id="search_executor") gen_script_executor = AgentExecutor(agent=gen_script_agent, id="gen_script_executor") review_executor = ReviewExecutor(id="review_executor", genscript_agent_id="gen_script_executor") # Build workflow with approval loop # search_executor -> gen_script_executor -> review_executor # If not approved, review_executor -> gen_script_executor (loop back) workflow = ( WorkflowBuilder() .set_start_executor(search_executor) .add_edge(search_executor, gen_script_executor) .add_edge(gen_script_executor, review_executor) .add_edge(review_executor, gen_script_executor) # Loop back for regeneration .build() ) IV. Multimodal Synthesis: VibeVoice Technology The "Future Bytes" podcast is brought to life using VibeVoice, a specialized technology from Microsoft Research designed for natural conversational synthesis. Conversational Rhythm: It automatically handles natural turn-taking and speech cadences. High Efficiency: By operating at an ultra-low 7.5 Hz frame rate, it significantly reduces the compute power required for high-fidelity audio. Scalability: The system supports up to 4 distinct voices and can generate up to 90 minutes of continuous audio. V. Observability and Debugging: DevUI Building multi-agent systems requires deep visibility into the agentic "thinking" process. We leverage DevUI, a specialized web interface for testing and tracing: Interactive Tracing: Developers can watch the message flow and tool-calling in real-time. Automatic Discovery: DevUI auto-discovers agents defined within the project structure. Input Auto-Generation: The UI generates input fields based on workflow requirements, allowing for rapid iteration. VI. Technical Requirements for Edge Deployment Deploying this studio locally requires specific hardware and software configurations to handle simultaneous LLM and TTS inference: Software: Python 3.10+, Ollama, and the Microsoft Agent Framework. Hardware: 16GB+ RAM is the minimum requirement; 32GB is recommended for running multiple agents and VibeVoice concurrently. Compute: A modern GPU/NPU (e.g., NVIDIA RTX or Snapdragon X Elite) is essential for smooth inference. Final Perspective: From Coding to Directing The AI Podcast Studio represents a significant shift toward Agentic Content Creation. By mastering these orchestration patterns and leveraging local EdgeAI, developers move from simply writing code to directing entire ecosystems of intelligent agents. This "local-first" model ensures that the future of creativity is private, efficient, and infinitely scalable. Download sample Here Resource EdgeAI for Beginners - https://github.com/microsoft/edgeai-for-beginners Microsoft Agent Framework - https://github.com/microsoft/agent-framework Microsoft Agent Framework Samples - https://github.com/microsoft/agent-framework-samples
kinfey
Jan 25, 2026 Place Microsoft Developer Community Blog
777Views
0likes
0Comments
Engineering a Local-First Agentic Podcast Studio: A Deep Dive into Multi-Agent Orchestration
The transition from standalone Large Language Models (LLMs) to Agentic Orchestration marks the next frontier in AI development. We are moving away from simple "prompt-and-response" cycles toward a paradigm where specialized, autonomous units—AI Agents—collaborate to solve complex, multi-step problems. As a Technology Evangelist, my focus is on building these production-grade systems entirely on the edge, ensuring privacy, speed, and cost-efficiency. This technical guide explores the architecture and implementation of The AI Podcast Studio. This project demonstrates the seamless integration of the Microsoft Agent Framework, Local Small Language Models (SLMs), and VibeVoice to automate a complete tech podcast pipeline. I. The Strategic Intelligence Layer: Why Local-First? At the core of our studio is a Local-First philosophy. While cloud-based LLMs are powerful, they introduce friction in high-frequency, creative pipelines. By using Ollama as a model manager, we run SLMs like Qwen-3-8B directly on user hardware. 1. Architectural Comparison: Local vs. Cloud Choosing the deployment environment is a fundamental architectural decision. For an agentic podcasting workflow, the edge offers distinct advantages: Dimension Local Models (e.g., Qwen-3-8B) Cloud Models (e.g., GPT-5.2) Latency Zero/Ultra-low: Instant token generation without network "jitter". Variable: Dependent on network stability and API traffic. Privacy Total Sovereignty: Creative data and drafts never leave the local device. Shared Risk: Data is processed on third-party servers. Cost Zero API Fees: One-time hardware investment; free to run infinite tokens. Pay-as-you-go: Costs scale with token count and frequency of calls. Availability Offline: The studio remains functional without an internet connection. Online Only: Requires a stable, high-speed connection. 2. Reasoning and Tool-Calling on the Edge To move beyond simple chat, we implement Reasoning Mode, utilizing Chain-of-Thought (CoT) prompting. This allows our local agents to "think" through the podcast structure before writing. Furthermore, we grant them "superpowers" through Tool-Calling, allowing them to execute Python functions for real-time web searches to gather the latest news. II. The Orchestration Engine: Microsoft Agent Framework The true complexity of this project lies in Agent Orchestration—the coordination of specialized agents to work as a cohesive team. We distinguish between Agents, who act as "Jazz Musicians" making flexible decisions, and Workflows, which act as the "Orchestra" following a predefined score. 1. Advanced Orchestration Patterns Drawing from the WorkshopForAgentic architecture, the studio utilizes several sophisticated patterns: Sequential: A strict pipeline where the output of the Researcher flows into the Scriptwriter. Concurrent (Parallel): Multiple agents search different news sources simultaneously to speed up data gathering. Handoff: An agent dynamically "transfers" control to another specialist based on the context of the task. Magentic-One: A high-level "Manager" agent decides which specialist should handle the next task in real-time. III. Implementation: Code Analysis (Workshop Patterns) To maintain a production-grade codebase, we follow the modular structure found in the WorkshopForAgentic/code directory. This ensures that agents, clients, and workflows are decoupled and maintainable. 1. Configuration: Connecting to Local SLMs The first step is initializing the local model client using the framework's Ollama integration. # Based on WorkshopForAgentic/code/config.py from agent_framework.ollama import OllamaChatClient # Initialize the local client for Qwen-3-8B # Standard Ollama endpoint on localhost chat_client = OllamaChatClient( model_id="qwen3:8b", endpoint="http://localhost:11434" ) 2. Agent Definition: Specialized Roles Each agent is a ChatAgent instance defined by its persona and instructions. # Based on WorkshopForAgentic/code/agents.py from agent_framework import ChatAgent # The Researcher Agent: Responsible for web discovery researcher_agent = client.create_agent( name="SearchAgent", instructions="You are my assistant. Answer the questions based on the search engine.", tools=[web_search], ) # The Scriptwriter Agent: Responsible for conversational narrative generate_script_agent = client.create_agent( name="GenerateScriptAgent", instructions=""" You are my podcast script generation assistant. Please generate a 10-minute Chinese podcast script based on the provided content. The podcast script should be co-hosted by Lucy (the host) and Ken (the expert). The script content should be generated based on the input, and the final output format should be as follows: Speaker 1: …… Speaker 2: …… Speaker 1: …… Speaker 2: …… Speaker 1: …… Speaker 2: …… """ ) 3. Workflow Setup: The Sequential Pipeline For a deterministic production line, we use the WorkflowBuilder to connect our agents. # Based on WorkshopForAgentic/code/workflow_setup.py from agent_framework import WorkflowBuilder # Building the podcast pipeline search_executor = AgentExecutor(agent=search_agent, id="search_executor") gen_script_executor = AgentExecutor(agent=gen_script_agent, id="gen_script_executor") review_executor = ReviewExecutor(id="review_executor", genscript_agent_id="gen_script_executor") # Build workflow with approval loop # search_executor -> gen_script_executor -> review_executor # If not approved, review_executor -> gen_script_executor (loop back) workflow = ( WorkflowBuilder() .set_start_executor(search_executor) .add_edge(search_executor, gen_script_executor) .add_edge(gen_script_executor, review_executor) .add_edge(review_executor, gen_script_executor) # Loop back for regeneration .build() ) IV. Multimodal Synthesis: VibeVoice Technology The "Future Bytes" podcast is brought to life using VibeVoice, a specialized technology from Microsoft Research designed for natural conversational synthesis. Conversational Rhythm: It automatically handles natural turn-taking and speech cadences. High Efficiency: By operating at an ultra-low 7.5 Hz frame rate, it significantly reduces the compute power required for high-fidelity audio. Scalability: The system supports up to 4 distinct voices and can generate up to 90 minutes of continuous audio. V. Observability and Debugging: DevUI Building multi-agent systems requires deep visibility into the agentic "thinking" process. We leverage DevUI, a specialized web interface for testing and tracing: Interactive Tracing: Developers can watch the message flow and tool-calling in real-time. Automatic Discovery: DevUI auto-discovers agents defined within the project structure. Input Auto-Generation: The UI generates input fields based on workflow requirements, allowing for rapid iteration. VI. Technical Requirements for Edge Deployment Deploying this studio locally requires specific hardware and software configurations to handle simultaneous LLM and TTS inference: Software: Python 3.10+, Ollama, and the Microsoft Agent Framework. Hardware: 16GB+ RAM is the minimum requirement; 32GB is recommended for running multiple agents and VibeVoice concurrently. Compute: A modern GPU/NPU (e.g., NVIDIA RTX or Snapdragon X Elite) is essential for smooth inference. Final Perspective: From Coding to Directing The AI Podcast Studio represents a significant shift toward Agentic Content Creation. By mastering these orchestration patterns and leveraging local EdgeAI, developers move from simply writing code to directing entire ecosystems of intelligent agents. This "local-first" model ensures that the future of creativity is private, efficient, and infinitely scalable. Download sample Here Resource EdgeAI for Beginners - https://github.com/microsoft/edgeai-for-beginners Microsoft Agent Framework - https://github.com/microsoft/agent-framework Microsoft Agent Framework Samples - https://github.com/microsoft/agent-framework-samples
kinfey
Jan 05, 2026 Place Microsoft Developer Community Blog
8.6KViews
3likes
0Comments
From Cloud to Chip: Building Smarter AI at the Edge with Windows AI PCs
As AI engineers, we’ve spent years optimizing models for the cloud, scaling inference, wrangling latency, and chasing compute across clusters. But the frontier is shifting. With the rise of Windows AI PCs and powerful local accelerators, the edge is no longer a constraint it’s now a canvas. Whether you're deploying vision models to industrial cameras, optimizing speech interfaces for offline assistants, or building privacy-preserving apps for healthcare, Edge AI is where real-world intelligence meets real-time performance. Why Edge AI, Why Now? Edge AI isn’t just about running models locally, it’s about rethinking the entire lifecycle: - Latency: Decisions in milliseconds, not round-trips to the cloud. - Privacy: Sensitive data stays on-device, enabling HIPAA/GDPR compliance. - Resilience: Offline-first apps that don’t break when the network does. - Cost: Reduced cloud compute and bandwidth overhead. With Windows AI PCs powered by Intel and Qualcomm NPUs and tools like ONNX Runtime, DirectML, and Olive, developers can now optimize and deploy models with unprecedented efficiency. What You’ll Learn in Edge AI for Beginners The Edge AI for Beginners curriculum is a hands-on, open-source guide designed for engineers ready to move from theory to deployment. Multi-Language Support This content is available in over 48 languages, so you can read and study in your native language. What You'll Master This course takes you from fundamental concepts to production-ready implementations, covering: Small Language Models (SLMs) optimized for edge deployment Hardware-aware optimization across diverse platforms Real-time inference with privacy-preserving capabilities Production deployment strategies for enterprise applications Why EdgeAI Matters Edge AI represents a paradigm shift that addresses critical modern challenges: Privacy & Security: Process sensitive data locally without cloud exposure Real-time Performance: Eliminate network latency for time-critical applications Cost Efficiency: Reduce bandwidth and cloud computing expenses Resilient Operations: Maintain functionality during network outages Regulatory Compliance: Meet data sovereignty requirements Edge AI Edge AI refers to running AI algorithms and language models locally on hardware, close to where data is generated without relying on cloud resources for inference. It reduces latency, enhances privacy, and enables real-time decision-making. Core Principles: On-device inference: AI models run on edge devices (phones, routers, microcontrollers, industrial PCs) Offline capability: Functions without persistent internet connectivity Low latency: Immediate responses suited for real-time systems Data sovereignty: Keeps sensitive data local, improving security and compliance Small Language Models (SLMs) SLMs like Phi-4, Mistral-7B, Qwen and Gemma are optimized versions of larger LLMs, trained or distilled for: Reduced memory footprint: Efficient use of limited edge device memory Lower compute demand: Optimized for CPU and edge GPU performance Faster startup times: Quick initialization for responsive applications They unlock powerful NLP capabilities while meeting the constraints of: Embedded systems: IoT devices and industrial controllers Mobile devices: Smartphones and tablets with offline capabilities IoT Devices: Sensors and smart devices with limited resources Edge servers: Local processing units with limited GPU resources Personal Computers: Desktop and laptop deployment scenarios Course Modules & Navigation Course duration. 10 hours of content Module Topic Focus Area Key Content Level Duration 📖 00 Introduction to EdgeAI Foundation & Context EdgeAI Overview • Industry Applications • SLM Introduction • Learning Objectives Beginner 1-2 hrs 📚 01 EdgeAI Fundamentals Cloud vs Edge AI comparison EdgeAI Fundamentals • Real World Case Studies • Implementation Guide • Edge Deployment Beginner 3-4 hrs 🧠 02 SLM Model Foundations Model families & architecture Phi Family • Qwen Family • Gemma Family • BitNET • μModel • Phi-Silica Beginner 4-5 hrs 🚀 03 SLM Deployment Practice Local & cloud deployment Advanced Learning • Local Environment • Cloud Deployment Intermediate 4-5 hrs ⚙️ 04 Model Optimization Toolkit Cross-platform optimization Introduction • Llama.cpp • Microsoft Olive • OpenVINO • Apple MLX • Workflow Synthesis Intermediate 5-6 hrs 🔧 05 SLMOps Production Production operations SLMOps Introduction • Model Distillation • Fine-tuning • Production Deployment Advanced 5-6 hrs 🤖 06 AI Agents & Function Calling Agent frameworks & MCP Agent Introduction • Function Calling • Model Context Protocol Advanced 4-5 hrs 💻 07 Platform Implementation Cross-platform samples AI Toolkit • Foundry Local • Windows Development Advanced 3-4 hrs 🏭 08 Foundry Local Toolkit Production-ready samples Sample applications (see details below) Expert 8-10 hrs Each module includes Jupyter notebooks, code samples, and deployment walkthroughs, perfect for engineers who learn by doing. Developer Highlights - 🔧 Olive: Microsoft's optimization toolchain for quantization, pruning, and acceleration. - 🧩 ONNX Runtime: Cross-platform inference engine with support for CPU, GPU, and NPU. - 🎮 DirectML: GPU-accelerated ML API for Windows, ideal for gaming and real-time apps. - 🖥️ Windows AI PCs: Devices with built-in NPUs for low-power, high-performance inference. Local AI: Beyond the Edge Local AI isn’t just about inference, it’s about autonomy. Imagine agents that: - Learn from local context - Adapt to user behavior - Respect privacy by design With tools like Agent Framework, Azure AI Foundry and Windows Copilot Studio, and Foundry Local developers can orchestrate local agents that blend LLMs, sensors, and user preferences, all without cloud dependency. Try It Yourself Ready to get started? Clone the Edge AI for Beginners GitHub repo, run the notebooks, and deploy your first model to a Windows AI PC or IoT devices Whether you're building smart kiosks, offline assistants, or industrial monitors, this curriculum gives you the scaffolding to go from prototype to production.
Lee_Stott
Dec 23, 2025 Place Microsoft Developer Community Blog
772Views
1like
2Comments
Python + IA: Resumen y Recursos
Acabamos de concluir nuestra serie sobre Python + IA, un recorrido completo de nueve sesiones donde exploramos a fondo cómo usar modelos de inteligencia artificial generativa desde Python. Durante la serie presentamos varios tipos de modelos, incluyendo LLMs, modelos de embeddings y modelos de visión. Profundizamos en técnicas populares como RAG, tool calling y salidas estructuradas. Evaluamos la calidad y seguridad de la IA mediante evaluaciones automatizadas y red-teaming. Finalmente, desarrollamos agentes de IA con frameworks populares de Python y exploramos el nuevo Model Context Protocol (MCP). Para que puedas aplicar lo aprendido, todos nuestros ejemplos funcionan con GitHub Models, un servicio que ofrece modelos gratuitos a todos los usuarios de GitHub para experimentación y aprendizaje. Aunque no hayas asistido a las sesiones en vivo, ¡todavía puedes acceder a todos los materiales usando los enlaces de abajo! Si eres instructor, puedes usar las diapositivas y el código en tus propias clases. Python + IA: Modelos de Lenguaje Grandes (LLMs) 📺 Ver grabación En esta sesión exploramos los LLMs, los modelos que impulsan ChatGPT y GitHub Copilot. Usamos Python con paquetes como OpenAI SDK y LangChain, experimentamos con prompt engineering y ejemplos few-shot, y construimos una aplicación completa basada en LLMs. También explicamos la importancia de la concurrencia y el streaming en apps de IA. Diapositivas: aka.ms/pythonia/diapositivas/llms Código: python-openai-demos Guía de repositorio: video Python + IA: Embeddings Vectoriales 📺 Ver grabación En nuestra segunda sesión, aprendemos sobre los modelos de embeddings vectoriales, que convierten texto o imágenes en arreglos numéricos. Comparamos métricas de distancia, aplicamos cuantización y experimentamos con modelos multimodales. Diapositivas: aka.ms/pythonia/diapositivas/embeddings Código: vector-embedding-demos Guía de repositorio: video Python + IA: Retrieval Augmented Generation (RAG) 📺 Ver grabación Descubrimos cómo usar RAG para mejorar las respuestas de los LLMs añadiendo contexto relevante. Construimos flujos RAG en Python con distintas fuentes (CSVs, sitios web, documentos y bases de datos) y terminamos con una aplicación completa basada en Azure AI Search. Diapositivas: aka.ms/pythonia/diapositivas/rag Código: python-openai-demos Guía de repositorio: video Python + IA: Modelos de Visión 📺 Ver grabación Los modelos de visión aceptan texto e imágenes, como GPT-4o y GPT-4o mini. Creamos una app de chat con imágenes, realizamos extracción de datos y construimos un motor de búsqueda multimodal. Diapositivas: aka.ms/pythonia/diapositivas/vision Código: vector-embeddings Guía de repositorio: video Python + IA: Salidas Estructuradas 📺 Ver grabación Aprendemos a generar respuestas estructuradas con LLMs usando Pydantic BaseModel. Este enfoque permite validación automática de los resultados, útil para extracción de entidades, clasificación y flujos de agentes. Diapositivas: aka.ms/pythonia/diapositivas/salidas Código: python-openai-demos y entity-extraction-demos Guía de repositorio: video Python + IA: Calidad y Seguridad 📺 Ver grabación Analizamos cómo usar la IA de forma segura y cómo evaluar la calidad de las respuestas. Mostramos cómo configurar Azure AI Content Safety y usar el Azure AI Evaluation SDK para medir resultados de los modelos. Diapositivas: aka.ms/pythonia/diapositivas/calidad Código: ai-quality-safety-demos Guía de repositorio: video Python + IA: Tool Calling 📺 Ver grabación Exploramos el tool calling, base para crear agentes de IA. Definimos herramientas con esquemas JSON y funciones Python, manejamos llamadas paralelas y flujos iterativos. Diapositivas: aka.ms/pythonia/diapositivas/herramientas Código: python-openai-demos Guía de repositorio: video Python + IA: Agentes de IA 📺 Ver grabación Creamos agentes de IA con frameworks como el agent-framework de Microsoft y LangGraph, mostrando arquitecturas con múltiples herramientas, supervisores y flujos con intervención humana. Diapositivas: aka.ms/pythonia/diapositivas/agentes Código: python-ai-agents-demos Guía de repositorio: video Python + IA: Model Context Protocol (MCP) 📺 Ver grabación Cerramos la serie con MCP (Model Context Protocol), la tecnología más innovadora de 2025. Mostramos cómo usar el SDK de FastMCP en Python para crear un servidor MCP local, conectarlo a GitHub Copilot, construir un cliente MCP y conectar frameworks como LangGraph y agent-framework. También discutimos los riesgos de seguridad asociados. Diapositivas: aka.ms/pythonia/diapositivas/mcp Código: python-ai-mcp-demos Guía de repositorio: video Además Si tienen preguntas, por favor, en el canal #Espanol en nuestro Discord: https://aka.ms/pythonia/discord Todos los jueves tengo office hours: https://aka.ms/pythonia/horas Encuentra más tutoriales 100% en español sobre Python + AI en https://youtube.com/@lagps
gwyneth-penas
Nov 03, 2025 Place Microsoft Developer Community Blog
194Views
0likes
0Comments
Ollama on HTTPS for SQL Server
Here is a quick procedure to deploy an Ubuntu container with Ollama and expose its API over HTTPS. The goal is to allow a fast deployment, even for those unfamiliar with Docker or Language Models, making it easy to set up an offline platform for generating embeddings and using Small Language Models This is particularly useful when testing SQL Server 2025 for fully on-premises environment use cases, since SQL Server only allows access to HTTPS endpoints. However, HTTP remains open for testing purposes. Please note that this example is CPU-based, as deploying with (integrated) GPU support involves additional, less straightforward steps. This example is provided solely to illustrate the concept, is not intended for production use, and comes without any guarantee of performance or security. Prerequisites To continue, you need to have Docker Desktop, WSL and SQL Server 2025 (currently Release Candidate 1) Docker Desktop Install WSL | Microsoft Learn SQL Server 2025 Preview | Microsoft Evaluation Center Create a Dockerfile First, create a working directory. In this example, C:\Docker\Ollama will be used. Simply create a file named Dockerfile (without an extension) and paste the following content into it. FROM ubuntu:25.10 RUN apt update && apt install -y curl gnupg2 ca-certificates lsb-release apt-transport-https software-properties-common unzip nano openssl net-tools RUN curl -fsSL https://ollama.com/install.sh | bash RUN curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg RUN curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | tee /etc/apt/sources.list.d/caddy-stable.list RUN apt update && apt install -y caddy RUN mkdir -p /etc/caddy/certs RUN cat > /etc/caddy/certs/san.cnf <<EOF [req] default_bits = 2048 prompt = no default_md = sha256 req_extensions = req_ext distinguished_name = dn [dn] CN = 127.0.0.1 [req_ext] subjectAltName = @alt_names [alt_names] IP.1 = 127.0.0.1 DNS.1 = localhost EOF RUN openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /etc/caddy/certs/localhost.key -out /etc/caddy/certs/localhost.crt -config /etc/caddy/certs/san.cnf -extensions req_ext RUN echo "https://:443 {\n tls /etc/caddy/certs/localhost.crt /etc/caddy/certs/localhost.key\n reverse_proxy localhost:11434\n}" >> /etc/caddy/Caddyfile RUN echo "#!/bin/bash" > /usr/local/bin/entrypoint.sh && \ echo "set -e" >> /usr/local/bin/entrypoint.sh && \ echo "OLLAMA_HOST=0.0.0.0 ollama serve >> /var/log/ollama.log 2>&1 &" >> /usr/local/bin/entrypoint.sh && \ echo "caddy run --config /etc/caddy/Caddyfile --adapter caddyfile >> /var/log/caddy.log 2>&1 &" >> /usr/local/bin/entrypoint.sh && \ echo "tail -f /var/log/ollama.log /var/log/caddy.log" >> /usr/local/bin/entrypoint.sh && \ chmod 755 /usr/local/bin/entrypoint.sh ENTRYPOINT ["/usr/local/bin/entrypoint.sh"] For your information, this file allows the creation of an image based on Ubuntu 25.10 and includes: Ollama, for running the models Caddy, for the reverse proxy Creation of a certificate for the HTTPS endpoint on localhost Create the container After opening a Powershell terminal, execute the following commands: cd C:\Docker\Ollama #Build the image from the Dockerfile. docker build -t ollama-https . #Create a container based on the image ollama-https docker run --name ollama-https -d -it -p 443:443 -p 11434:11434 ollama-https #Copy the certificate created into the current Windows directory docker cp ollama-https:/etc/caddy/certs/localhost.crt . # Install the certificate in Trusted Root Certification Authorities Import-Certificate -FilePath "localhost.crt" -CertStoreLocation "Cert:\LocalMachine\Root" #Check Https (wget https://localhost).Content #Check Http (wget http://localhost:11434).Content Ollama is now running With a browser, connect to https://localhost Retrieve Models No model is retrieved when the image is created, as this depends on each use case, and for some models, the size can be substantial. Here’s a quick example for pulling an embedding model, Nomic, and a small language model, Phi3. Ollama Search docker exec ollama-https ollama pull nomic-embed-text docker exec ollama-https ollama pull phi3:mini A quick example with SQL Server 2025 A quick demonstration using the WideWorldImporters database (Wide World Importers sample database) use [master] GO ALTER DATABASE WideWorldImporters SET COMPATIBILITY_LEVEL = 170 WITH ROLLBACK IMMEDIATE GO DBCC TRACEON(466, 474, 13981, -1) GO Note: With RC1, you can use the PREVIEW_FEATURES database-scoped configuration T-SQL Declare an external model for embeddings. use [WideWorldImporters] GO CREATE EXTERNAL MODEL NomicLocal AUTHORIZATION dbo WITH ( LOCATION = 'https://localhost/api/embed', API_FORMAT = 'ollama', MODEL_TYPE = EMBEDDINGS, MODEL = 'nomic-embed-text' ) to enable semantic search capabilities on StockItems, we will create a dedicated table to store embeddings (no chunking in this example) along with a vector index optimized for cosine similarity use [WideWorldImporters] GO CREATE TABLE [Warehouse].[StockItemsEmbedding](StockItemEmbeddingID int identity (1,1) PRIMARY KEY, StockItemId int, SearchDetails nvarchar(max), Embedding vector(768)) GO INSERT INTO [Warehouse].[StockItemsEmbedding] SELECT si.StockItemID, si.SearchDetails, AI_GENERATE_EMBEDDINGS(si.SearchDetails USE MODEL NomicLocal) /* Generate embeddings from declared external model */ FROM [Warehouse].[StockItems] si GO /* Check */ SELECT * FROM [Warehouse].[StockItemsEmbedding] GO CREATE VECTOR INDEX IXV_1 ON [Warehouse].[StockItemsEmbedding] (Embedding) WITH (METRIC = 'cosine', TYPE = 'DiskANN') GO /* User Input */ DECLARE @UserInput varchar(max) = 'Which product is best suited for shipping small items?' /* and Generate embeddings for user input */ DECLARE @UserInputV vector(768) = AI_GENERATE_EMBEDDINGS(@UserInput USE MODEL NomicLocal) DECLARE @ModelInput nvarchar(max) DECLARE Payload nvarchar(max) DECLARE Response nvarchar(max) /* Similarity Search on StockItems and Model Input creation*/ SELECT @ModelInput = STRING_AGG('ProductDetails: ' + sie.SearchDetails + 'UnitPrice: ' + CAST(si.UnitPrice AS nvarchar(max)), ' \n\n') FROM VECTOR_SEARCH( TABLE = [Warehouse].[StockItemsEmbedding] as sie, COLUMN = Embedding, SIMILAR_TO = @UserInputV, METRIC = 'cosine', TOP_N = 10 ) JOIN [Warehouse].[StockItems] si ON si.StockItemId = sie.StockItemId /* Generate payload for response generation */ SELECT = '{"model": "phi3:mini", "stream": false, "prompt":"You are acting as a customer advisor responsible for recommending the most suitable products based on customer needs, providing clear and personalized suggestions. Question : ' + @UserInput + '\n\nList of Items : ' + @ModelInput + '"}'; EXECUTE sp_invoke_external_rest_endpoint @url = 'https://localhost/api/generate', @method = 'POST', = , @timeout = 230, = OUTPUT; PRINT JSON_VALUE(@response, '$.result.response') LangChain You can also have a try with LangChain. Same demo with a small difference, there is no vector index created on the vector store table. The table has been modified, but only for demonstration purposes. Reference: SQLServer | 🦜️🔗 LangChain # PREREQ #sudo apt-get update && sudo apt-get install -y unixodbc # sudo apt-get update # sudo apt-get install -y curl gnupg2 # curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add - # curl https://packages.microsoft.com/config/debian/11/prod.list | sudo tee /etc/apt/sources.list.d/mssql-release.list # sudo apt-get update # sudo ACCEPT_EULA=Y apt-get install -y msodbcsql18 # pip3 install langchain langchain-sqlserver langchain-ollama langchain-community import pyodbc from langchain_sqlserver import SQLServer_VectorStore from langchain_ollama import OllamaEmbeddings from langchain_ollama import ChatOllama from langchain.schema import Document from langchain_community.vectorstores.utils import DistanceStrategy #Prompt for testing _USER_INPUT = 'Which product is best suited for shipping small items?' ############### Params ########################################## print("\033[93mSetting up variables...\033[0m") _SQL_DRIVER = "ODBC Driver 18 for SQL Server" _SQL_SERVER = "localhost\\SQL2K25" _SQL_DATABASE = "WideWorldImporters" _SQL_USERNAME = "lc" _SQL_PASSWORD = "lc" _SQL_TRUST_CERT = "yes" _SQL_VECTOR_STORE_TABLE = "StockItem_VectorStore" # Table name for vector storage _MODIFY_TABLE_TO_USE_SQL_VECTOR_INDEX = True #As vector index not considered currently in langchain and structure does not match vector index requirements _CONNECTION_STRING = f"Driver={{{_SQL_DRIVER}}};Server={_SQL_SERVER};Database={_SQL_DATABASE};UID={_SQL_USERNAME};PWD={_SQL_PASSWORD};TrustServerCertificate={_SQL_TRUST_CERT}" _OLLAMA_API_URL = "https://localhost" _OLLAMA_EMBEDDING_MODEL = "nomic-embed-text:latest" _OLLAMA_EMBEDDING_VECTOR_SIZE = 768 _OLLAMA_SLM_MODEL = "phi3:mini" # Model for SLM queries ################################################################### #Define Ollama embeddings embeddings = OllamaEmbeddings( model=_OLLAMA_EMBEDDING_MODEL, base_url=_OLLAMA_API_URL ) conn = pyodbc.connect(_CONNECTION_STRING) cursor = conn.cursor() #Drop embeddings table if it exists print("\033[93mDropping existing vector store table if it exists...\033[0m") cursor.execute(f"DROP TABLE IF EXISTS Warehouse.{_SQL_VECTOR_STORE_TABLE};") print("\033[93mConnecting to SQL Server and fetching data...\033[0m") cursor.execute("SELECT StockItemId, SearchDetails, UnitPrice FROM Warehouse.StockItems;") rows = cursor.fetchall() print(f"\033[93mFound {len(rows)} records to process\033[0m") # Create documents from the fetched data documents = [ Document( page_content=row.SearchDetails, metadata={ "StockItemId": row.StockItemId, "UnitPrice": float(row.UnitPrice) # Convert Decimal to float } ) for row in rows ] conn.commit() #Creating vector store print("\033[93mCreating vector store...\033[0m") vector_store = SQLServer_VectorStore( connection_string=_CONNECTION_STRING, distance_strategy=DistanceStrategy.COSINE, # If not provided, defaults to COSINE embedding_function=embeddings, embedding_length=_OLLAMA_EMBEDDING_VECTOR_SIZE, db_schema = "Warehouse", table_name=_SQL_VECTOR_STORE_TABLE ) print("\033[93mAdding to vector store...\033[0m") try: vector_store.add_documents(documents) print("\033[93mSuccessfully added to vector store!\033[0m") except Exception as e: print(f"\033[91mError adding documents: {e}\033[0m") #Vector index not yet integrated in SQL Server VectorStore (drop auto-created nonclustered PK and generating int clustered PK if (_MODIFY_TABLE_TO_USE_SQL_VECTOR_INDEX): print("\033[93mModifying structure to create vector index...\033[0m") cursor.execute("DECLARE @AutoCreatedPK sysname, @SQL nvarchar(max);" f"SELECT @AutoCreatedPK = name FROM sys.key_constraints WHERE type = 'PK' AND parent_object_id = object_id('Warehouse.{_SQL_VECTOR_STORE_TABLE}');" f"SELECT @SQL = 'ALTER TABLE Warehouse.{_SQL_VECTOR_STORE_TABLE} DROP CONSTRAINT ' + @AutoCreatedPK + ';'" "EXEC sp_executesql @SQL;" f"ALTER TABLE Warehouse.{_SQL_VECTOR_STORE_TABLE} ADD Alt_Id int identity(1,1);" f"ALTER TABLE Warehouse.{_SQL_VECTOR_STORE_TABLE} ADD CONSTRAINT PK_{_SQL_VECTOR_STORE_TABLE} PRIMARY KEY (Alt_Id);") conn.commit() print("\033[93mCreating vector index...\033[0m") cursor.execute(f"CREATE VECTOR INDEX IV_{_SQL_VECTOR_STORE_TABLE} ON [Warehouse].[{_SQL_VECTOR_STORE_TABLE}] (embeddings) WITH (METRIC = 'cosine', TYPE = 'DiskANN');") conn.commit() #Generate prompt then answer print(f"\033[92mUser Input: {_USER_INPUT}\033[0m") context = [ { "Item": doc.page_content, "UnitPrice": doc.metadata.get("UnitPrice", None) } for doc in vector_store.similarity_search(_USER_INPUT, k=3) ] llm = ChatOllama(model=_OLLAMA_SLM_MODEL,base_url=_OLLAMA_API_URL) prompt = ( f"You are acting as a customer advisor responsible for recommending the most suitable products based on customer needs, providing clear and personalized suggestions" f"Context: {context}\n\nQuestion: {_USER_INPUT}\n\n") response = llm.invoke(prompt) print(f"\033[36m{response.content}\033[0m") Note : If using devcontainer with VSCode add "runArgs": [ "--network=host" ] to devcontainer.json to allow connections to “localhost”. Import and install the previously created certificat docker cp C:\Docker\Ollama\localhost.crt <devcontainer name>:/usr/local/share/ca-certificates/localhost.crt docker exec <devcontainer name> "update-ca-certificates" Disclaimer The sample scripts are not supported under any Microsoft standard support program or service. The sample scripts are provided AS IS without warranty of any kind. Microsoft further disclaims all implied warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose. The entire risk arising out of the use or performance of the sample scripts and documentation remains with you. In no event shall Microsoft, its authors, or anyone else involved in the creation, production, or delivery of the scripts be liable for any damages whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the sample scripts or documentation, even if Microsoft has been advised of the possibility of such damages.
RyadB
Nov 03, 2025 Place Core Infrastructure and Security Blog
795Views
0likes
0Comments
¡Curso oficial y gratuito de GenAI y Python! 🚀
¿Quieres aprender a usar modelos de IA generativa en tus aplicaciones de Python?Estamos organizando una serie de nueve transmisiones en vivo, en inglés y español, totalmente dedicadas a la IA generativa. Vamos a cubrir modelos de lenguaje (LLMs), modelos de embeddings, modelos de visión, y también técnicas como RAG, function calling y structured outputs. Además, te mostraremos cómo construir Agentes y servidores MCP, y hablaremos sobre seguridad en IA y evaluaciones, para asegurarnos de que tus modelos y aplicaciones generen resultados seguros. 🔗 Regístrate para toda la serie. Además de las transmisiones en vivo, puedes unirte a nuestras office hours semanales en el AI Foundry Discord de para hacer preguntas que no se respondan durante el chat. ¡Nos vemos en los streams! 👋🏻 Here’s your HTML converted into clean, readable text format (perfect for a newsletter, blog post, or social media caption): Modelos de Lenguaje 📅 7 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor ¡Únete a la primera sesión de nuestra serie de Python + IA! En esta sesión, hablaremos sobre los Modelos de Lenguaje (LLMs), los modelos que impulsan ChatGPT y GitHub Copilot. Usaremos Python para interactuar con LLMs utilizando paquetes como el SDK de OpenAI y Langchain. Experimentaremos con prompt engineering y ejemplos few-shot para mejorar los resultados. También construiremos una aplicación full stack impulsada por LLMs y explicaremos la importancia de la concurrencia y el streaming en apps de IA orientadas al usuario. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Embeddings Vectoriales 📅 8 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la segunda sesión de Python + IA, exploraremos los embeddings vectoriales, una forma de codificar texto o imágenes como arrays de números decimales. Estos modelos permiten realizar búsquedas por similitud en distintos tipos de contenido. Usaremos modelos como la serie text-embedding-3 de OpenAI, visualizaremos resultados en Python y compararemos métricas de distancia. También veremos cómo aplicar cuantización y cómo usar modelos multimodales de embedding. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Recuperación-Aumentada Generación (RAG) 📅 9 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la tercera sesión, exploraremos RAG, una técnica que envía contexto al LLM para obtener respuestas más precisas dentro de un dominio específico. Usaremos distintas fuentes de datos —CSVs, páginas web, documentos, bases de datos— y construiremos una app RAG full-stack con Azure AI Search. Modelos de Visión 📅 14 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor ¡La cuarta sesión trata sobre modelos de visión como GPT-4o y 4o-mini! Estos modelos pueden procesar texto e imágenes, generando descripciones, extrayendo datos, respondiendo preguntas o clasificando contenido. Usaremos Python para enviar imágenes a los modelos, crear una app de chat con imágenes e integrarlos en flujos RAG. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Salidas Estructuradas 📅 15 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la quinta sesión aprenderemos a hacer que los LLMs generen respuestas estructuradas según un esquema. Exploraremos el modo structured outputs de OpenAI y cómo aplicarlo para extracción de entidades, clasificación y flujos con agentes. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Calidad y Seguridad 📅 16 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la sexta sesión hablaremos sobre cómo usar IA de manera segura y evaluar la calidad de las salidas. Mostraremos cómo configurar Azure AI Content Safety, manejar errores en código Python y evaluar resultados con el SDK de Evaluación de Azure AI. Tool Calling 📅 21 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor En la última semana de la serie, nos enfocamos en tool calling (function calling), la base para construir agentes de IA. Aprenderemos a definir herramientas en Python o JSON, manejar respuestas de los modelos y habilitar llamadas paralelas y múltiples iteraciones. 👉 Si querés seguir los ejemplos en vivo, asegurate de tener una cuenta de GitHub. Agentes de IA 📅 22 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor ¡En la penúltima sesión construiremos agentes de IA! Usaremos frameworks como Langgraph, Semantic Kernel, Autogen, y Pydantic AI. Empezaremos con ejemplos simples y avanzaremos a arquitecturas más complejas como round-robin, supervisor, graphs y ReAct. Model Context Protocol (MCP) 📅 23 de octubre, 2025 | 10:00 PM - 11:00 PM (UTC) 🔗 Regístrate para la transmisión en Reactor Cerramos la serie con Model Context Protocol (MCP), la tecnología abierta más candente de 2025. Aprenderás a usar FastMCP para crear un servidor MCP local y conectarlo a chatbots como GitHub Copilot. También veremos cómo integrar MCP con frameworks de agentes como Langgraph, Semantic Kernel y Pydantic AI. Y, por supuesto, hablaremos sobre los riesgos de seguridad y las mejores prácticas para desarrolladores. ¿Querés que lo reformatee para publicación en Markdown (para blogs o repos) o en texto plano con emojis y separadores estilo redes sociales?
gwyneth-penas
Oct 06, 2025 Place Microsoft Developer Community Blog
239Views
1like
1Comment
Using DeepSeek-R1 on Azure with JavaScript
The pace at which innovative AI models are being developed is outstanding! DeepSeek-R1 is one such model that focuses on complex reasoning tasks, providing a powerful tool for developers to build intelligent applications. The week, we announced its availability on GitHub Models as well as on Azure AI Foundry. In this article, we’ll take a look at how you can deploy and use the DeepSeek-R1 models in your JavaScript applications. TL;DR key takeaways DeepSeek-R1 models focus on complex reasoning tasks, and is not designed for general conversation You can quickly switch your configuration to use Azure AI, GitHub Models, or even local models with Ollama. You can use OpenAI Node SDK or LangChain.js to interact with DeepSeek models. What you'll learn here Deploying DeepSeek-R1 model on Azure. Switching between Azure, GitHub Models, or local (Ollama) usage. Code patterns to start using DeepSeek-R1 with various libraries in TypeScript. Reference links DeepSeek on Azure - JavaScript demos repository Azure AI Foundry OpenAI Node SDK LangChain.js Ollama Requirements GitHub account. If you don't have one, you can create a free GitHub account. You can optionally use GitHub Copilot Free to help you write code and ship your application even faster. Azure account. If you're new to Azure, get an Azure account for free to get free Azure credits to get started. If you're a student, you can also get free credits with Azure for Students. Getting started We'll use GitHub Codespaces to get started quickly, as it provides a preconfigured Node.js environment for you. Alternatively, you can set up a local environment using the instructions found in the GitHub repository. Click on the button below to open our sample repository in a web-based VS Code, directly in your browser: Once the project is open, wait a bit to ensure everything has loaded correctly. Open a terminal and run the following command to install the dependencies: npm install Running the samples The repository contains several TypeScript files under the samples directory that demonstrate how to interact with DeepSeek-R1 models. You can run a sample using the following command: npx tsx samples/<sample>.ts For example, let's start with the first one: npx tsx samples/01-chat.ts Wait a bit, and you should see the response from the model in your terminal. You'll notice that it may take longer than usual to respond, and see a weird response that starts with a <think> tag. This is because DeepSeek-R1 is designed to be used for task that need complex reasoning, like solving problems or answering math questions, and not for you usual chat interactions. Model configuration By default, the repository is configured to use GitHub Models, so you can run any example using Codespaces without any additional setup. While it's great for quick experimentation, GitHub models limit the number of requests you can make in a day and the amount of data you can send in a single request. If you want to use the model more extensively, you can switch to Azure AI or even use a local model with Ollama. You can take a look at the samples/config.ts to see how the different configurations are set up. We'll not cover using Ollama models in this article, but you can find more information in the repository documentation. Deploying DeepSeek-R1 on Azure To experiment with the full capabilities of DeepSeek-R1, you can deploy it on Azure AI Foundry. Azure AI Foundry is a platform that allows you to deploy, manage and develop with AI models quickly. To use Azure AI Foundry, you need to have an Azure account. Let's start by deploying the model on Azure AI Foundry. First, follow this tutorial to deploy a serverless endpoint with the model. When it's time to choose the model, make sure to select the DeepSeek-R1 model in the catalog. Once your endpoint is deployed, you should be able to see your endpoint details and retrieve the URL and API key: Screenshot showing the endpoint details in Azure AI Foundry Then create a .env file in the root of the project and add the following content: AZURE_AI_BASE_URL="https://<your-deployment-name>.<region>.models.ai.azure.com/v1" AZURE_AI_API_KEY="<your-api-key>" Tip: if you're copying the endpoint from the Azure AI Foundry portal, make sure to add the /v1 at the end of the URL. Open the samples/config.ts file and update the default export to use Azure: export default AZURE_AI_CONFIG; Now all samples will use the Azure configuration. Explore reasoning with DeepSeek-R1 Now that you have the model deployed, you can start experimenting with it. Open the samples/08-reasoning.ts file to see how the model handles more complex tasks, like helping us understand a well-known weird piece of code. const prompt = ` float fast_inv_sqrt(float number) { long i; float x2, y; const float threehalfs = 1.5F; x2 = number * 0.5F; y = number; i = *(long*)&y; i = 0x5f3759df - ( i >> 1 ); y = *(float*)&i; y = y * ( threehalfs - ( x2 * y * y ) ); return y; } What is this code doing? Explain me the magic behind it. `; Now run this sample with the command: npx tsx samples/08-reasoning.ts You should see the model's response streaming piece by piece in the terminal, while describing its thought process before providing the actual answer to our question. Screenshot showing the model's response streaming in the terminal Brace yourself, as it might take a while to get the full response! At the end of the process, you should see the model's detailed explanation of the code, along with some context around it. Leveraging frameworks Most examples in this repository are built with the OpenAI Node SDK, but you can also use LangChain.js to interact with the model. This might be especially interested if you need to integrate other sources of data or want to build a more complex application. Open the file samples/07-langchain.ts to have a look at the setup, and see how you can reuse the same configuration we used with the OpenAI SDK. Going further Now it's your turn to experiment and discover the full potential of DeepSeek-R1! You can try more advanced prompts, integrate it into your larger application, or even build agents to make the most out of the model. To continue your learning journey, you can check out the following resources: Generative AI with JavaScript (GitHub): code samples and resources to learn Generative AI with JavaScript. Build a serverless AI chat with RAG using LangChain.js (GitHub): a next step code example to build an AI chatbot using Retrieval-Augmented Generation and LangChain.js.
sinedied
Jun 02, 2025 Place Microsoft Developer Community Blog
339Views
0likes
0Comments
Step-by-step: Integrate Ollama Web UI to use Azure Open AI API with LiteLLM Proxy
Introductions Ollama WebUI is a streamlined interface for deploying and interacting with open-source large language models (LLMs) like Llama 3 and Mistral, enabling users to manage models, test them via a ChatGPT-like chat environment, and integrate them into applications through Ollama’s local API. While it excels for self-hosted models on platforms like Azure VMs, it does not natively support Azure OpenAI API endpoints—OpenAI’s proprietary models (e.g., GPT-4) remain accessible only through OpenAI’s managed API. However, tools like LiteLLM bridge this gap, allowing developers to combine Ollama-hosted models with OpenAI’s API in hybrid workflows, while maintaining compliance and cost-efficiency. This setup empowers users to leverage both self-managed open-source models and cloud-based AI services. Problem Statement As of February 2025, Ollama WebUI, still do not support Azure Open AI API. The Ollama Web UI only support self-hosted Ollama API and managed OpenAI API service (PaaS). This will be an issue if users want to use Open AI models they already deployed on Azure AI Foundry. Objective To integrate Azure OpenAI API via LiteLLM proxy into with Ollama Web UI. LiteLLM translates Azure AI API requests into OpenAI-style requests on Ollama Web UI allowing users to use OpenAI models deployed on Azure AI Foundry. If you haven’t hosted Ollama WebUI already, follow my other step-by-step guide to host Ollama WebUI on Azure. Proceed to the next step if you have Ollama WebUI deployed already. Step 1: Deploy OpenAI models on Azure Foundry. If you haven’t created an Azure AI Hub already, search for Azure AI Foundry on Azure, and click on the “+ Create” button > Hub. Fill out all the empty fields with the appropriate configuration and click on “Create”. After the Azure AI Hub is successfully deployed, click on the deployed resources and launch the Azure AI Foundry service. To deploy new models on Azure AI Foundry, find the “Models + Endpoints” section on the left hand side and click on “+ Deploy Model” button > “Deploy base model” A popup will appear, and you can choose which models to deploy on Azure AI Foundry. Please note that the o-series models are only available to select customers at the moment. You can request access to the o-series models by completing this request access form, and wait until Microsoft approves the access request. Click on “Confirm” and another popup will emerge. Now name the deployment and click on “Deploy” to deploy the model. Wait a few moments for the model to deploy. Once it successfully deployed, please save the “Target URI” and the API Key. Step 2: Deploy LiteLLM Proxy via Docker Container Before pulling the LiteLLM Image into the host environment, create a file named “litellm_config.yaml” and list down the models you deployed on Azure AI Foundry, along with the API endpoints and keys. Replace "API_Endpoint" and "API_Key" with “Target URI” and “Key” found from Azure AI Foundry respectively. Template for the “litellm_config.yaml” file. model_list: - model_name: [model_name] litellm_params: model: azure/[model_name_on_azure] api_base: "[API_ENDPOINT/Target_URI]" api_key: "[API_Key]" api_version: "[API_Version]" Tips: You can find the API version info at the end of the Target URI of the model's endpoint: Sample Endpoint - https://example.openai.azure.com/openai/deployments/o1-mini/chat/completions?api-version=2024-08-01-preview Run the docker command below to start LiteLLM Proxy with the correct settings: docker run -d \ -v $(pwd)/litellm_config.yaml:/app/config.yaml \ -p 4000:4000 \ --name litellm-proxy-v1 \ --restart always \ ghcr.io/berriai/litellm:main-latest \ --config /app/config.yaml --detailed_debug Make sure to run the docker command inside the directory where you created the “litellm_config.yaml” file just now. The port used to listen for LiteLLM Proxy traffic is port 4000. Now that LiteLLM proxy had been deployed on port 4000, lets change the OpenAI API settings on Ollama WebUI. Navigate to Ollama WebUI’s Admin Panel settings > Settings > Connections > Under the OpenAI API section, write http://127.0.0.1:4000 as the API endpoint and set any key (You must write anything to make it work!). Click on “Save” button to reflect the changes. Refresh the browser and you should be able to see the AI models deployed on the Azure AI Foundry listed in the Ollama WebUI. Now let’s test the chat completion + Web Search capability using the "o1-mini" model on Ollama WebUI. Conclusion Hosting Ollama WebUI on an Azure VM and integrating it with OpenAI’s API via LiteLLM offers a powerful, flexible approach to AI deployment, combining the cost-efficiency of open-source models with the advanced capabilities of managed cloud services. While Ollama itself doesn’t support Azure OpenAI endpoints, the hybrid architecture empowers IT teams to balance data privacy (via self-hosted models on Azure AI Foundry) and cutting-edge performance (using Azure OpenAI API), all within Azure’s scalable ecosystem. This guide covers every step required to deploy your OpenAI models on Azure AI Foundry, set up the required resources, deploy LiteLLM Proxy on your host machine and configure Ollama WebUI to support Azure AI endpoints. You can test and improve your AI model even more with the Ollama WebUI interface with Web Search, Text-to-Image Generation, etc. all in one place.
suzarilshah
May 25, 2025 Place Educator Developer Blog
11KViews
1like
4Comments
MVP’s Favorite Content from Microsoft Azure MVPs
MVP’s Favorite Content is back! In this special Global Azure edition, we’re excited to share recommended articles from four Azure MVPs from around the world.
RieMoriguchi
May 11, 2025 Place Microsoft MVP Program Blog
465Views
1like
2Comments
Build AI Agents with MCP Tool Use in Minutes with AI Toolkit for VSCode
We’re excited to announce Agent Builder, the newest evolution of what was formerly known as Prompt Builder, now reimagined and supercharged for intelligent app development. This powerful tool in AI Toolkit enables you to create, iterate, and optimize agents—from prompt engineering to tool integration—all in one seamless workflow. Whether you're designing simple chat interactions or complex task-performing agents with tool access, Agent Builder simplifies the journey from idea to integration. Why Agent Builder? Agent Builder is designed to empower developers and prompt engineers to: 🚀 Generate starter prompts with natural language 🔁 Iterate and refine prompts based on model responses 🧩 Break down tasks with prompt chaining and structured outputs 🧪 Test integrations with real-time runs and tool use such as MCP servers 💻 Generate production-ready code for rapid app development And a lot of features are coming soon, stay tuned for: 📝 Use variables in prompts �� Run agent with test cases to test your agent easily 📊 Evaluate the accuracy and performance of your agent with built-in or your custom metrics ☁️ Deploy your agent to cloud Build Smart Agents with Tool Use (MCP Servers) Agents can now connect to external tools through MCP (Model Control Protocol) servers, enabling them to perform real-world actions like querying a database, accessing APIs, or executing custom logic. Connect to an Existing MCP Server To use an existing MCP server in Agent Builder: In the Tools section, select + MCP Server. Choose a connection type: Command (stdio) – run a local command that implements the MCP protocol HTTP (server-sent events) – connect to a remote server implementing the MCP protocol If the MCP server supports multiple tools, select the specific tool you want to use. Enter your prompts and click Run to test the agent's interaction with the tool. This integration allows your agents to fetch live data or trigger custom backend services as part of the conversation flow. Build and Scaffold a New MCP Server Want to create your own tool? Agent Builder helps you scaffold a new MCP server project: In the Tools section, select + MCP Server. Choose MCP server project. Select your preferred programming language: Python or TypeScript. Pick a folder to create your server project. Name your project and click Create. Agent Builder generates a scaffolded implementation of the MCP protocol that you can extend. Use the built-in VS Code debugger: Press F5 or click Debug in Agent Builder Test with prompts like: System: You are a weather forecast professional that can tell weather information based on given location. User: What is the weather in Shanghai? Agent Builder will automatically connect to your running server and show the response, making it easy to test and refine the tool-agent interaction. AI Sparks from Prototype to Production with AI Toolkit Building AI-powered applications from scratch or infusing intelligence into existing systems? AI Sparks is your go-to webinar series for mastering the AI Toolkit (AITK) from foundational concepts to cutting-edge techniques. In this bi-weekly, hands-on series, we’ll cover: 🚀SLMs & Local Models – Test and deploy AI models and applications efficiently on your own terms locally, to edge devices or to the cloud 🔍 Embedding Models & RAG – Supercharge retrieval for smarter applications using existing data. 🎨 Multi-Modal AI – Work with images, text, and beyond. 🤖 Agentic Frameworks – Build autonomous, decision-making AI systems. Watch on Demand Share your feedback  Get started with the latest version, share your feedback, and let us know how these new features help you in your AI development journey. As always, we’re here to listen, collaborate, and grow alongside our amazing user community.  Thank you for being a part of this journey—let’s build the future of AI together! Join our Microsoft Azure AI Foundry Discord channel to continue the discussion 🚀
junjieli
Apr 29, 2025 Place Microsoft Developer Community Blog
4.5KViews
1like
0Comments