azure cognitive services
20 Topics
Compose for Agents on Azure Container Apps and Serverless GPU (public preview)
Empowering intelligent applications

The next wave of AI is agentic – systems that can reason, plan, and act on our behalf. Whether you’re building a virtual assistant that books travel or a multi‑model workflow that triages support tickets, these applications rely on multiple models, tools, and services working together. Unfortunately, building them has not been easy:

Tooling sprawl. Developers must wire together LLMs, vector databases, MCP (Model Context Protocol) tools and orchestration logic, often across disparate SDKs and running processes. Keeping those dependencies in sync for local development and production is tedious and error‑prone.

Specialized hardware. Large language models and agent orchestration frameworks often require GPUs to run effectively. Procuring and managing GPU instances can be costly, particularly for prototypes and small teams.

Operational complexity. Agentic applications are typically composed of many services. Scaling them, managing health and secure connectivity, and reproducing the same environment from a developer laptop into production quickly becomes a full‑time job.

Why Azure Container Apps is the right home

With Azure Container Apps (ACA), you can now tackle these challenges without sacrificing the familiar Docker Compose workflow that so many developers love. We’re excited to announce that Compose for Agents is in public preview on Azure Container Apps. This integration brings the power of Docker’s new agentic tooling to a platform that was built for serverless containers. Here’s why ACA is the natural home for agentic workloads:

Serverless GPUs with per‑second billing. Container Apps offers serverless GPU compute. Your agentic workloads can run on GPUs only when they need to, and you only pay for the seconds your container is actually running. This makes it economical to prototype and scale complex models without upfront infrastructure commitments.
Media reports on the preview note that Docker’s Offload service uses remote GPUs via cloud providers such as Microsoft to overcome local hardware limits, and ACA brings that capability directly into the Azure native experience.

Sandboxed dynamic sessions for tools. Many agentic frameworks execute user‑provided code as part of their workflows. ACA’s dynamic sessions provide secure, short‑lived sandboxes for running these tasks. This means untrusted or transient code (for example, evaluation scripts or third‑party plugins) runs in an isolated environment, keeping your production services safe.

Fully managed scaling and operations. Container Apps automatically scales each service based on traffic and queue length, and it can scale down to zero when idle. You get built‑in service discovery, ingress, rolling updates and revision management without having to operate your own orchestrator. Developers can focus on building agents rather than patching servers.

First‑class Docker Compose support. Compose remains a favourite tool for developers’ inner loop and for orchestrating multi‑container systems. Compose for Agents extends the format to declare open‑source models, agents and tools alongside your microservices. By pointing docker compose up at ACA, the same YAML file you use locally now deploys automatically to a fully managed container environment.

Model Runner and MCP Gateway built in. Docker’s Model Runner lets you pull open‑weight language models from Docker Hub and exposes them via OpenAI‑compatible endpoints, and the MCP (Model Context Protocol) Gateway connects your agents to curated tools. ACA integrates these components into your Compose stack, giving you everything you need for retrieval‑augmented generation, vector search or domain‑specific tool invocation.

What this means for developers

The Compose for Agents public preview on Container Apps brings together the simplicity of Docker Compose and the operational power of Azure’s serverless compute platform.
Developers can now:

Define agent stacks declaratively. Instead of cobbling together scripts, you describe your entire agentic application in a single compose.yaml file. Compose already supports popular frameworks like LangGraph, Embabel, Vercel AI SDK, Spring AI, CrewAI, Google ADK and Agno. You can mix and match these frameworks with your own microservices, databases and queues.

Run anywhere with the same configuration. Docker emphasizes that you can “define your open models, agents and MCP‑compatible tools, then spin up your full agentic stack with a simple docker compose up”. By bringing this workflow to ACA, Microsoft ensures that the same compose file runs unchanged on your laptop and in the cloud.

Scale seamlessly. Large language models and multi‑agent orchestration can be compute‑intensive. News coverage notes that Docker’s Offload service provides remote GPUs for these workloads. ACA extends that capability with serverless GPUs and automated scaling, letting you test locally and then burst to the cloud with no changes to your YAML.

Collaboration with Docker

This preview is the result of close collaboration between Microsoft and Docker. “Docker has always been focused on simplifying complex developer workflows. With Compose for Agents, we’re extending that same experience that developers know and love from containers to agents, bringing the power of Compose to the emerging world of AI-native, agentic applications. It delivers the same simplicity and predictability to prototyping, testing, and deploying across local and cloud environments,” said Elyi Aleyner, VP of Strategy and Head of Tech Alliances at Docker. “We’re excited to partner with Microsoft to bring this innovation to Azure Container Apps, enabling developers to go from ‘compose up’ on their laptops to secure, GPU-backed workloads in the cloud with zero friction.”

Empowering choice

Every team has its own favourite frameworks and tools.
We’ve ensured that Compose for Agents on ACA is framework‑agnostic: you can use LangGraph for complex workflows, CrewAI for multi‑agent coordination, or Spring AI to integrate with your existing Java stack. Want to run a vector store from the MCP catalog alongside your own service? Simply add it to your Compose file. Docker’s curated catalog provides over a hundred ready‑to‑use tools and services for retrieval, document summarization, database access and more. ACA’s flexibility means you’re free to choose the stack that best fits your problem.

Get started today

The public preview of Compose for Agents support in Azure Container Apps is available now. You can:

Install the latest Azure Container Apps extension.

Define your application in a compose.yaml file, including models, tools and agent code, and deploy to ACA via az containerapp compose up. ACA will provision GPU resources, dynamic sessions and auto‑scaling infrastructure automatically.

Iterate locally using standard docker compose up commands, then push the same configuration to the cloud when you’re ready.

For more detailed instructions please go to https://aka.ms/aca/compose-for-agents
555 Views, 2 likes, 0 Comments
From Timeouts to Triumph: Optimizing GPT-4o-mini for Speed, Efficiency, and Reliability
The Challenge

Large-scale generative AI deployments can stretch system boundaries, especially when thousands of concurrent requests require both high throughput and low latency. In one such production environment, GPT-4o-mini deployments running under Provisioned Throughput Units (PTUs) began showing sporadic 408 (timeout) and 429 (throttling) errors. Requests that normally completed in seconds were occasionally hitting the 60-second timeout window, causing degraded experiences and unnecessary retries. Initial suspicion pointed toward PTU capacity limitations, but deeper telemetry revealed a different cause.

What the Data Revealed

Using Azure Data Explorer (Kusto), API Management (APIM) logs, and OpenAI billing telemetry, a detailed investigation uncovered several insights:

Latency was not correlated with PTU utilization: PTU resources were healthy and performing within SLA even during spikes.

Time-Between-Tokens (TBT) stayed consistently low (~8–10 ms): The model was generating tokens steadily.

Excessive token output was the real bottleneck: Requests generating 6K–8K tokens simply required more time than allowed in the 60-second completion window.

In short, the model wasn’t slow; the workload was oversized.

The Optimization Opportunity

The analysis opened a broader optimization opportunity:

Balance token length with throughput targets.

Introduce architectural patterns to prevent timeout or throttling cascades under load.

Enforce automatic token governance instead of relying on client-side discipline.

The Solution

Three engineering measures delivered immediate impact: token optimization, spillover routing, and policy enforcement.

Right-size the Token Budget

Empirical throughput for GPT-4o-mini: ~33 tokens/sec → ~2K tokens in 60s.

Enforced max_tokens = 2000 for synchronous requests.

Enabled streaming responses for longer outputs, allowing incremental delivery without hitting timeout limits.
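The token-budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the team's actual tooling: the function names `token_budget` and `clamp_max_tokens` and the request-dictionary shape are hypothetical; only the ~33 tokens/sec throughput, the 60-second window, and the 2,000-token cap come from the text.

```python
import math


def token_budget(tokens_per_sec: float, timeout_sec: float) -> int:
    """Largest completion length (in tokens) expected to finish before the timeout."""
    return math.floor(tokens_per_sec * timeout_sec)


def clamp_max_tokens(request: dict, cap: int) -> dict:
    """Return a copy of the request with max_tokens limited to cap.

    Streaming requests are returned unchanged because they deliver output
    incrementally (a hypothetical rule mirroring the measures above).
    """
    if request.get("stream"):
        return dict(request)
    requested = request.get("max_tokens", cap)
    return {**request, "max_tokens": min(requested, cap)}


budget = token_budget(33, 60)  # 1980, which the article rounds to ~2K
capped = clamp_max_tokens({"max_tokens": 8000}, 2000)
print(budget, capped["max_tokens"])
```

For comparison, an 8,000-token completion at the same rate would need roughly 8000 / 33 ≈ 242 seconds, about four times the 60-second window, which is consistent with the timeouts described earlier.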
Enable Spillover for Continuity

Implemented multi-region spillover using Azure Front Door and APIM Premium gateways.

When PTU queues reached capacity or 429s appeared, requests were routed to Standard deployments in secondary regions.

The result: graceful degradation and uninterrupted user experience.

Govern with APIM Policies

Added inbound policies to inspect and adjust max_tokens dynamically.

On 408/429 responses, APIM retried and rerouted traffic based on spillover logic.

The Results

After optimization, improvements were immediate and measurable:

Latency Reduction: Significant improvement in end-to-end response times across high-volume workloads.

Reliability Gains: 408/429 errors fell from >1% to near zero.

Cost Efficiency: Average token generation decreased by ~60%, reducing per-request costs.

Scalability: Spillover routing ensured consistent performance during regional or capacity surges.

Governance: APIM policies established a reusable token-control framework for future AI workloads.

Lessons Learned

Latency isn’t always about capacity: Investigate workload patterns before scaling hardware.

Token budgets define the user experience: Over-generation can quietly break SLA compliance.

Design for elasticity: Spillover and multi-region routing maintain continuity during spikes.

Measure everything: Combine KQL telemetry, latency and token tracking for faster diagnostics.

The Outcome

By applying data-driven analysis, architectural tuning, and automated governance, the team turned an operational bottleneck into a model of consistent, scalable performance. The result: Faster responses. Lower costs. Higher trust. A blueprint for building resilient, high-throughput AI systems on Azure.
383 Views, 4 likes, 0 Comments
Calculating Chargebacks for Business Units/Projects Utilizing a Shared Azure OpenAI Instance
Azure OpenAI Service is at the forefront of technological innovation, offering REST API access to OpenAI's suite of revolutionary language models, including GPT-4, GPT-35-Turbo, and the Embeddings model series.

Enhancing Throughput for Scale

As enterprises seek to deploy OpenAI's powerful language models across various business units, they often require granular control over configuration and performance metrics. To address this need, Azure OpenAI Service is introducing dedicated throughput, a feature that provides a dedicated connection to OpenAI models with guaranteed performance levels. Throughput is quantified in terms of tokens per second (tokens/sec), allowing organizations to precisely measure and optimize the performance for both prompts and completions. The model of provisioned throughput provides enhanced management and adaptability for varying workloads, guaranteeing system readiness for spikes in demand. This capability also ensures a uniform user experience and steady performance for applications that require real-time responses.

Resource Sharing and Chargeback Mechanisms

Large organizations frequently provision a single instance of Azure OpenAI Service that is shared across multiple internal departments. This shared use necessitates an efficient mechanism for allocating costs to each business unit or consumer, based on the number of tokens consumed. This article delves into how chargeback is calculated for each business unit based on their token usage.

Leveraging Azure API Management Policies for Token Tracking

Azure API Management Policies offer a powerful solution for monitoring and logging the token consumption for each internal application. The process can be summarized in the following steps:

Sample Code: Refer to this GitHub repository for step-by-step instructions on how to build the solution outlined below: private-openai-with-apim-for-chargeback

1. 
Client applications authenticate to API Management

To make sure only legitimate clients can call the Azure OpenAI APIs, each client must first authenticate against Azure Active Directory and then call the APIM endpoint. In this scenario, the API Management service acts on behalf of the backend API, and the calling application requests access to the API Management instance. The scope of the access token is between the calling application and the API Management gateway. In API Management, configure a policy (validate-jwt or validate-azure-ad-token) to validate the token before the gateway passes the request to the backend.

2. APIM routes the request to the Azure OpenAI service via a private endpoint

Upon successful verification of the token, Azure API Management (APIM) routes the request to the Azure OpenAI service to fetch a response from the completions endpoint; the response also includes prompt and completion token counts.

3. Capture and log the API response to Event Hub

Use the log-to-eventhub policy to capture outgoing responses for logging or analytics purposes. To use this policy, a logger needs to be configured in API Management:

    # API Management service-specific details
    $apimServiceName = "apim-hello-world"
    $resourceGroupName = "myResourceGroup"

    # Create logger
    $context = New-AzApiManagementContext -ResourceGroupName $resourceGroupName -ServiceName $apimServiceName
    New-AzApiManagementLogger -Context $context -LoggerId "OpenAiChargeBackLogger" -Name "ApimEventHub" -ConnectionString "Endpoint=sb://<EventHubsNamespace>.servicebus.windows.net/;SharedAccessKeyName=<KeyName>;SharedAccessKey=<key>" -Description "Event hub logger with connection string"

Within the outbound policies section, pull specific data from the body of the response and send this information to the previously configured Event Hub instance.
This is not just a simple logging exercise; it is an entry point into a whole ecosystem of real-time analytics and monitoring capabilities:

    <outbound>
        <choose>
            <when condition="@(context.Response.StatusCode == 200)">
                <!-- logger-id must match the LoggerId of the logger created in API Management -->
                <log-to-eventhub logger-id="OpenAiChargeBackLogger">@{
                    var responseBody = context.Response.Body?.As<JObject>(true);
                    return new JObject(
                        new JProperty("Timestamp", DateTime.UtcNow.ToString()),
                        new JProperty("ApiOperation", responseBody["object"].ToString()),
                        new JProperty("AppKey", context.Request.Headers.GetValueOrDefault("Ocp-Apim-Subscription-Key",string.Empty)),
                        new JProperty("PromptTokens", responseBody["usage"]["prompt_tokens"].ToString()),
                        new JProperty("CompletionTokens", responseBody["usage"]["completion_tokens"].ToString()),
                        new JProperty("TotalTokens", responseBody["usage"]["total_tokens"].ToString())
                    ).ToString();
                }</log-to-eventhub>
            </when>
        </choose>
        <base />
    </outbound>

Event Hubs serves as a powerful fulcrum, offering seamless integration with a wide array of Azure and Microsoft services. For example, the logged data can be directly streamed to Azure Stream Analytics for real-time analytics or to Power BI for real-time dashboards. With Azure Event Grid, the same data can also be used to trigger workflows or automate tasks based on specific conditions met in the incoming responses. Moreover, the architecture is extensible to non-Microsoft services as well. Event Hubs can interact smoothly with external platforms like Apache Spark, allowing you to perform data transformations or feed machine learning models.

4. Data Processing with Azure Functions

An Azure Function is invoked when data is sent to the Event Hub instance, allowing for bespoke data processing in line with your organization’s unique requirements. For instance, this could range from dispatching the data to Azure Monitor, streaming it to Power BI dashboards, or even sending detailed consumption reports via Azure Communication Services.
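As one illustration of such processing, the sketch below aggregates logged usage records per client and prices them. It is a hedged Python sketch rather than part of the referenced solution: the field names match the JObject built in the outbound policy, but the `chargeback` function, the sample client keys, and the per-1,000-token rates are hypothetical placeholders, not real Azure OpenAI prices.

```python
import json
from collections import defaultdict

# Hypothetical rates per 1,000 tokens; substitute your organization's actual pricing.
PROMPT_RATE_PER_1K = 0.01
COMPLETION_RATE_PER_1K = 0.03


def chargeback(events: list[str]) -> dict[str, float]:
    """Sum token usage per AppKey and convert it to a cost."""
    totals = defaultdict(lambda: {"prompt": 0, "completion": 0})
    for raw in events:
        record = json.loads(raw)
        usage = totals[record["AppKey"]]
        # The policy logs token counts as strings, so convert before summing.
        usage["prompt"] += int(record["PromptTokens"])
        usage["completion"] += int(record["CompletionTokens"])
    return {
        app_key: round(
            usage["prompt"] / 1000 * PROMPT_RATE_PER_1K
            + usage["completion"] / 1000 * COMPLETION_RATE_PER_1K,
            6,
        )
        for app_key, usage in totals.items()
    }


# Made-up sample records in the shape emitted by the policy.
sample = [
    json.dumps({"AppKey": "team-a", "PromptTokens": "1000", "CompletionTokens": "500"}),
    json.dumps({"AppKey": "team-a", "PromptTokens": "2000", "CompletionTokens": "1000"}),
    json.dumps({"AppKey": "team-b", "PromptTokens": "500", "CompletionTokens": "0"}),
]
print(chargeback(sample))
```

Pricing prompt and completion tokens separately is a design choice made here because the two are typically billed at different rates; a single blended rate applied to TotalTokens would be simpler if that matches your internal cost model.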
    [Function("TokenUsageFunction")]
    public async Task Run([EventHubTrigger("%EventHubName%", Connection = "EventHubConnection")] string[] openAiTokenResponse)
    {
        // Event Hub messages arrive as an array
        foreach (var tokenData in openAiTokenResponse)
        {
            try
            {
                _logger.LogInformation("Azure OpenAI Tokens Data Received: {TokenData}", tokenData);
                var openAiToken = JsonSerializer.Deserialize<OpenAiToken>(tokenData);
                if (openAiToken == null)
                {
                    _logger.LogError("Invalid OpenAi Api Token Response Received. Skipping.");
                    continue;
                }
                // Forward the token counts to Application Insights as a custom event
                _telemetryClient.TrackEvent("Azure OpenAI Tokens", openAiToken.ToDictionary());
            }
            catch (Exception e)
            {
                // Pass the exception itself so the stack trace is logged
                _logger.LogError(e, "Error occurred when processing TokenData: {TokenData}", tokenData);
            }
        }
    }

In the example above, the Azure Function processes the token response data from Event Hub and sends it to Application Insights telemetry, and a basic dashboard is configured in Azure, displaying the token consumption for each client application. This information can conveniently be used to compute chargeback costs.

A sample query used in the dashboard that fetches tokens consumed by a specific client:

    customEvents
    | where name contains "Azure OpenAI Tokens"
    | extend tokenData = parse_json(customDimensions)
    | where tokenData.AppKey contains "your-client-key"
    | project
        Timestamp = tokenData.Timestamp,
        Stream = tokenData.Stream,
        ApiOperation = tokenData.ApiOperation,
        PromptTokens = tokenData.PromptTokens,
        CompletionTokens = tokenData.CompletionTokens,
        TotalTokens = tokenData.TotalTokens

Azure OpenAI Landing Zone reference architecture

A crucial detail to ensure the effectiveness of this approach is to secure the Azure OpenAI service by implementing Private Endpoints and using Managed Identities for App Service to authorize access to Azure AI services. This will limit access so that only the App Service can communicate with the Azure OpenAI service.
Failing to do this would render the solution ineffective, as individuals could bypass the APIM/App Service and directly access the OpenAI Service if they get hold of the access key for OpenAI. Refer to the Azure OpenAI Landing Zone reference architecture to build a secure and scalable AI environment.

Additional Considerations

If the client application is external, consider using an Application Gateway in front of Azure APIM.

If "stream" is set to true, the token count is not returned in the response. In that case, libraries like tiktoken (Python) or gpt-3-encoder (JavaScript) for most GPT-3 models can be used to programmatically calculate the token count for the user prompt and completion response. A useful guideline to remember is that in typical English text, one token is approximately equal to 4 characters. This equates to about three-quarters of a word, meaning that 100 tokens are roughly equivalent to 75 words. (P.S. Microsoft does not endorse or guarantee any third-party libraries.)

A subscription key or a custom header like app-key can also be used to uniquely identify the client, as the appId in an OAuth token is not very intuitive.

Rate-limiting can be implemented for incoming requests using OAuth tokens or subscription keys, adding another layer of security and resource management.

The solution can also be extended to redirect different clients to different Azure OpenAI instances. For example, some clients utilize an Azure OpenAI instance with default quotas, whereas premium clients get to consume an Azure OpenAI instance with dedicated throughput.

Conclusion

Azure OpenAI Service stands as an indispensable tool for organizations seeking to harness the immense power of language models. With the feature of provisioned throughput, clients can define their usage limits in throughput units and freely allocate these to the OpenAI model of their choice.
However, the financial commitment can be significant and is dependent on factors like the chosen model's type, size, and utilization. An effective chargeback system offers several advantages, such as heightened accountability, transparent costing, and judicious use of resources within the organization.
22K Views, 10 likes, 10 Comments
Unlock New AI and Cloud Potential with .NET 9 & Azure: Faster, Smarter, and Built for the Future
.NET 9, now available to developers, marks a significant milestone in the evolution of the .NET platform, pushing the boundaries of performance, cloud-native development, and AI integration. This release, shaped by contributions from over 9,000 community members worldwide, introduces thousands of improvements that set the stage for the future of application development. With seamless integration with Azure and a focus on cloud-native development and AI capabilities, .NET 9 empowers developers to build scalable, intelligent applications with unprecedented ease.

Expanding Azure PaaS Support for .NET 9

With the release of .NET 9, a comprehensive range of Azure Platform as a Service (PaaS) offerings now fully support the platform’s new capabilities, including the latest .NET SDK for any Azure developer. This extensive support allows developers to build, deploy, and scale .NET 9 applications with optimal performance and adaptability on Azure. Additionally, developers can access a wealth of architecture references and sample solutions to guide them in creating high-performance .NET 9 applications on Azure’s powerful cloud services:

Azure App Service: Run, manage, and scale .NET 9 web applications efficiently. Check out this blog to learn more about what's new in Azure App Service.

Azure Functions: Leverage serverless computing to build event-driven .NET 9 applications with improved runtime capabilities.

Azure Container Apps: Deploy microservices and containerized .NET 9 workloads with integrated observability.

Azure Kubernetes Service (AKS): Run .NET 9 applications in a managed Kubernetes environment with expanded ARM64 support.

Azure AI Services and Azure OpenAI Services: Integrate advanced AI and OpenAI capabilities directly into your .NET 9 applications.

Azure API Management, Azure Logic Apps, Azure Cognitive Services, and Azure SignalR Service: Ensure seamless integration and scaling for .NET 9 solutions.
These services provide developers with a robust platform to build high-performance, scalable, and cloud-native applications while leveraging Azure’s optimized environment for .NET.

Streamlined Cloud-Native Development with .NET Aspire

.NET Aspire is a game-changer for cloud-native applications, enabling developers to build distributed, production-ready solutions efficiently. Available in preview with .NET 9, Aspire streamlines app development, with cloud efficiency and observability at its core. The latest updates in Aspire include secure defaults, Azure Functions support, and enhanced container management. Key capabilities include:

Optimized Azure Integrations: Aspire works seamlessly with Azure, enabling fast deployments, automated scaling, and consistent management of cloud-native applications.

Easier Deployments to Azure Container Apps: Designed for containerized environments, .NET Aspire integrates with Azure Container Apps (ACA) to simplify the deployment process. Using the Azure Developer CLI (azd), developers can quickly provision and deploy .NET Aspire projects to ACA, with built-in support for Redis caching, application logging, and scalability.

Built-In Observability: A real-time dashboard provides insights into logs, distributed traces, and metrics, enabling local and production monitoring with Azure Monitor.

With these capabilities, .NET Aspire allows developers to deploy microservices and containerized applications effortlessly on ACA, streamlining the path from development to production in a fully managed, serverless environment.

Integrating AI into .NET: A Seamless Experience

In our ongoing effort to empower developers, we’ve made integrating AI into .NET applications simpler than ever. Our strategic partnerships, including collaborations with OpenAI, LlamaIndex, and Qdrant, have enriched the AI ecosystem and strengthened .NET’s capabilities.
This year alone, usage of Azure OpenAI services has surged to nearly a billion API calls per month, illustrating the growing impact of AI-powered .NET applications.

Real-World AI Solutions with .NET: .NET has been pivotal in driving AI innovations. From internal teams like Microsoft Copilot creating AI experiences with .NET Aspire to tools like GitHub Copilot, developed with .NET to enhance productivity in Visual Studio and VS Code, the platform showcases AI at its best. KPMG Clara is a prime example, developed to enhance audit quality and efficiency for 95,000 auditors worldwide. By leveraging .NET and scaling securely on Azure, KPMG implemented robust AI features aligned with strict industry standards, underscoring .NET and Azure as the backbone for high-performing, scalable AI solutions.

Performance Enhancements in .NET 9: Raising the Bar for Azure Workloads

.NET 9 introduces substantial performance upgrades with over 7,500 merged pull requests focused on speed and efficiency, ensuring .NET 9 applications run optimally on Azure. These improvements contribute to reduced cloud costs and provide a high-performance experience across Windows, Linux, and macOS. To see how significant these performance gains can be for cloud services, take a look at what past .NET upgrades achieved for Microsoft’s high-scale internal services:

Bing achieved a major reduction in startup times, enhanced efficiency, and decreased latency across its high-performance search workflows.

Microsoft Teams improved efficiency by 50%, reduced latency by 30–45%, and achieved up to 100% gains in CPU utilization for key services, resulting in faster user interactions.

Microsoft Copilot and other AI-powered applications benefited from optimized runtime performance, enabling scalable, high-quality experiences for users.

Upgrading to the latest .NET version offers similar benefits for cloud apps, optimizing both performance and cost-efficiency.
For more information on updating your applications, check out the .NET Upgrade Assistant. For additional details on ASP.NET Core, .NET MAUI, NuGet, and more enhancements across the .NET platform, check out the full Announcing .NET 9 blog post.

Conclusion: Your Path to the Future with .NET 9 and Azure

.NET 9 isn’t just an upgrade; it’s a leap forward, combining cutting-edge AI integration, cloud-native development, and unparalleled performance. Paired with Azure’s scalability, these advancements provide a trusted, high-performance foundation for modern applications. Get started by downloading .NET 9 and exploring its features. Leverage .NET Aspire for streamlined cloud-native development, deploy scalable apps with Azure, and embrace new productivity enhancements to build for the future. For additional insights on ASP.NET, .NET MAUI, NuGet, and more, check out the full Announcing .NET 9 blog post. Explore the future of cloud-native and AI development with .NET 9 and Azure: your toolkit for creating the next generation of intelligent applications.
9.8K Views, 2 likes, 1 Comment
Generative AI with JavaScript FREE course
JavaScript devs, now’s your chance to tap into the potential of Generative AI! Whether you’re just curious or ready to level up your apps, our new video series is packed with everything you need to start building AI-powered applications.
3.6K Views, 0 likes, 0 Comments
OpenAI at Scale: Maximizing API Management through Effective Service Utilization
Harnessing Azure OpenAI at Scale: Effective API Management with Circuit Breaker, Retry, and Load Balance

Unlock the full potential of Azure OpenAI by leveraging the advanced capabilities of Azure API Management. This guide explores how to effectively utilize Circuit Breaker, Retry, and Load Balance strategies to optimize backends and ensure seamless service utilization. Learn best practices for integrating OpenAI services, enhancing performance, and achieving scalability through robust API management policies.
6.8K Views, 3 likes, 0 Comments
Add a context-grounded AI chatbot to your Azure Static Web Apps with streaming responses
With recent announcements, we can stream AI responses to our Azure Static Web Apps and build chatbot experiences to make it easier to interact with content-heavy websites. In this article, we'll demonstrate how to build a natural language chatbot for our website following the retrieval augmented generation (RAG) pattern and stream the natural language response to our user.
20K Views, 0 likes, 1 Comment
Azure OpenAI Extension for Function Apps Hands-on Experience
This blog gives some insights into the newly released Azure OpenAI extension, which combines the Azure OpenAI service with Azure Function Apps. We will discuss the following: Why this extension? What are the current requirements and support scope? How do you use it?
3.6K Views, 4 likes, 0 Comments
Intelligent app on Azure Container Apps Landing Zone Accelerator
Looking to deploy your AI-infused or intelligent app to Azure Container Apps? Look no further! Check out the end-to-end guidance to fast-track your journey to production with intelligent applications. It's crucial to implement your solutions adhering to the most effective practices in security, monitoring, networking, and operational excellence.
3K Views, 4 likes, 0 Comments
Next-Gen Customer Service: Azure's AI-Powered Speech, Translation and Summarization
Break down language barriers effortlessly! Dive into our upcoming demo showcasing the power of AI integration (client-side). Learn how to transcribe, translate, synthesize, and summarize conversations in real-time with Azure AI services. Stay tuned for an enlightening journey through seamless multilingual communication.
8.7K Views, 2 likes, 0 Comments