Authors: Sethu Raman, Chris Hoder
As generative AI solutions move from experimentation to production, consistent performance under variable demand is becoming a critical requirement. Today, we are announcing the general availability of Priority Processing in Microsoft Foundry, a new capability designed to help organizations run latency-sensitive AI workloads with greater performance consistency and pay-per-call spending flexibility.
Predictable Performance for Latency-Sensitive AI in Production
As AI applications move into production, enterprises face growing pressure to deliver predictable, low‑latency performance for real-time copilots and agentic workflows—without upfront monthly or annual financial commitments.
Priority Processing is designed to address these deployment challenges by prioritizing latency‑sensitive inference requests with pay-per-call flexible spending, enabling SLA‑backed performance for interactive AI workloads without requiring provisioned throughput commitments.
- Enables consistent high-speed performance on a pay-as-you-go basis
- Dynamically allocates compute for time-critical workloads
- Supports real-time AI applications without monthly or yearly throughput commitments
Built for Real-Time AI Experiences
Enterprise AI deployments frequently combine synchronous and asynchronous workloads such as live chat assistants, internal productivity copilots, scheduled document processing pipelines, and offline summarization jobs.
For example, a financial services organization running real-time fraud detection alongside nightly transaction summarization experienced detection latency spikes during batch windows. With Priority Processing enabled, fraud detection requests maintained consistent response times regardless of background workload volume.
Priority Processing integrates directly into Microsoft Foundry deployments and can be applied across a range of production use cases, including:
- Real-time customer engagement copilots
- Interactive developer tools
- Financial services decisioning workflows
- Operational dashboards
- AI-powered agent orchestration scenarios
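As an illustrative sketch of how an application might separate interactive traffic from background jobs, the snippet below builds an OpenAI-compatible chat completions payload and opts latency-sensitive requests into prioritized handling. The `service_tier` field, its `"priority"` value, and the model name are assumptions for illustration; consult the Microsoft Foundry documentation for the exact parameter names your deployment supports.

```python
# Hypothetical sketch: routing interactive traffic to Priority Processing.
# The "service_tier" field and "priority" value are assumptions, not a
# confirmed Foundry API surface -- check the official docs.

def build_chat_request(messages, latency_sensitive=False):
    """Build an OpenAI-compatible chat completions payload."""
    payload = {
        "model": "gpt-4o",  # placeholder model name
        "messages": messages,
    }
    if latency_sensitive:
        # Interactive copilot traffic: request prioritized processing.
        payload["service_tier"] = "priority"
    return payload

# An interactive copilot request opts into the priority tier...
interactive = build_chat_request(
    [{"role": "user", "content": "Summarize this support ticket."}],
    latency_sensitive=True,
)
# ...while a background summarization job stays on the default tier.
background = build_chat_request(
    [{"role": "user", "content": "Summarize last night's logs."}]
)
```

Keeping the tier decision in one helper lets the same client code serve both workload classes without separate infrastructure.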
Organizations can differentiate between background workloads and interactive production applications without modifying their infrastructure or resource management strategies.
Customer Spotlight
Organizations including Adobe and Harvey are using Priority Processing in Microsoft Foundry to support latency-sensitive AI experiences while maintaining throughput for background workloads.
Early adopters report improved responsiveness for interactive workloads while continuing to process asynchronous jobs without requiring dedicated infrastructure or manual traffic management during peak demand periods.
Pricing
Priority Processing uses the same token-based pricing model as Standard. It is available in both Global and Data Zone deployments. For Global deployments, Priority Processing is priced at a premium over the Standard tier (for example, 2× for the latest models such as GPT 5.4), reflecting prioritized access for latency-sensitive workloads. Data Zone deployments carry a modest additional 10% uplift over Global pricing to support regional data processing requirements.
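As a rough illustration of how the multipliers above combine, the sketch below estimates spend for a given token volume. The base rate is a made-up placeholder, not a published price; refer to the official pricing page for actual rates.

```python
# Illustrative cost sketch for the tier multipliers described above.
BASE_RATE = 1.00  # placeholder Standard Global rate per 1M tokens (NOT a real price)

PRIORITY_MULTIPLIER = 2.0  # e.g., 2x Standard for the latest models
DATA_ZONE_UPLIFT = 1.10    # 10% uplift over Global pricing

def priority_cost(tokens_millions, data_zone=False):
    """Estimate Priority Processing spend for a token volume."""
    rate = BASE_RATE * PRIORITY_MULTIPLIER
    if data_zone:
        rate *= DATA_ZONE_UPLIFT
    return tokens_millions * rate

# 5M tokens on a Global Priority deployment: 5 * 1.00 * 2.0
global_cost = priority_cost(5)
# Same volume in a Data Zone adds the 10% uplift: 10.0 * 1.10
dz_cost = priority_cost(5, data_zone=True)
```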
Choose the Right Deployment for Your Workload Needs
Microsoft Foundry provides flexibility for organizations to select deployment options based on three workload considerations:
- Data processing requirements
- Latency performance needs
- Overall throughput requirements
To choose the right deployment for your workload needs, we recommend the following steps:
Select the Right Data Processing Boundary
- Global: Broadest model access and highest throughput at the lowest price.
- Data Zone: Data processed within US/EU boundaries, at a higher price and lower default throughput than Global.
- Regional: Strict data residency for regulated environments with reduced model availability.
Align Deployment to Latency Sensitivity
- Latency-sensitive production workloads should use Priority Processing.
- Balanced production workloads can run on Standard deployments.
- Mission-critical, high-scale production workloads should consider Provisioned Throughput.
- Bulk processing workloads without latency requirements can run using Batch deployments.
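The alignment guidance above can be sketched as a simple decision helper. The labels mirror the recommendations in this post; this is an illustrative sketch, not an official API.

```python
def recommend_deployment(latency_sensitive=False, mission_critical=False, bulk=False):
    """Map workload traits to a recommended deployment type,
    following the alignment guidance in this post."""
    if mission_critical:
        # Mission-critical, high-scale production workloads.
        return "Provisioned Throughput"
    if latency_sensitive:
        # Latency-sensitive production workloads.
        return "Priority Processing"
    if bulk:
        # Bulk processing without latency requirements.
        return "Batch"
    # Balanced production workloads.
    return "Standard"
```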
Table: Guidance on Selecting the Right Deployment

| Workload Type and Needs | Recommended Deployment Type |
| --- | --- |
| Latency-sensitive production | Priority Processing for Standard |
| Balanced production | Standard |
| Mission-critical, high-scale | Provisioned Throughput (PTUM) |
| Bulk processing | Batch |
Scale from Development to Production
Customers can deploy using pay-per-call environments and scale to commitment-based offers as workloads grow, unlocking:
- Lower cost at scale
- Service level agreements (SLAs)
- Enterprise production features
With Foundry full‑stack deployments, customers can flexibly scale AI workloads from development to mission‑critical production—balancing performance, reliability, and cost efficiency without re‑architecting their infrastructure.
Get Started
Evaluate Priority Processing for your production AI applications by reviewing Microsoft Foundry documentation or connecting with your Microsoft account team to assess deployment readiness for latency-sensitive workloads.