performance
Scaling Write Throughput in Azure Database for MySQL Using Application-Level Sharding
This blog post walks through scaling write throughput in Azure Database for MySQL using application level sharding. It starts with the why behind sharding and then builds a complete C# implementation that spreads writes across three Azure Database for MySQL Flexible Servers. Why Shard in the First Place? This post focuses specifically on scaling write throughput. A well-tuned single primary node can take you remarkably far, and techniques such as indexing strategies, write batching, redo log optimization, and vertical compute scaling each deliver real, lasting value. For many workloads, these optimizations are all you will ever need. That said, as write volume continues to grow, a single primary eventually approaches its practical capacity, and at that point the most durable way to keep scaling is to distribute the write workload across multiple primary instances. This architecture is what we call sharding. When you reach this inflection point, there are two primary patterns for managing multiple write nodes: Proxy or Middleware Layer Sharding: A sharding aware proxy sits between the application and a pool of Azure Database for MySQL instances, routing queries based on a shard key. While this abstracts the underlying topology from the application layer, it introduces an additional, complex component to operate, secure, scale, and patch. Application Layer Sharding: The application itself resolves the destination shard key and determines which of the N Azure Database for MySQL instances should receive a write before ever opening a database connection. Each backend target remains a completely standard, independent Azure Database for MySQL instance. This post explores the second approach. The core appeal of application layer sharding is architectural simplicity: it introduces zero infrastructure overhead and eliminates an extra network hop. 
Every shard behaves exactly like a standalone instance, meaning your existing backup, restore, monitoring pipelines, and the Azure portal function seamlessly without modification. The explicit tradeoff is that you forgo cross shard joins and distributed transactions in exchange for absolute predictability and control over data access patterns.

The Plan

We will build a small order management service that distributes its data across three Azure Database for MySQL instances that already exist. The application, written in C# on .NET 8, owns the partitioning logic. The premise: the three servers are already provisioned, the firewalls are configured, the network paths are established, and each server has its own administrative credentials. We are not provisioning infrastructure in this post; we are writing the application code that consumes it.

mysql-shard-0.mysql.database.azure.com   user: shard0_admin   pwd: <secret-0>
mysql-shard-1.mysql.database.azure.com   user: shard1_admin   pwd: <secret-1>
mysql-shard-2.mysql.database.azure.com   user: shard2_admin   pwd: <secret-2>

Each server hosts an identical appdb database with the same schema:

CREATE TABLE users (
    user_id BIGINT NOT NULL PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE KEY uq_email (email)
);

CREATE TABLE orders (
    order_id BIGINT NOT NULL PRIMARY KEY,
    user_id BIGINT NOT NULL,
    amount_cents INT NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    KEY ix_user (user_id)
);

Two design decisions in this schema warrant explanation:

- No AUTO_INCREMENT for user_id or order_id. With per-server counters, two shards would each independently generate the same value (42, say), and the identifiers would collide. Instead, we assign identifiers in the application, using a scheme such as Snowflake, ULID, or UUIDv7.
- orders carries user_id, and we route by it. This is the single most important rule of sharding: choose a shard key that keeps related data colocated, so that the common queries remain on a single shard.
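To make the first decision concrete, here is a minimal sketch of a Snowflake-style generator in Python. It is an illustration under stated assumptions, not the scheme the post's C# code uses: the bit layout (41 bits of milliseconds, 10 bits of worker ID, 12 bits of sequence), the custom epoch, and the name `SnowflakeIdGenerator` are all hypothetical choices. The resulting values fit in a signed 64-bit integer, so they can be stored in the BIGINT columns above.

```python
import threading
import time

# Assumed layout (hypothetical): 41 bits of milliseconds since a custom epoch,
# 10 bits of worker ID, 12 bits of per-millisecond sequence.
CUSTOM_EPOCH_MS = 1_704_067_200_000  # 2024-01-01T00:00:00Z, an arbitrary choice
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeIdGenerator:
    """Generates unique, increasing 63-bit IDs without any database round trip."""

    def __init__(self, worker_id: int) -> None:
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now_ms = time.time_ns() // 1_000_000
            # Never go backwards, even if the wall clock does.
            now_ms = max(now_ms, self._last_ms)
            if now_ms == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond:
                    # advance the logical clock by one millisecond.
                    now_ms += 1
            else:
                self._sequence = 0
            self._last_ms = now_ms
            return (
                ((now_ms - CUSTOM_EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )


generator = SnowflakeIdGenerator(worker_id=7)
print(generator.next_id())
```

Each application instance would need its own `worker_id`; how that value is assigned (configuration, a lease, an ordinal) is outside the scope of this sketch.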
A note on UNIQUE KEY uq_email. A unique index enforces uniqueness only within a single physical shard. Because we route by user_id, two users with different IDs and the same email may land on different shards, and both inserts will succeed. If you require globally unique emails, two options exist: (a) maintain a separate email → user_id lookup table on a single "directory" server and write to it first within an idempotent flow, or (b) shard the users table by a hash of email instead. We retain user_id routing throughout this post because it is the correct choice for orders, and we treat per shard email uniqueness as a best effort guard rather than a hard global invariant. How the Partitioning Works The naive approach to sharding is shard = hash(key) % N. This works until you need to add a fourth server, at which point roughly 75% of your data must move. In any system of meaningful size, that is prohibitively expensive. The established solution is virtual buckets. You hash the key into a large, fixed bucket space (here, 1024), then map buckets to physical shards. When you add capacity, you relocate only buckets; you never rehash the entire dataset. In production, the bucket_to_shard_map typically resides in a system such as Azure App Configuration or etcd, so that you can rebalance without redeploying. For this post, we keep it as an in-memory array seeded at startup, which is straightforward to replace later. 
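The difference between the two approaches can be checked with a short simulation. The sketch below, written in Python for brevity, mirrors the hashing described here (MD5 of the decimal key, first two bytes, 1024 buckets) and compares how many keys change shard when a fourth server is added. The helper names `hash16`, `even_bucket_map`, and `add_shard` are hypothetical, and the rebalancing policy (take buckets from the existing shards until the new one holds a fair share) is one simple assumption among many possible ones.

```python
import hashlib
from collections import Counter

VIRTUAL_BUCKETS = 1024


def hash16(shard_key: int) -> int:
    """First two MD5 bytes of the decimal key, as an unsigned 16-bit value."""
    digest = hashlib.md5(str(shard_key).encode("ascii")).digest()
    return (digest[0] << 8) | digest[1]


def even_bucket_map(shard_count: int) -> list[int]:
    """Initial map: buckets are dealt out to shards round-robin."""
    return [bucket % shard_count for bucket in range(VIRTUAL_BUCKETS)]


def add_shard(old_map: list[int], new_shard_id: int) -> list[int]:
    """Give the new shard a fair share of buckets, taken from overfull shards."""
    fair_share = len(old_map) // (new_shard_id + 1)
    counts = Counter(old_map)
    new_map = list(old_map)
    moved = 0
    for bucket, owner in enumerate(old_map):
        if moved == fair_share:
            break
        if counts[owner] > fair_share:
            new_map[bucket] = new_shard_id
            counts[owner] -= 1
            moved += 1
    return new_map


keys = range(1, 20_001)
map3 = even_bucket_map(3)
map4 = add_shard(map3, new_shard_id=3)

# Naive scheme: the shard is hash % N, so changing N reshuffles most keys.
moved_modulo = sum(hash16(k) % 3 != hash16(k) % 4 for k in keys)
# Bucket scheme: a key moves only if its bucket was reassigned.
moved_buckets = sum(
    map3[hash16(k) % VIRTUAL_BUCKETS] != map4[hash16(k) % VIRTUAL_BUCKETS]
    for k in keys
)

print(f"hash % N:   {moved_modulo / len(keys):.1%} of keys change shard")
print(f"bucket map: {moved_buckets / len(keys):.1%} of keys change shard")
```

With these assumptions the modulo scheme relocates roughly three quarters of the keys, while the bucket map relocates only the 256 reassigned buckets, about a quarter of the keys, which is the minimum needed to give a fourth shard an equal share.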
The Project

ShardingDemo/
├── ShardingDemo.csproj
├── appsettings.json
├── Models.cs
├── ShardRouter.cs
├── UserRepository.cs
└── Program.cs

ShardingDemo.csproj

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <Nullable>enable</Nullable>
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="MySqlConnector" Version="2.6.0" />
    <PackageReference Include="Microsoft.Extensions.Hosting" Version="8.0.0" />
    <PackageReference Include="Microsoft.Extensions.Configuration.Binder" Version="8.0.0" />
  </ItemGroup>
  <ItemGroup>
    <Content Include="appsettings.json" CopyToOutputDirectory="PreserveNewest" />
  </ItemGroup>
</Project>

appsettings.json

Shards is an ordered list, and a shard's position in the array is its logical ID.

{
  "Shards": [
    { "Host": "mysql-shard-0.mysql.database.azure.com", "Database": "appdb", "User": "shard0_admin", "Password": "REPLACE_ME_0" },
    { "Host": "mysql-shard-1.mysql.database.azure.com", "Database": "appdb", "User": "shard1_admin", "Password": "REPLACE_ME_1" },
    { "Host": "mysql-shard-2.mysql.database.azure.com", "Database": "appdb", "User": "shard2_admin", "Password": "REPLACE_ME_2" }
  ]
}

Models.cs

namespace ShardingDemo;

public sealed record User(long UserId, string Email, DateTime CreatedAt);

public sealed record Order(long OrderId, long UserId, int AmountCents, DateTime CreatedAt);

public sealed class ShardConfig
{
    public required string Host { get; init; }
    public required string Database { get; init; }
    public required string User { get; init; }
    public required string Password { get; init; }
}

ShardRouter.cs

using System.Security.Cryptography;
using System.Text;
using MySqlConnector;

namespace ShardingDemo;

public sealed class Shard : IAsyncDisposable
{
    public int Id { get; }
    public MySqlDataSource DataSource { get; }

    public Shard(int id, ShardConfig cfg)
    {
        Id = id;
        var csb = new MySqlConnectionStringBuilder
        {
            Server = cfg.Host,
            Port = 3306,
            Database = cfg.Database,
            UserID = cfg.User,
            Password = cfg.Password,
            SslMode = MySqlSslMode.Required,
            Pooling = true,
            MinimumPoolSize = 2,
            MaximumPoolSize = 100,
            ConnectionTimeout = 10,
            DefaultCommandTimeout = 30,
        };
        DataSource = new MySqlDataSourceBuilder(csb.ConnectionString).Build();
    }

    public ValueTask DisposeAsync() => DataSource.DisposeAsync();
}

public sealed class ShardRouter : IAsyncDisposable
{
    private const int VirtualBuckets = 1024;
    private readonly IReadOnlyList<Shard> _shards;
    private readonly int[] _bucketToShardId;

    public ShardRouter(IEnumerable<ShardConfig> configs)
    {
        _shards = configs.Select((c, i) => new Shard(i, c)).ToList();

        // Even distribution. Replace with a map loaded from your control plane for live rebalancing.
        _bucketToShardId = new int[VirtualBuckets];
        for (int i = 0; i < VirtualBuckets; i++)
            _bucketToShardId[i] = i % _shards.Count;
    }

    public IReadOnlyList<Shard> AllShards => _shards;

    private static int BucketFor(long shardKey)
    {
        byte[] hash = MD5.HashData(Encoding.ASCII.GetBytes(shardKey.ToString()));

        // Use the first byte pair as an unsigned value, then map it into the bucket space.
        int value = (hash[0] << 8) | hash[1];
        return value % VirtualBuckets;
    }

    public Shard ShardForKey(long shardKey)
    {
        int bucket = BucketFor(shardKey);
        return _shards[_bucketToShardId[bucket]];
    }

    public async ValueTask DisposeAsync()
    {
        foreach (var s in _shards)
            await s.DisposeAsync();
    }
}

UserRepository.cs

Observe that every per user method calls ShardForKey(userId), even when inserting an order. This is the colocation rule at work. An order and its owning user always reside on the same shard, so queries for a single user only ever reach one shard. Only the cross-shard aggregate (TotalRevenueCentsAsync) must fan out.
using MySqlConnector;

namespace ShardingDemo;

public sealed class UserRepository
{
    private readonly ShardRouter _router;

    public UserRepository(ShardRouter router) { _router = router; }

    public async Task CreateUserAsync(long userId, string email, CancellationToken ct = default)
    {
        var shard = _router.ShardForKey(userId);
        await using var conn = await shard.DataSource.OpenConnectionAsync(ct);
        await using var cmd = conn.CreateCommand();
        cmd.CommandText = "INSERT INTO users (user_id, email) VALUES (@id, @email)";
        cmd.Parameters.AddWithValue("@id", userId);
        cmd.Parameters.AddWithValue("@email", email);
        await cmd.ExecuteNonQueryAsync(ct);
    }

    public async Task<User?> GetUserAsync(long userId, CancellationToken ct = default)
    {
        var shard = _router.ShardForKey(userId);
        await using var conn = await shard.DataSource.OpenConnectionAsync(ct);
        await using var cmd = conn.CreateCommand();
        cmd.CommandText = "SELECT user_id, email, created_at FROM users WHERE user_id = @id";
        cmd.Parameters.AddWithValue("@id", userId);
        await using var reader = await cmd.ExecuteReaderAsync(ct);
        if (!await reader.ReadAsync(ct)) return null;
        return new User(reader.GetInt64(0), reader.GetString(1), reader.GetDateTime(2));
    }

    public async Task AddOrderAsync(long orderId, long userId, int amountCents, CancellationToken ct = default)
    {
        // Routed by user_id, so orders colocate with their owning user.
        var shard = _router.ShardForKey(userId);
        await using var conn = await shard.DataSource.OpenConnectionAsync(ct);
        await using var cmd = conn.CreateCommand();
        cmd.CommandText = """
            INSERT INTO orders (order_id, user_id, amount_cents)
            VALUES (@oid, @uid, @amt)
            """;
        cmd.Parameters.AddWithValue("@oid", orderId);
        cmd.Parameters.AddWithValue("@uid", userId);
        cmd.Parameters.AddWithValue("@amt", amountCents);
        await cmd.ExecuteNonQueryAsync(ct);
    }

    public async Task<IReadOnlyList<Order>> GetOrdersForUserAsync(long userId, CancellationToken ct = default)
    {
        var shard = _router.ShardForKey(userId);
        await using var conn = await shard.DataSource.OpenConnectionAsync(ct);
        await using var cmd = conn.CreateCommand();
        cmd.CommandText = """
            SELECT order_id, user_id, amount_cents, created_at
            FROM orders
            WHERE user_id = @uid
            """;
        cmd.Parameters.AddWithValue("@uid", userId);
        var list = new List<Order>();
        await using var reader = await cmd.ExecuteReaderAsync(ct);
        while (await reader.ReadAsync(ct))
        {
            list.Add(new Order(
                reader.GetInt64(0),
                reader.GetInt64(1),
                reader.GetInt32(2),
                reader.GetDateTime(3)));
        }
        return list;
    }

    /// <summary>Cross shard fanout.</summary>
    public async Task<long> TotalRevenueCentsAsync(CancellationToken ct = default)
    {
        var tasks = _router.AllShards.Select(async shard =>
        {
            await using var conn = await shard.DataSource.OpenConnectionAsync(ct);
            await using var cmd = conn.CreateCommand();
            cmd.CommandText = "SELECT COALESCE(SUM(amount_cents), 0) FROM orders";
            var result = await cmd.ExecuteScalarAsync(ct);
            return Convert.ToInt64(result);
        });
        var perShard = await Task.WhenAll(tasks);
        return perShard.Sum();
    }
}

Program.cs

using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using ShardingDemo;

var builder = Host.CreateApplicationBuilder(args);

// Bind Shards:[] from appsettings.json (override with user-secrets / env vars / Key Vault)
var shardConfigs = builder.Configuration
    .GetSection("Shards")
    .Get<List<ShardConfig>>()
    ?? throw new InvalidOperationException("No 'Shards' section configured.");

if (shardConfigs.Count == 0)
    throw new InvalidOperationException("At least one shard must be configured.");

builder.Services.AddSingleton(_ => new ShardRouter(shardConfigs));
builder.Services.AddSingleton<UserRepository>();

using var host = builder.Build();
var repo = host.Services.GetRequiredService<UserRepository>();
var router = host.Services.GetRequiredService<ShardRouter>();

(long Id, string Email)[] users =
{
    (1001, "ada@example.com"),
    (2002, "linus@example.com"),
    (3003, "grace@example.com"),
    (4004, "alan@example.com"),
};

foreach (var (id, email) in users)
{
    await repo.CreateUserAsync(id, email);
    Console.WriteLine($"user {id} -> shard {router.ShardForKey(id).Id}");
}

await repo.AddOrderAsync(orderId: 9001, userId: 1001, amountCents: 4999);
await repo.AddOrderAsync(orderId: 9002, userId: 1001, amountCents: 1299);
await repo.AddOrderAsync(orderId: 9003, userId: 2002, amountCents: 8800);

Console.WriteLine($"\nAda: {await repo.GetUserAsync(1001)}");
Console.WriteLine($"Ada's orders: {(await repo.GetOrdersForUserAsync(1001)).Count}");
Console.WriteLine($"\nTotal revenue across 3 shards: " +
    $"${await repo.TotalRevenueCentsAsync() / 100m:F2}");

await router.DisposeAsync();

Tracing One Request End to End

Consider GetOrdersForUserAsync(1001):

1. ShardForKey(1001) → MD5("1001") → first two bytes as a number → % 1024 → a bucket in the range 0..1023.
2. The bucket map (seeded as bucket % 3) → a physical shard → for example mysql-shard-2.mysql.database.azure.com.
3. The MySqlDataSource provides a pooled, TLS encrypted connection authenticated as shard2_admin.
4. The query runs against shard 2's local ix_user index, with no fan out and at single server speed.

Every call with userId = 1001, whether GetUser, AddOrder, or GetOrdersForUser, lands on the same shard. That is why orders JOIN users ON orders.user_id = users.user_id WHERE users.user_id = 1001 executes within a single shard, with no cross-shard traffic.
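The first two steps of this trace are easy to reproduce outside the application, which helps when you need to find out which server holds a given user. The following Python sketch recomputes them for one key; `trace_route` is a hypothetical helper, and the host names and the round-robin bucket map are assumed to match the three-shard setup described in this post.

```python
import hashlib

VIRTUAL_BUCKETS = 1024
# Assumed to match the three servers and the round-robin bucket seeding in this post.
SHARD_HOSTS = [f"mysql-shard-{i}.mysql.database.azure.com" for i in range(3)]
BUCKET_TO_SHARD = [bucket % len(SHARD_HOSTS) for bucket in range(VIRTUAL_BUCKETS)]


def trace_route(user_id: int) -> dict:
    """Return every intermediate value on the way from a key to a shard host."""
    digest = hashlib.md5(str(user_id).encode("ascii")).digest()
    value = (digest[0] << 8) | digest[1]   # first two bytes, big-endian
    bucket = value % VIRTUAL_BUCKETS       # 0..1023
    shard_id = BUCKET_TO_SHARD[bucket]     # bucket map lookup
    return {
        "md5": digest.hex(),
        "value": value,
        "bucket": bucket,
        "shard_id": shard_id,
        "host": SHARD_HOSTS[shard_id],
    }


for step, result in trace_route(1001).items():
    print(f"{step:>8}: {result}")
```

Because the function is pure, calling it repeatedly with the same key always yields the same host, which is the property that keeps a user's rows and their orders together.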
Conclusion

The essential point is this. Once a single primary can no longer absorb your write load, sharding becomes a durable answer, and implementing it at the application layer keeps every part of the system explicit and comprehensible.

When write volume or dataset size outgrows a single primary, application layer sharding provides several benefits:

- N independent Azure Database for MySQL instances, each absorbing 1/N of the write traffic.
- Queries by user that remain on a single shard and behave like an ordinary, modestly sized database.
- A bucket map approach that allows you to add a fourth, fifth, or Nth shard later by relocating slices of data rather than rehashing the entire dataset.
- A failure of one shard that affects 1/N of your users rather than all of them.

These benefits come at a genuine cost. You must generate identifiers in the application, global uniqueness requires a secondary lookup table, and aggregate queries fan out across shards. A cross shard write, one that must atomically update data on two different shards, can no longer rely on a single database transaction. Instead it needs an orchestrated sequence of local transactions, where each step carries a compensating action that undoes its effect if a later step fails. None of these are insurmountable. They are simply responsibilities you now assume.

Sharding is a deliberate step to take only once a single primary has genuinely exhausted its write headroom. When you reach that point, the implementation in this post is a representative blueprint.

Stay Connected

We welcome your feedback and invite you to share your experiences or suggestions at AskAzureDBforMySQL@service.microsoft.com

Thank you for choosing Azure Database for MySQL!

KB5089573 forced install + Lenovo network stack degradation (HTTPS latency, build 26200.8524)
KB5089573 installed automatically on a Lenovo system despite updates being paused and no preview updates enabled. After installation, the system jumped to build 26200.8524 and the network stack degraded severely. Heavy HTTPS sites (LinkedIn, Google Finance, YouTube) take 20–60 seconds to load across all browsers. Speedtest is normal, other devices on the same network are unaffected, and both Edge and Chrome show identical latency. DISM shows no package for KB5089573 and the update cannot be uninstalled. Looking for correlation data from other Lenovo users.

Lessons Learned #540: Bulk Insert Throughput in Azure SQL Hyperscale with Partitioned Heap Tables
In this lesson learned, I would like to share an interesting scenario working on a service request where our customer was running a high-volume data load process in Azure SQL Database Hyperscale. The workload was based on a common pattern: Recreate a staging table. Load a large number of rows using bulk insert. The bulk insert showed unstable execution times and became the main area to investigate. The process was loading a very large number of rows into an Azure SQL Database Hyperscale database. The process used a staging table that was initially loaded as a heap. The main concern was the inconsistent execution time during the load process. Why Manually Adding Data Files Was Not the Right Direction In Azure SQL Database Hyperscale, the storage architecture is different from a traditional SQL Server deployment. The data layout and storage management are handled internally by the service. Because of this architecture, manually creating or pre-allocating multiple data files is not the same tuning option that we may consider in SQL Server on-premises or SQL Server running on Azure Virtual Machines. For this reason, the troubleshooting focus moved from manual file layout configuration to the actual workload pattern, waits, concurrency, batch size, and staging table design. What We Observed During the bulk insert phase, waits such as PAGELATCH_EX were observed. Since the staging table was loaded as a heap and the clustered primary key was created only after the bulk insert completed, OPTIMIZE_FOR_SEQUENTIAL_KEY was not directly applicable to the bulk insert phase. This changed the direction of the investigation. Instead of focusing on last-page insert contention on an existing clustered index, the analysis moved toward heap insert behavior, allocation contention, concurrency, batch size, and whether a different staging table design could help. 
First Recommendation: Start with Low-Impact Changes

Before changing the table design, the first recommendation was to test the least intrusive changes:

- Reduce the number of concurrent bulk insert sessions.
- Increase the batch size, for example from 10,000 rows to 50,000 or 100,000 rows.
- Test TABLOCK on the dedicated heap staging table.

The goal was to avoid assuming that more concurrency would always reduce the total execution time. In some high-volume load scenarios, excessive concurrency may increase contention and make the process less stable.

The Interesting Design Option: Partitioned Heap Staging Table

One of the most interesting design options was to evaluate a partitioned heap staging table. The idea is simple: instead of loading all rows into a non-partitioned heap staging table, the staging table can be created on the same partition scheme used by the target table, using the same partitioning column. This does not mean that a partitioned heap will always be faster. However, it can be a useful design option when:

- The bulk load phase is affected by allocation or latch contention.
- Concurrent load processes can naturally distribute rows across different partition ranges.
- The staging table is used only as an intermediate structure.

Lessons Learned

The main lessons from this scenario were:

- In Azure SQL Database Hyperscale, manually managing multiple data files is not the right tuning direction.
- PAGELATCH_EX during heap loading may point to concurrency or allocation-related contention.
- Reducing concurrency can sometimes improve total throughput.
- Larger batch sizes may provide better results than many small batches.
- TABLOCK on a dedicated heap staging table is a low-impact test worth evaluating.
- A partitioned heap staging table can be a valid second-phase design option when the load can be distributed across partition ranges.
- The best approach is to test small, measurable changes before introducing architectural redesigns.
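The two knobs discussed above, batch size and the number of concurrent sessions, can be kept as explicit parameters in the loader so that each test run changes only one of them. The Python sketch below shows one way to structure that. It is not the customer's loader: `load_batch` is a hypothetical callable standing in for whatever performs the actual bulk insert of one batch, and the defaults are only the example values mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(rows: Iterable, batch_size: int) -> Iterator[list]:
    """Yield consecutive lists of at most batch_size rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def load_in_batches(
    rows: Iterable,
    load_batch: Callable[[list], int],  # hypothetical: inserts one batch, returns row count
    batch_size: int = 50_000,
    max_sessions: int = 4,
) -> int:
    """Run load_batch over all batches with at most max_sessions running at once."""
    with ThreadPoolExecutor(max_workers=max_sessions) as pool:
        # Note: Executor.map collects its input eagerly, so every batch is held
        # in memory up front. That is acceptable for a sketch; a production
        # loader would submit batches lazily.
        return sum(pool.map(load_batch, batched(rows, batch_size)))


def fake_load(batch: list) -> int:
    # Stand-in for the real bulk insert call; it only reports the batch size.
    return len(batch)


print(load_in_batches(range(120_000), fake_load, batch_size=50_000, max_sessions=2))
```

Timing the same load with `max_sessions` lowered and `batch_size` raised, one change at a time, gives the small, measurable comparison recommended above.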
Final Thoughts

A partitioned heap staging table can be a powerful option, but only when it is tested carefully and when the workload pattern can benefit from partition distribution.

Performance Tuning Cold Starts, Scaling Delays, and Startup Latency in Azure Container Apps
Introduction There is a particular kind of frustration that comes not when your application fails to start, but when it starts too slowly. The container is running, the health probes pass, your monitoring shows green — and yet every few minutes a user somewhere in the world hits a request that takes 15 seconds to respond. Your support team starts getting tickets. Your SLA dashboard turns amber. This is the cold start problem, and it is one of the most widely discussed pain points with any serverless container platform. Azure Container Apps is no different. But what most engineers do not realize is that the cold start is only one part of the story. Scaling delays, inefficient image layers, wrong resource allocations, and misconfigured KEDA rules all compound to create latency spikes that feel indistinguishable from cold starts but have completely different root causes and fixes. In this part of the series, we break down each cause systematically and show you exactly how to address it. Understanding What "Cold Start" Actually Means in Container Apps Before we fix anything, it helps to understand what is actually happening during a cold start. When a new replica is created, Azure Container Apps needs to do several things in sequence before your application can serve a single request: The platform schedules the new replica on available infrastructure. The container runtime pulls the image layers that are not already cached on that node. The container starts and the process inside it begins executing. Your application framework initializes (the .NET DI container, Django's ORM layer, loaded ML models, etc.). The readiness probe passes, signaling that the replica can accept traffic. Every one of these steps takes time. The total duration is your cold start latency. When you have `minReplicas: 0`, this full cycle happens for every "first request after idle" scenario. 
With `minReplicas: 1`, steps 1 and 2 are already done, but steps 3–5 still happen whenever a new replica is created due to scaling out. Scenario 1: Requests Spike to 10+ Seconds After a Period of Inactivity What You See Everything looks fine during load testing, but the next morning after a quiet night, the first user to hit the app gets a timeout or a very slow response. You check your Application Insights or Log Analytics and you see exactly one request with a dramatically higher duration than all the others. Why This Happens You have `minReplicas` set to `0` (or it defaults to 0). When there are no replicas running and a new request arrives, the entire cold start sequence kicks off — and the request waits in the ingress queue the entire time. Depending on your image size and application initialization time, this can easily reach 15–30 seconds for a .NET application with a large DI graph, or even longer for a Python application that imports heavy libraries. The Fix Option A (Recommended for most workloads): Set `minReplicas` to 1. This ensures at least one replica is always warm and ready to handle requests. You will pay for that one replica's compute even during idle periods, but you eliminate the cold start for your users: az containerapp update --name my-dotnet-api --resource-group my-rg --min-replicas 1 --max-replicas 10 Or in your Container App YAML: scale: minReplicas: 1 maxReplicas: 10 rules: - name: http-scaling-rule http: metadata: concurrentRequests: "10" Option B: Reduce image size to speed up the pull. Every megabyte in your container image adds time to cold starts. A production .NET API should not be a 2 GB image. Use multi-stage builds to strip away the SDK, test tools, and development dependencies: # Stage 1: Build FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build WORKDIR /src COPY ["MyApp.csproj", "."] RUN dotnet restore COPY . . 
RUN dotnet publish -c Release -o /app/publish --no-restore # Stage 2: Runtime only - much smaller FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS final WORKDIR /app COPY --from=build /app/publish . # Run as non-root for security USER app EXPOSE 8080 ENTRYPOINT ["dotnet", "MyApp.dll"] For Django, the equivalent pattern is: FROM python:3.11-slim AS base # Install only production dependencies WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt && find /usr/local -name "*.pyc" -delete && find /usr/local -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null || true COPY . . RUN SECRET_KEY=placeholder python manage.py collectstatic --noinput USER nobody EXPOSE 8000 CMD ["gunicorn", "myproject.wsgi:application", "--bind", "0.0.0.0:8000", "--workers", "2"] Option C: Use a startup probe to manage the readiness window. If your app genuinely needs 20–30 seconds to initialize (loading configuration, warming caches, establishing connection pools), configure a startup probe separately from your liveness probe. This gives the container time to start without the liveness probe killing it prematurely: probes: - type: Startup httpGet: path: /health port: 8080 initialDelaySeconds: 5 periodSeconds: 5 failureThreshold: 12 # 12 * 5s = 60 seconds total window - type: Liveness httpGet: path: /health port: 8080 periodSeconds: 10 failureThreshold: 3 - type: Readiness httpGet: path: /health/ready port: 8080 periodSeconds: 5 failureThreshold: 3 Scenario 2: New Replicas Lag Behind Traffic Spikes What You See Your application handles steady traffic just fine. Then a sudden burst arrives — a marketing email goes out, a scheduled batch job triggers API calls, or a downstream system fires webhooks — and for 30–60 seconds your error rate jumps and your latency spikes. After that window, everything recovers. The scaling logs show new replicas were created, but they came online too late. 
Why This Happens KEDA (the scaling engine behind Container Apps) works reactively. By default, HTTP-based scaling triggers new replicas when concurrent requests exceed the configured threshold. But there is an inherent delay between the moment traffic spikes, the moment KEDA detects it, and the moment a new replica is warm and serving traffic. This window is where your users experience the pain. Additionally, if your image pull takes a long time (large image, first pull on a new node), the new replica arrives even later. KEDA cannot compensate for slow image pulls. The Fix Step 1 — Tune your KEDA scaling rules to trigger earlier. Rather than waiting until you are already at capacity, configure scaling to trigger with a lower concurrency threshold. If your app can handle 20 concurrent requests comfortably, set the threshold to 10 so new replicas spin up before you are overwhelmed: scale: minReplicas: 1 maxReplicas: 20 rules: - name: http-rule http: metadata: concurrentRequests: "10" # Scale earlier, not at capacity For Azure Service Bus or Event Hubs-triggered scaling (common in job-style workloads), use a queue length threshold that gives you a buffer: scale: minReplicas: 0 maxReplicas: 30 rules: - name: servicebus-rule custom: type: azure-servicebus metadata: queueName: my-processing-queue namespace: my-servicebus-namespace messageCount: "5" # Scale when queue depth reaches 5, not 100 auth: - secretRef: servicebus-connection triggerParameter: connection Step 2 — Pre-warm your connection pools in .NET. One of the biggest contributors to new replica slowness is the time spent establishing database connections and other external connections. The first request that hits a new replica bears the cost of opening the connection pool. 
Configure your connection pool to warm up eagerly at startup: // In Program.cs, after building the app if (app.Environment.IsProduction()) { // Warm up the database connection pool before accepting traffic using var scope = app.Services.CreateScope(); var dbContext = scope.ServiceProvider.GetRequiredService<AppDbContext>(); await dbContext.Database.ExecuteSqlRawAsync("SELECT 1"); } await app.RunAsync(); Step 3 — Enable HTTP/2 keep-alive and connection reuse. In .NET applications running behind the Container Apps ingress, configure your HTTP client to use connection pooling properly: builder.Services.AddHttpClient("downstream-api", client => { client.BaseAddress = new Uri("https://my-downstream-service"); client.DefaultRequestVersion = new Version(2, 0); }) .ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler { PooledConnectionLifetime = TimeSpan.FromMinutes(5), PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2), MaxConnectionsPerServer = 20 }); Scenario 3: Django Startup Is Slow Due to Import Time What You See Your Django application takes 8–12 seconds to start even on a warm node. You check the Gunicorn startup logs and see it spending most of that time in Python module imports before it ever processes a request. Why This Happens Python's import system is synchronous and single-threaded. When you import `django`, `rest_framework`, `pandas`, `numpy`, or any large library, Python reads and executes every module file in the dependency chain. A Django project with Django REST Framework, Celery, and a few third-party packages can easily spend 5–8 seconds just on imports. Multiply that by the number of Gunicorn workers (each is a separate process that imports everything independently) and startup time balloons. The Fix Step 1 — Profile import time to find the worst offenders. 
Add this to your Dockerfile's entrypoint or run it manually:

# Run this in a container shell to see which imports take the longest.
# The importtime report is pipe-delimited; field 2 is the cumulative time in microseconds.
python -X importtime -c "import django; django.setup()" 2>&1 | sort -t'|' -k2 -rn | head -20

Step 2: Use lazy imports for heavy dependencies that are not needed at startup. Instead of importing everything at the module level, defer imports to the functions that actually need them:

# Instead of this at the top of your file:
import pandas as pd
import numpy as np

# Do this: import only when the function is actually called.
def process_data(data):
    import pandas as pd
    import numpy as np
    df = pd.DataFrame(data)
    return df.describe().to_dict()

Step 3: Reduce Gunicorn worker count for memory-constrained environments. Having too many workers means too many independent Python processes all importing everything at the same time. For Container Apps with 0.5–1 vCPU, 2 workers is usually the right starting point:

CMD ["gunicorn", "myproject.wsgi:application", "--bind", "0.0.0.0:8000", "--workers", "2", "--worker-class", "gthread", "--threads", "4", "--timeout", "120", "--keep-alive", "5", "--log-level", "info"]

Step 4: Consider switching from Gunicorn to Uvicorn for async Django. If you are on Django 4.x with ASGI support, Uvicorn with async workers can handle significantly more concurrent requests per worker than synchronous Gunicorn workers:

CMD ["uvicorn", "myproject.asgi:application", "--host", "0.0.0.0", "--port", "8000", "--workers", "2", "--log-level", "info"]

Scenario 4: Resource Limits Are Causing Throttling and Slow Responses

What You See

Your application starts fine and handles light traffic well, but under moderate to heavy load — even well below your max replicas — individual requests become slow and CPU metrics show your replicas running near 100% utilization.
You may also see the .NET GC (garbage collector) running very frequently, or Django showing slow database queries that are actually fast queries being delayed because the process has no CPU to run.

Why This Happens

Container Apps defaults to 0.25 vCPU and 0.5 Gi memory if you do not specify resource limits. For a production .NET API or a Django application serving real traffic, this is almost always too little. When a container hits its CPU limit, the container runtime throttles it: the process continues to run but gets less CPU time, making everything slower without any obvious error signal.

The Fix

Step 1: Measure actual resource usage before guessing. Query Log Analytics for actual CPU and memory usage to establish a baseline:

    ContainerAppSystemLogs_CL
    | where ContainerAppName_s == "my-dotnet-api"
    | where TimeGenerated > ago(7d)
    | summarize
        AvgCpuUsage = avg(todouble(CpuUsageNanoCores_d)) / 1000000,
        MaxCpuUsage = max(todouble(CpuUsageNanoCores_d)) / 1000000,
        AvgMemoryMB = avg(todouble(MemoryWorkingSetBytes_d)) / 1048576,
        MaxMemoryMB = max(todouble(MemoryWorkingSetBytes_d)) / 1048576
        by bin(TimeGenerated, 1h)
    | order by TimeGenerated desc

Step 2: Update resource allocations based on what you observed.

    az containerapp update --name my-dotnet-api --resource-group my-rg --cpu 0.5 --memory 1.0Gi

Container Apps has specific valid CPU/memory combinations. The valid pairs are: `0.25 vCPU / 0.5 Gi`, `0.5 vCPU / 1.0 Gi`, `0.75 vCPU / 1.5 Gi`, `1.0 vCPU / 2.0 Gi`, and up to `4.0 vCPU / 8.0 Gi`. You cannot mix arbitrary values.

Step 3: Configure .NET GC for server workloads. By default, .NET uses the workstation GC mode which is tuned for interactive applications.
For server containers, use server GC mode and configure the heap size appropriately:

    // In runtimeconfig.template.json or via environment variables
    {
      "configProperties": {
        "System.GC.Server": true,
        "System.GC.HeapHardLimit": 805306368,
        "System.GC.HighMemoryPercent": 75
      }
    }

Or as environment variables in your Container App. GC environment variables take hexadecimal values, so the 768 MiB hard limit (805306368 bytes) is written as 0x30000000:

    az containerapp update --name my-dotnet-api --resource-group my-rg --set-env-vars "DOTNET_gcServer=1" "DOTNET_GCHeapHardLimit=0x30000000" "DOTNET_GCConserveMemory=5"

Measuring the Impact of Your Changes

After making changes, use this Log Analytics query to track your startup times over the past 24 hours and confirm the improvements:

    ContainerAppConsoleLogs_CL
    | where ContainerAppName_s == "my-dotnet-api"
    | where TimeGenerated > ago(24h)
    | where Log_s contains "Application started" or Log_s contains "Now listening on"
    | project TimeGenerated, Log_s, ContainerName_s
    | order by TimeGenerated desc

And check request duration percentiles in Application Insights:

    requests
    | where cloud_RoleName == "my-dotnet-api"
    | where timestamp > ago(24h)
    | summarize
        p50 = percentile(duration, 50),
        p90 = percentile(duration, 90),
        p99 = percentile(duration, 99),
        count = count()
        by bin(timestamp, 1h)
    | order by timestamp desc

Summary: Your Performance Tuning Quick Reference

Here is a quick decision guide based on what you are seeing:

| Symptom | Most Likely Cause | First Fix to Try |
| --- | --- | --- |
| First request after idle is very slow | `minReplicas: 0` | Set `minReplicas: 1` |
| Spike period has errors, then recovers | KEDA scaling too slow | Lower concurrentRequests threshold |
| New replicas start slowly | Large image size | Multi-stage Docker build |
| High CPU at moderate traffic | Under-allocated resources | Increase CPU/memory allocation |
| Django startup is slow | Heavy Python imports | Profile and defer imports |
| .NET app slow under load | Workstation GC mode | Enable server GC |

References and Sample Resources

Use these links to tune startup performance, scaling behavior, and runtime efficiency.
Azure Container Apps docs (core)
- Scale applications in Azure Container Apps
- Workload profiles overview
- Health probes in Azure Container Apps
- Revisions in Azure Container Apps
- Monitoring and logging in Azure Container Apps

Runtime and framework performance references
- Docker multi-stage builds
- .NET performance best practices for ASP.NET Core
- .NET runtime GC configuration
- Django performance optimization
- Uvicorn deployment guide

Scaling engine references and samples
- KEDA concepts and documentation
- KEDA scaler samples
- Azure Samples: .NET on Azure Container Apps
- Azure Samples: Python on Azure Container Apps

What's Next

In Part 3, we go deeper into the most specialized and complex scenario in this series: troubleshooting AI workloads in Azure Container Apps. Loading large ML models, managing GPU and CPU resource constraints, and dealing with memory pressure from inference workloads all require techniques that go beyond standard web application troubleshooting.

Part of the series: Troubleshooting Azure Container Apps in Production
Next: Part 3 - Troubleshooting ML Model Loading, GPU Issues, and Memory Pressure in Azure Container Apps

Your PostgreSQL workflow just found its new home in Cursor
TL;DR: Our Visual Studio Code extension for PostgreSQL is now available on the Open VSX registry, so Cursor users get first-class database tooling without leaving the editor that already understands their code.

The context switch problem

If you use Cursor, you know the feeling. You’re deep in an agentic flow. Composer is scaffolding a feature across multiple files. Tab is anticipating your next move. Then you need to check a table's schema or run a quick query, so you switch to a different tool, lose your flow state, and spend 30 seconds remembering which connection goes to which environment. That context switch is expensive. Not in minutes, but in momentum.

Why we built for Cursor (and Open VSX)

Cursor is built on the VS Code ecosystem, which means it supports VS Code extensions natively. It uses the Open VSX registry: an open, vendor-neutral extension marketplace where database tooling options have been limited. We saw an opportunity: bring a modern PostgreSQL extension directly to where developers do their most productive work. By publishing to Open VSX, we make sure that developers across the entire VS Code-compatible ecosystem, including Cursor, Windsurf, AWS Kiro, Theia, and Ona, all have access without workarounds.

Where AI-powered editing meets database awareness

This is where it gets interesting. Cursor indexes your entire codebase semantically. It knows your Drizzle schemas, your raw SQL files, and your migration directories. Our extension completes the picture by giving the editor a live connection to the actual database. Here’s where they intersect:

Schema explorer in your sidebar: browse tables, columns, indexes, and functions without leaving the editor. When Cursor’s agent asks “what columns does the users table have?”, the answer is already visible.
Screenshot: Object Explorer sidebar showing tables, columns, and indexes expanded Connection-aware IntelliSense: autocomplete table names, column names, and functions based on your live database schema. This pairs naturally with Cursor’s Tab completions: the AI writes the application logic, and IntelliSense validates the SQL. Inline EXPLAIN diagnostics: catch performance issues before they ship. Write a query and see whether it uses an index or triggers a sequential scan, all without running a separate tool. Zero-config connection discovery: we detect .env files, docker-compose.yml, and ORM connection strings in your project. Your database connection follows your workspace, not a global settings file buried three menus deep. Result export and inline execution: select SQL, run it, and see results in a clean panel. Export to JSON or CSV when you need to share findings with your team. Features that make Cursor + PostgreSQL even better Beyond the basics, the extension includes capabilities that pair especially well with AI-powered workflows: MCP server for AI assistants: the extension registers a Model Context Protocol (MCP) server, so Cursor’s agent can discover and interact with your PostgreSQL databases directly through a standardized tool interface. Ask your AI assistant to inspect a table, run a diagnostic query, or analyze a plan: it has the tools to do it. Agent Mode database tools: dedicated DBAgent MCP tools give AI assistants richer database-analysis capabilities, from schema introspection to performance diagnostics and instruction management. Query plan visualization: explore EXPLAIN output in four synchronized views: an interactive node graph, icicle chart, sortable table, and raw source. Color-coded severity groups expose bottlenecks at a glance, and AI-assisted analysis provides optimization suggestions. Performance dashboard: investigate database performance with DB load charts, query activity, wait-event analysis, session health, and blocking chains. 
Use natural language to inspect trends, identify bottlenecks, and generate diagnostic SQL. Object Explorer search: find database objects by name without expanding the tree. Search across connections, databases, and schemas. Filter by object type or schema name and navigate directly to any result. Schema-aware “New Query”: right-click a schema in Object Explorer to open a new query with the appropriate search_path already set. No more manual SET search_path before writing queries. Multi-source connection profiles: save connection profiles to your user settings, workspace, or folder. Check workspace profiles into source control so every team member gets the right database connections when they open the project. SSH tunneling built in: connect to databases on private networks through SSH tunnels configured directly in the connection dialog, with ssh-agent support for private key authentication. Built for how you actually work Modern development means ephemeral environments, branch-specific databases, and containers everywhere. The extension is designed around this reality: Automatic detection of PostgreSQL instances running in Docker Project-scoped connections that travel with your workspace Support for standard PostgreSQL connections Integration with both local and cloud versions of PostgreSQL from multiple vendors, and first-class support for Azure Database for PostgreSQL and Azure HorizonDB, with provisioning, backup management, and network configuration: all without leaving the editor. Status bar indicator showing your active database at a glance Get started Install from the Open VSX Registry: search for it in Cursor’s extension panel or install the .vsix directly. Your existing VS Code workflow carries over unchanged. If you’re already using Cursor for its agentic capabilities, adding database awareness to the editor means fewer tabs, fewer context switches, and a tighter feedback loop between your application code and the data layer underneath it. 
Available now on Open VSX. Works with Cursor, Antigravity, and all VS Code-compatible editors.

SELECT * FROM build2026_sessions WHERE postgres = true;
Microsoft Build 2026 is around the corner, and this year it’s shaping up to be a big one for PostgreSQL experts and enthusiasts. If you’re a developer working with Postgres, or just love exploring new database technology, there's plenty to get excited about. Microsoft’s new cloud-first evolution of PostgreSQL, Azure HorizonDB, alongside sessions featuring Azure Database for PostgreSQL, will highlight how Postgres is powering the next wave of AI-driven applications.

A new horizon in Postgres

Build 2026 arrives at a time when the role of databases in modern apps is evolving rapidly. From enabling AI model integration to scaling seamlessly across the cloud, PostgreSQL developers today are dealing with more complex demands than ever. That’s why Azure HorizonDB – Microsoft’s new cloud-native PostgreSQL service – is generating so much buzz ahead of Build.

What is Azure HorizonDB? In short, it’s a reimagined version of PostgreSQL designed for cloud-scale performance and AI-era workloads. Azure HorizonDB introduces a distributed architecture that decouples compute and storage, delivering sub-millisecond latencies and three times the throughput of self-managed Postgres at massive scale. It aims to preserve Postgres’s beloved features and SQL ecosystem while adding next-generation capabilities: built-in vector indexing for high-speed AI/ML retrieval, the ability to run AI models and vector operations directly in the database, and multi-zone replication for resilience. For Postgres developers, this means less time stitching together external data stores or machine learning services – and more time building powerful apps on a unified platform that’s both familiar and built for the future.

The bottom line: Microsoft Build 2026 is an ideal opportunity for developers to see Azure HorizonDB in action, learn best practices for modern PostgreSQL architectures, and understand how to leverage Postgres in new scenarios like generative AI and multi-agent applications.
Read on for a rundown of sessions covering these topics, complete with what you’ll learn from each one.

Top sessions for PostgreSQL databases on Azure

Below are key sessions tailored for PostgreSQL users and those interested in Azure HorizonDB, with session types and highlights of what you’ll gain by attending.

🎤 Breakout: From Rows to Reasoning: Designing Databases for AI Apps and Agents (BRK223, 45 min, in-person and digital options)
Discover how to architect databases that can power tomorrow’s intelligent applications. This technical breakout will show how AI-ready databases can move beyond plain transactions. You’ll see live demos of integrating transactional, analytical, and vector data in one unified platform, with Azure’s new database capabilities, including Azure HorizonDB. Learn how to simplify your stack by eliminating separate analytics engines or vector stores. The session will highlight patterns that reduce data movement and latency so your apps can efficiently reason over live data with minimal complexity.

🧪 Hands-on lab: Create Advanced Postgres-Powered Agentic Apps with Azure HorizonDB (LAB511, 75 min, in-person and digital options)
Roll up your sleeves and get hands-on building a real multi-agent AI application with Postgres. In this advanced lab, you’ll create a production-ready AI agent powered by Azure HorizonDB as an all-in-one data, search, and intelligence layer. Experiment with retrieval-augmented generation (RAG) by combining semantic vector search (DiskANN) with traditional SQL queries right inside the database. Implement hybrid search and agent workflows without resorting to external vector databases or glue code – thanks to HorizonDB’s built-in vector indexing and in-database AI model capabilities. This lab is perfect for developers who want to experience how HorizonDB can simplify their stack and boost performance for AI-driven apps. Multiple hands-on labs are offered to suit your schedule.
💻 Demo: Simplify App Dev with Cloud-Native PostgreSQL in Azure HorizonDB (DEM364, 25 min, in-person and digital options) See how to cut your development time and complexity with built-in AI and search features in Postgres. This fast-paced demo shows how Azure HorizonDB helps eliminate the need for separate search engines and AI services by pulling those capabilities straight into PostgreSQL. Expect to learn how you can run hybrid vector + keyword queries using SQL, integrate AI models directly from within the database, and apply full-text search (BM25) and semantic ranking to get smarter results. If you’re eager to deliver intelligent apps faster, with fewer moving parts, this session will show how HorizonDB simplifies your architecture without sacrificing performance. ⚡Lightning Talk: Cloud-Native PostgreSQL, Rebuilt for Scale: Azure HorizonDB (LTG413, 15 min, in-person only) Get a rapid-fire introduction to the architecture behind HorizonDB’s eye-popping performance. This short talk dives into how HorizonDB re-architects core PostgreSQL to deliver effortless scale out and blazing speed. Learn how decoupled compute and storage, predictive caching, and multi-region replication combine to achieve sub-millisecond query latencies and 3× higher throughput than standard Postgres. If you care about performance tuning and high-scale database design, don’t miss this quick primer on the tech under HorizonDB’s hood. 👥 Interactive Table Talk: Scaling PostgreSQL for AI Apps: Patterns and Tradeoffs (TT622, 45 min, in-person only) Bring your questions and ideas to this collaborative discussion. In this open round-table session with community and Microsoft experts, you’ll explore architecture patterns for scaling PostgreSQL to meet the demands of agent-based and AI-driven applications. Discuss real-world strategies for handling vector embeddings in Postgres, unifying relational and document data, integrating with AI models, and more. 
Compare the trade-offs between different scaling approaches – from monolithic to microservices, sharding strategies, and new technologies like HorizonDB – and learn where each design shines or struggles in production. Come ready to share your experiences and learn from others in the room.

▶️ On-Demand: Smarter PostgreSQL Migrations to Power Modern, Intelligent Apps (OD822, 30 min, digital only)
Planning to migrate to Postgres or move your databases to Azure? Start here. This on-demand session focuses on new tools and proven strategies to migrate large-scale databases to Azure Database for PostgreSQL quickly and safely. You’ll see AI-assisted migration tools in action that minimize downtime and risk when moving terabytes of data. Just as importantly, you’ll learn how migrating to Azure unlocks advanced capabilities – from boosted performance and enhanced security to AI-ready features – helping you turn your newly migrated data into intelligent apps and services. The on-demand session will be available to stream on the first day of Build.

Meet the team: PostgreSQL expert meetups

If you’re attending Build in person, stop by the Expert Meetup (EMU) area and head to the relational cloud databases booth. This is your chance to talk directly with the engineers and product teams behind PostgreSQL on Azure. Bring your questions about architecture decisions, scaling patterns, migrations, AI workloads, or anything else on your mind. Whether you want to sanity-check a design, dig deeper into something you saw in a session, or give direct feedback, the EMU space is designed for exactly that convo.

How to get the most out of Build (and what to do next)

With so much great content lined up, how do you decide where to start?
It really depends on what you’re most excited about: Curious about AI and agentic apps: Start with From Rows to Reasoning, then go deeper with the Simplify App Dev with HorizonDB demo or get hands-on at the Azure HorizonDB labs to see how these ideas work in practice. Performance and scale are your focus: The short Lightning Talk on HorizonDB’s cloud-native architecture and the Table Talk on scaling Postgres will both provide unique insights and pro tips for running Postgres at massive scale. Planning a migration to PostgreSQL on Azure: Watch the Smarter PostgreSQL Migrations on-demand session to learn how to migrate large workloads with minimal downtime, and the benefits you can unlock after moving to Azure. Looking for real answers to your specific questions: Make time for the PostgreSQL Expert Meetup area to connect directly with the team. No matter which sessions you choose, Build 2026 promises to be an exciting event for the PostgreSQL developer community. Browse the session catalog, save the sessions that match your interests, and we’ll see you at Build.

Building and Operating a Microsoft Foundry Hosted Agent with GitOps and GitHub Tasks
The Gap Between Prototype and Production Most AI engineering teams can build a working agent in a day. The hard part is not building it; the hard part is operating it. Prompts drift. Tool configurations change without review. Deployments happen from someone's laptop. There is no audit trail, no rollback plan, and no consistent way to promote a change from a development environment to production. GitOps closes that gap. By treating your agent definition, configuration, and infrastructure as version-controlled source code, you get the same delivery discipline that software engineering teams have applied to application code for years. Every change is reviewed, every deployment is automated, and every environment state is traceable to a specific commit. This post shows you how to apply GitOps principles to a Microsoft Foundry Hosted Agent using GitHub as the source of truth and GitHub Tasks and Actions as the automation layer. The result is a repeatable, governed, production-ready delivery model for AI agents. What Is a Microsoft Foundry Hosted Agent? Microsoft Foundry is Microsoft's platform for building, deploying, and operating AI applications and agents. A Hosted Agent is an agent runtime managed by the Foundry platform rather than self-hosted by your team. You supply the agent logic, configuration, and tools; Foundry handles the runtime lifecycle, scaling, and managed infrastructure. In practical terms, a Foundry Hosted Agent is a containerised agent application. You package your agent code, prompt definitions, tool bindings, and environment configuration into a container image. Foundry deploys and manages that container within a Foundry project, connected to models, tools, and observability infrastructure that the platform provides. 
Teams choose Hosted Agents over self-hosting because: The platform manages runtime infrastructure, patching, and scaling Integration with Azure AI models, managed identity, and observability is built in You can focus engineering effort on agent logic rather than cluster management Foundry projects provide environment and resource isolation without requiring you to provision and manage separate Azure resources for each environment Hosted Agents are a good fit when your team wants strong operational support with minimal platform overhead, when you need clear separation between environments, and when your agents depend on Azure AI capabilities such as Azure OpenAI Service, Azure AI Search, or Model Context Protocol integrations. Why GitOps Matters Specifically for AI Agents GitOps is straightforward for stateless web services: the code changes, the pipeline runs, the container is deployed. AI agents are more complex because there are multiple distinct artefacts that all affect agent behaviour: System prompts and instruction files Tool definitions and external integrations Model selection and configuration (temperature, max tokens, safety settings) Model Context Protocol (MCP) server definitions Orchestration logic and agent workflow code Safety and policy settings Infrastructure and deployment configuration Any one of these can change the behaviour of your agent in ways that are difficult to detect without structured review. A prompt change that looks harmless can alter tone, scope, or factual grounding. A tool configuration change can expose data to unintended callers. A model upgrade can shift response quality unpredictably. Git gives you a single place to version, review, and approve all of these artefacts together. Pull requests give you a structured review gate. Workflow automation gives you validation before anything reaches a deployed environment. Tags and releases give you deployment markers you can roll back to. 
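Because prompts, tool definitions, and model settings all shape behaviour, it can help to reduce them to a single value that changes whenever any of them changes. The following is a minimal sketch, not part of Foundry or GitHub: the function names `fingerprint` and `load_artefacts`, and any file paths passed to them, are hypothetical and chosen only for illustration. It hashes a set of artefacts into one SHA-256 fingerprint that could be recorded next to a release tag.

```python
import hashlib
from pathlib import Path


def fingerprint(artefacts: dict[str, bytes]) -> str:
    """Return one SHA-256 hex digest covering every artefact name and its content."""
    digest = hashlib.sha256()
    for name in sorted(artefacts):  # sorted so insertion order does not matter
        content = artefacts[name]
        encoded_name = name.encode("utf-8")
        # Length-prefix each field so different name/content splits cannot collide.
        digest.update(len(encoded_name).to_bytes(8, "big"))
        digest.update(encoded_name)
        digest.update(len(content).to_bytes(8, "big"))
        digest.update(content)
    return digest.hexdigest()


def load_artefacts(root: Path, patterns: tuple[str, ...]) -> dict[str, bytes]:
    """Read every file under root that matches a glob pattern, keyed by relative path."""
    files = {p for pattern in patterns for p in root.glob(pattern) if p.is_file()}
    return {p.relative_to(root).as_posix(): p.read_bytes() for p in files}
```

A CI step might call `fingerprint(load_artefacts(Path("src"), ("prompts/**/*", "tools/**/*")))` and log the result with the deployment, so two environments can be compared by fingerprint rather than by inspecting each file. The glob patterns here are assumptions about the repository layout, not a required convention.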
The discipline of GitOps turns what is often an ad-hoc AI delivery process into a repeatable engineering practice.

Reference Architecture

The following diagram shows a practical reference architecture for delivering a Microsoft Foundry Hosted Agent through a GitOps model using GitHub.

    +---------------------------+
    |     GitHub Repository     |
    |  /src  /agents  /tools    |
    |  /prompts  /infra         |
    |  /.github/workflows       |
    +---------------------------+
                  |
                  | Pull Request / Push to main
                  v
    +---------------------------+
    |      GitHub Actions       |
    | 1. Validate agent config  |
    | 2. Lint and scan code     |
    | 3. Run unit tests         |
    | 4. Build container image  |
    | 5. Push to registry       |
    +---------------------------+
                  |
                  | Image tag (SHA or semver)
                  v
    +---------------------------+
    | Azure Container Registry  |
    | myregistry.azurecr.io     |
    | my-agent:<sha>            |
    +---------------------------+
                  |
           +------+------+
           |             |
           v             v
      +----------+  +----------+
      | Foundry  |  | Foundry  |
      | Dev      |  | Test     |
      | Project  |  | Project  |
      +----------+  +----------+
                         |
                         | Approval gate (GitHub env)
                         v
                    +----------+
                    | Foundry  |
                    | Prod     |
                    | Project  |
                    +----------+
                         |
                         v
    +---------------------------+
    |       Observability       |
    | Azure Monitor / App       |
    | Insights / Foundry Logs   |
    +---------------------------+

Key design decisions in this architecture:

- The GitHub repository is the single source of truth for all agent artefacts
- No human deploys directly to any Foundry project; all changes flow through automation
- Environment promotion requires a GitHub environment approval, creating a governance gate
- The container image is built once and promoted across environments; the image is not rebuilt per environment
- Secrets are stored in Azure Key Vault and accessed by the Foundry agent at runtime via managed identity

Figure: GitOps delivery pipeline stages from commit to production

Repository Structure

A well-structured repository separates agent logic from infrastructure and tooling from prompts.
The following structure works well in practice:

    my-foundry-agent/
    ├── .github/
    │   ├── workflows/
    │   │   ├── validate.yml          # Runs on every PR
    │   │   ├── build-deploy.yml      # Runs on merge to main
    │   │   └── rollback.yml          # Manual trigger workflow
    │   └── CODEOWNERS                # Review assignments by path
    ├── src/
    │   ├── agents/
    │   │   ├── agent.py              # Agent entry point and orchestration
    │   │   └── agent_config.json     # Agent metadata and settings
    │   ├── tools/
    │   │   ├── search_tool.py        # Tool implementations
    │   │   └── data_tool.py
    │   └── prompts/
    │       ├── system.txt            # System prompt (versioned as plain text)
    │       └── instructions.txt      # Supplementary instructions
    ├── tests/
    │   ├── unit/                     # Unit tests for tools and logic
    │   ├── integration/              # Integration tests against a running agent
    │   └── smoke/                    # Post-deployment smoke tests
    ├── infra/
    │   ├── main.bicep                # Foundry project and resource definitions
    │   └── environments/
    │       ├── dev.parameters.json
    │       ├── test.parameters.json
    │       └── prod.parameters.json
    ├── scripts/
    │   ├── validate_agent.py         # Config validation script
    │   └── smoke_test.py             # Smoke test runner
    ├── Dockerfile                    # Container image definition
    └── docs/
        └── architecture.md           # Architecture and runbook documentation

What belongs where and why:

- /src/prompts - System prompts as plain text files. Versioning prompts as files means every change goes through a pull request with a diff review, just as code does.
- /src/agents - Agent orchestration logic and configuration. Keeps the entry point and agent metadata co-located.
- /src/tools - Tool implementations separated from agent logic. Tool logic changes independently and should be reviewable in isolation.
- /infra - Infrastructure as code with per-environment parameter files. Environment-specific values live here, never in source files.
- /tests - Three layers of testing: unit tests for tools, integration tests for the full agent, and smoke tests that run against a deployed environment.
- /.github/workflows - All automation defined as code.
There should be no manual deployment steps that live outside this directory. GitHub Tasks Across the Delivery Lifecycle GitHub Tasks and Issues provide the work tracking layer on top of the GitOps delivery model. Used well, they connect the intention behind a change to its implementation and deployment history. Practical patterns for using GitHub Tasks with agent delivery: Prompt change task - Open an issue to describe why the system prompt is changing. The pull request that changes system.txt closes that issue, creating a permanent link between the rationale and the diff. Tool integration task - When adding a new MCP server or external tool integration, create a task that captures the design decision, security review outcome, and test evidence before the pull request is merged. Model upgrade task - When upgrading the underlying model version, create a task that includes evaluation results and comparison data. The task becomes part of your change audit trail. Rollback task - If a deployment causes quality regressions, create a task to track the rollback, root cause investigation, and corrective action. Automation can open this task automatically when a deployment fails health checks. Dependency on approval - GitHub Tasks can be linked to environment approvals in GitHub Actions. A task in a specific milestone or project column can gate a promotion workflow. The key insight is that GitHub Tasks are not just work management; they are part of your audit trail. A regulatory or security reviewer can follow the chain from a production deployment back through workflow runs, pull request reviews, and the original task that described the intent of the change. End-to-End GitOps Flow The following walk-through describes a realistic developer experience for changing an agent prompt and promoting it to production. A developer opens a GitHub Issue describing the prompt change required and the expected behaviour improvement. 
The developer creates a feature branch, edits src/prompts/system.txt , and updates any related unit tests. A pull request is opened. The validate workflow runs immediately, checking prompt length, configuration schema, and lint rules. Unit tests run against the changed files. A code reviewer approves the pull request. The CODEOWNERS file ensures that prompt changes require review from the AI engineering team, not just any contributor. On merge to main, the build workflow runs: the container image is built with the new prompt baked in, tagged with the commit SHA, and pushed to Azure Container Registry. The deployment workflow deploys the new image to the Foundry Dev project automatically. Integration and smoke tests run against the deployed dev agent. If tests pass, the workflow pauses at the Test environment gate and requests approval from a named reviewer. After approval, the same image is deployed to Foundry Test. Smoke tests run again. A second approval gate controls promotion to Foundry Prod. If at any point a health check or smoke test fails, the rollback workflow redeploys the previous image tag from the registry. The image tag of the last known-good deployment is stored as a GitHub environment variable. This flow means that no human ever deploys directly to any environment. Every environment state is traceable to a specific commit, image tag, and workflow run. Security and Governance AI agents often have access to sensitive data and external systems. Security and governance cannot be an afterthought. Identity and Access Use managed identity for the Foundry Hosted Agent to access Azure resources. Avoid service principal secrets where Microsoft Entra Workload Identity or managed identity is available. Apply the principle of least privilege: the agent identity should have read access to data sources and limited write access only where the use case requires it. 
Tool integrations that require API keys or external credentials should retrieve them from Azure Key Vault at runtime, never from environment variables baked into the image. Secrets and Configuration Store secrets in Azure Key Vault. Reference them in your Foundry project configuration using Key Vault references. Store GitHub Actions secrets using repository or environment-scoped secrets. Never echo secrets in workflow logs. Separate environment configuration (endpoints, resource names, capacity settings) from agent logic. Use the /infra/environments/ parameter files for this. Auditability and Review Enforce pull request reviews for all changes to /src/prompts , /src/agents , and /infra via CODEOWNERS. Require status checks to pass before merging. Blocked merges prevent untested changes reaching production. GitHub's workflow run history gives you a complete deployment audit trail. You can answer "what was deployed to prod on Tuesday and who approved it" in seconds. For regulated environments, consider branch protection rules that require signed commits. Safe Rollout Use canary or blue-green patterns where Foundry supports them for high-traffic agents. Always keep the previous image tag available in the registry. Do not delete images on deployment. Document and test your rollback procedure before you need it in production. Observability and Operational Readiness A deployed agent that you cannot observe is an agent you cannot operate. Build observability in from the start. What to Monitor Deployment health - Track whether each Foundry deployment succeeded and the agent is responding. Wire deployment outcomes back to GitHub workflow run status. Model and tool errors - Log tool call failures, model timeout errors, and safety filter activations. Aggregate these in Azure Monitor or Application Insights. Latency - Track end-to-end response latency per agent version. A latency increase after a model or prompt change is an early signal of a quality regression. 
Token consumption - Monitor token usage per request and per session. Unexpected increases can indicate prompt injection or runaway orchestration loops. Traceability - Log which agent version handled each request. Correlation between the image tag and request traces is essential for debugging production issues. Debugging and Alerting Use structured logging with a consistent schema. Include fields for agent version, session ID, tool called, and outcome. Set up alerts for error rate thresholds and latency percentiles. Alert before users notice the problem. For failed agent runs, ensure logs capture the full conversation context (within your data retention policy) so that developers can reproduce and diagnose the failure. Microsoft Foundry Toolboxes One of the most important additions to the Foundry platform is Toolboxes, currently in Public Preview. If you have ever seen an agent codebase where three different agents each wire the same search tool with their own credentials and slightly different configurations, you already understand the problem Toolboxes solve. A Toolbox is a named, versioned bundle of tools managed centrally in Microsoft Foundry. You define the tools once, configure authentication and access centrally, and publish a single MCP-compatible endpoint. Any agent in any runtime consumes that endpoint without per-tool wiring, custom SDK integration, or duplicated credential management. Figure: Before and after Foundry Toolboxes. Each agent previously managed its own tool connections. With Toolboxes, agents connect to one governed endpoint. The Four Pillars Discover (coming soon) - Find approved tools without browsing long catalogues. Reduces duplication by surfacing what already exists before developers build something new. Build (available today) - Select tools into a named toolbox. 
Supported types include built-in tools (Web Search, Code Interpreter, File Search, Azure AI Search), MCP servers, Agent-to-Agent (A2A) endpoints, and OpenAPI-defined services. Consume (available today) - A single MCP-compatible endpoint exposes every tool in the toolbox to any agent runtime. Agents that can speak MCP can use a Foundry Toolbox without any Foundry-specific SDK dependency. Govern (coming soon) - Centralised authentication and observability applied to every tool call flowing through the toolbox. Security and platform teams get consistent controls without asking developers to bolt governance onto every agent individually. Toolboxes and GitOps: A Natural Fit Toolboxes are particularly well-suited to a GitOps delivery model because the toolbox definition is a discrete, versioned artefact. Instead of credentials and tool configuration scattered across agent codebases, the toolbox becomes its own managed entity with its own version history. The key design property is that the toolbox endpoint URL is stable. When you promote a new toolbox version to be the default, agents consuming the endpoint pick up the update without any code changes. This means you can update tool configuration, add a new MCP server, or rotate credentials in the toolbox without redeploying every agent that uses it. Figure: Toolbox versioning in a GitOps model. Commits trigger CI validation and deployment of new toolbox versions. The stable endpoint URL allows agents to consume updates without redeployment. Adding a Toolbox to Your Repository In your GitOps repository, toolbox definitions belong in /src/tools/toolbox_config.py or as a declarative configuration file checked into version control. The following example creates a toolbox that combines web search, Azure AI Search over internal documentation, and a GitHub MCP server: # src/tools/toolbox_config.py # Run this via CI to create or update a toolbox version in Foundry. 
from azure.identity import DefaultAzureCredential from azure.ai.projects import AIProjectClient import os client = AIProjectClient( endpoint=os.environ["FOUNDRY_PROJECT_ENDPOINT"], credential=DefaultAzureCredential() ) toolbox_version = client.beta.toolboxes.create_toolbox_version( toolbox_name="customer-feedback-toolbox", description="Tools for triaging customer feedback: search, docs, and GitHub.", tools=[ { "type": "web_search", "description": "Search approved public documentation sites.", "custom_search_configuration": { "project_connection_id": os.environ["BING_CONNECTION_NAME"], "instance_name": os.environ["BING_INSTANCE_NAME"] } }, { "type": "azure_ai_search", "name": "product-manuals-search", "description": "Search internal product documentation.", "azure_ai_search": { "indexes": [ { "index_name": os.environ["SEARCH_INDEX_NAME"], "project_connection_id": os.environ["SEARCH_CONNECTION_ID"] } ] } }, { "type": "mcp", "server_label": "github", "server_url": "https://api.githubcopilot.com/mcp", "project_connection_id": os.environ["GITHUB_CONNECTION_ID"] } ], ) print(f"Toolbox version created: {toolbox_version.version}") print(f"MCP endpoint: {toolbox_version.mcp_endpoint}") To promote a toolbox version to be the default (the endpoint agents use without specifying a version), add this to your deployment workflow: # Promote toolbox version to default after validation toolbox = client.beta.toolboxes.update( toolbox_name="customer-feedback-toolbox", default_version=toolbox_version.version, ) print(f"Default version is now: {toolbox.default_version}") The stable endpoint for agents consuming this toolbox is: https://<your-project>.services.ai.azure.com/api/projects/<project>/toolbox/customer-feedback-toolbox/mcp?api-version=v1 Attaching the Toolbox to Your Hosted Agent In your agent code, connect to the toolbox via a single MCP tool definition. 
The agent gains access to every tool in the toolbox without knowing their individual configurations: # src/agents/agent.py (relevant excerpt) from agent_framework import MCPStreamableHTTPTool import httpx, os toolbox_endpoint = os.environ["FOUNDRY_TOOLBOX_ENDPOINT"] http_client = httpx.AsyncClient( auth=_ToolboxAuth(token_provider), # Microsoft Entra bearer token timeout=120.0, ) mcp_tool = MCPStreamableHTTPTool( name="toolbox", url=toolbox_endpoint, http_client=http_client, load_prompts=False, ) # Agent now has access to web search, AI Search, and GitHub MCP # through one tool definition and one authenticated connection. GitOps Workflow Extension for Toolboxes Add a dedicated job to your build-deploy workflow to create and promote toolbox versions as part of the same CI/CD pipeline: deploy-toolbox: name: Deploy Toolbox Version needs: validate runs-on: ubuntu-latest environment: dev permissions: id-token: write contents: read steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID_DEV }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Create toolbox version in Foundry env: FOUNDRY_PROJECT_ENDPOINT: ${{ vars.FOUNDRY_PROJECT_ENDPOINT_DEV }} BING_CONNECTION_NAME: ${{ vars.BING_CONNECTION_NAME }} BING_INSTANCE_NAME: ${{ vars.BING_INSTANCE_NAME }} SEARCH_INDEX_NAME: ${{ vars.SEARCH_INDEX_NAME }} SEARCH_CONNECTION_ID: ${{ vars.SEARCH_CONNECTION_ID }} GITHUB_CONNECTION_ID: ${{ vars.GITHUB_CONNECTION_ID }} run: python src/tools/toolbox_config.py Key points to note: Toolbox configuration is Python code in source control, reviewed through pull requests like any other change Connection IDs and index names are environment variables from GitHub Actions variables, not hardcoded in the script The same script runs for dev, test, and prod with different environment variable bindings Toolbox version promotion is a separate step from agent deployment, so you 
can update tools independently of the agent container Because the toolbox endpoint is stable, rolling back a toolbox version does not require rolling back the agent image Common Pitfalls Teams adopting this pattern commonly make the following mistakes. Identifying them early saves significant operational pain later. Treating prompts as unmanaged text. If your system prompt lives in a portal text box rather than a versioned file, you have no history, no review process, and no rollback capability. Move prompts into source control on day one. Deploying manually from the portal. Even one manual deployment breaks the GitOps contract. Your repository no longer reflects the true state of the environment. Automate everything and remove portal deployment permissions from individuals. Mixing environment configuration into source files. Hardcoded endpoint URLs or model deployment names in agent_config.json mean your dev and prod configurations diverge at the source level. Use parameter files and environment variables resolved at deployment time. Poor separation between agent logic and tool logic. When agents and tools are tightly coupled in a single file, a tool change requires a full agent review and redeployment. Keep them separate so they can evolve independently. Not versioning your Toolbox definition. Defining a Foundry Toolbox interactively through the portal gives you no audit trail and no rollback path. The toolbox configuration script belongs in source control alongside your agent code. Skipping evaluation before promotion. Deploying a prompt change without running a structured evaluation against a representative test set is how regressions reach production. Build evaluation into the pull request workflow, not just the deployment workflow. No rollback plan. If your first rollback is unplanned and urgent, it will be slow and stressful. Test your rollback procedure in a non-production environment and document the steps. Ignoring token and cost signals. 
AI workloads have variable cost profiles. A change that doubles average token consumption per request may be functionally correct but economically unsustainable. Monitor consumption as a first-class signal. Example GitHub Actions Workflow The following workflow runs on pull request validation and on merge to main. It covers the core delivery lifecycle: validate, build, deploy to dev, and smoke test. # .github/workflows/build-deploy.yml name: Build and Deploy Foundry Hosted Agent on: push: branches: - main pull_request: branches: - main env: REGISTRY: myregistry.azurecr.io IMAGE_NAME: my-foundry-agent jobs: validate: name: Validate Agent Configuration runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Set up Python uses: actions/setup-python@v5 with: python-version: "3.12" - name: Install dependencies run: pip install -r requirements.txt - name: Validate agent config schema run: python scripts/validate_agent.py - name: Run unit tests run: pytest tests/unit/ -v - name: Lint code run: ruff check src/ build: name: Build and Push Container Image needs: validate runs-on: ubuntu-latest if: github.ref == 'refs/heads/main' permissions: id-token: write contents: read outputs: image_tag: ${{ steps.meta.outputs.version }} steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Log in to Azure Container Registry run: az acr login --name ${{ env.REGISTRY }} - name: Extract metadata id: meta uses: docker/metadata-action@v5 with: images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }} tags: | type=sha,format=short - name: Build and push image uses: docker/build-push-action@v7 with: context: . 
push: true tags: ${{ steps.meta.outputs.tags }} deploy-dev: name: Deploy to Foundry Dev needs: build runs-on: ubuntu-latest environment: dev permissions: id-token: write contents: read steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID_DEV }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Deploy agent to Foundry Dev project run: | az ai foundry agent deploy \ --project ${{ vars.FOUNDRY_PROJECT_DEV }} \ --image ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ needs.build.outputs.image_tag }} \ --environment dev - name: Run smoke tests against dev run: pytest tests/smoke/ -v --base-url ${{ vars.AGENT_URL_DEV }} deploy-test: name: Deploy to Foundry Test needs: deploy-dev runs-on: ubuntu-latest environment: test permissions: id-token: write contents: read steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID_TEST }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Deploy agent to Foundry Test project run: | az ai foundry agent deploy \ --project ${{ vars.FOUNDRY_PROJECT_TEST }} \ --image ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ needs.build.outputs.image_tag }} \ --environment test - name: Run smoke tests against test run: pytest tests/smoke/ -v --base-url ${{ vars.AGENT_URL_TEST }} Key decisions in this workflow: Validation runs on every pull request, not just on merge. Fast feedback catches problems before review. The container image is built once and the image tag is passed forward to deployment jobs. The same artefact is promoted across environments. Authentication uses OIDC federated credentials via azure/login@v3 with id-token: write permissions. No long-lived secrets are stored in GitHub for Azure authentication. 
The environment: test directive in the deploy-test job triggers a GitHub environment approval gate. A named reviewer must approve before the job runs. Smoke tests run after every deployment. A failed smoke test prevents further promotion. Best Practices Checklist Use this checklist when adopting the GitOps pattern for a Microsoft Foundry Hosted Agent: All agent artefacts, including prompts, tool definitions, model configuration, and Toolbox configuration scripts, are committed to source control No manual deployments to any environment; all changes flow through GitHub Actions workflows Pull request reviews are enforced for all changes to agent logic, prompts, and infrastructure via CODEOWNERS Unit tests cover tool logic; integration tests cover end-to-end agent behaviour; smoke tests cover deployed environments Container images are built once per commit and promoted across environments; images are not rebuilt per environment Environment configuration (endpoints, resource names) lives in parameter files, never in source code Secrets are stored in Azure Key Vault and accessed via managed identity at runtime GitHub environment approval gates control promotion from dev to test to prod Foundry Toolboxes are used to centralise tool definitions, credentials, and access governance across all agents; the toolbox configuration script is version-controlled and deployed through CI/CD Toolbox versions are promoted via the update default_version API step in the deployment workflow, not manually through the portal Latency, error rate, and token consumption are monitored with alerting thresholds The rollback procedure is documented, automated, and has been tested in a non-production environment GitHub Issues are used to record the intent behind significant changes and link to the pull requests that implement them Branch protection rules prevent direct pushes to main and require status checks to pass before merge The previous image tag is retained in the registry and stored as a 
GitHub environment variable for rollback

Conclusion

A Microsoft Foundry Hosted Agent is not something you deploy once and forget. Prompts evolve, tools change, models are upgraded, and policy requirements shift. Every one of those changes has the potential to alter agent behaviour in ways that affect users, costs, and compliance posture.

GitOps, implemented through GitHub and GitHub Tasks, gives you the operational discipline to manage that complexity. Source control for all artefacts. Pull request review for every change. Automated validation, build, and deployment. Environment promotion gates. A complete audit trail from task to production. These are not bureaucratic overhead; they are the foundation of reliable, trustworthy AI agent operations.

The teams that operate AI agents well are the ones that treat them like production software from the start. The investment in pipeline, structure, and governance pays back every time a change goes smoothly, every time a rollback takes minutes rather than hours, and every time a security or compliance reviewer can answer their question from a pull request history rather than a support ticket. Build the discipline in early. Your future self, and your production environment, will benefit from it.

References

Microsoft Foundry documentation
Microsoft Foundry Agent Service documentation
Microsoft Foundry Toolboxes documentation
Introducing Toolboxes in Foundry (Microsoft Developer Blog)
GitHub Actions documentation
GitHub Projects and Tasks documentation
Azure Container Registry documentation
Azure Key Vault documentation
Microsoft Entra Managed Identities documentation
OpenGitOps Principles

Why do I see many VDI_CLIENT_WORKER sessions in Azure SQL Database, and do they impact performance?
Sometimes you’ll notice many sessions showing the command VDI_CLIENT_WORKER in Azure SQL Database—often around scaling, replica/copy workflows, or internal seeding operations. These sessions can look alarming, especially during a performance investigation, but they are typically internal background workers. This post explains how to recognize them, what’s safe to do (and what isn’t), and how to focus on the real bottlenecks like blocking/deadlocks or log rate throttling when you’re troubleshooting slowness. Why you might see VDI_CLIENT_WORKER sessions in Azure SQL Database The symptom You run a session query (for example, using sys.dm_exec_requests or a monitoring tool) and observe: Many sessions with command text VDI_CLIENT_WORKER They may appear to be “stuck,” persist longer than expected, and can’t be killed Teams may worry these sessions are “the cause” of slowness Why it shows up in Azure SQL In Azure SQL, VDI_CLIENT_* wait types and VDI_CLIENT_WORKER sessions are commonly associated with platform operations that involve copying/seeding—for example: Scaling operations (service objective changes) Geo-replication / copy workflows Replica seeding-like behaviors Important: The presence of these sessions does not automatically mean they are the bottleneck. How to validate whether VDI_CLIENT_WORKER is benign? 1) Correlate to recent platform operations. Ask: did you recently perform (or did the platform perform) one of these? Scale up/down. Creation of replicas / geo-secondary operations. Any database copy-like workflow. If yes, it’s a strong indicator you’re seeing background workers tied to that lifecycle event. 2) Check whether they consume resources. A practical approach: Look for CPU/IO/log pressure at the database level. Compare the timing of slowness reports with spikes in waits/locks/log write percentage. If these sessions show minimal resource consumption and are just “present,” treat them as background noise while you investigate real contention. 
3) Don’t try to kill them! These sessions are typically system/internal. Attempts to kill them may fail or be ineffective, and generally aren’t recommended.

4) If you need them to disappear. In many cases, these internal workers naturally age out. If they remain visible and you need a cleanup path, operational actions like failover/restart may clear stale workers (use change control / maintenance windows as appropriate for your environment). (This is a practical operational observation; always weigh downtime/impact.)

When performance is actually slow: focus on what usually hurts.

In many real-world incidents, the main causes of slowness are:
Blocking chains / deadlocks.
Transaction log rate throttling (LOG_RATE_GOVERNOR) during heavy DML.
Hot queries running concurrently and contending on the same objects.

Key takeaways

Seeing many VDI_CLIENT_WORKER sessions is often expected around platform copy/seeding workflows and doesn’t automatically indicate a bottleneck.
Don’t attempt to kill system/internal workers; instead, validate resource impact and focus on actual bottlenecks.
For real slowness, prioritize diagnosing blocking/deadlocks and LOG_RATE_GOVERNOR-driven DML throttling.

Building an On-Device Voice Assistant with Microsoft Foundry Local
Why on-device voice still matters Most "voice AI" tutorials assume your audio leaves the machine. You ship a WAV to Whisper-API, your transcript to GPT-4, and a synthesized response back over the wire. That works — but it also means three round trips, three per-token bills, and three places your user's voice gets logged. The new wave of small, hardware-optimised models changes the trade-off. NVIDIA's Nemotron Speech Streaming En 0.6B is a 600M-parameter streaming ASR model published into the Microsoft Foundry Local catalog. Paired with a small chat model like qwen2.5-0.5b or phi-4-mini , you can run the entire capture → transcribe → reason → respond loop in-process on a developer laptop, with no API keys and no network egress. This post walks through how the fl-nemotron sample does it, the SDK pitfalls we hit on the way, and the design decisions that made the pipeline reliable. What we're building A browser-hosted assistant served by FastAPI at http://127.0.0.1:8000 . The page captures microphone audio, posts it to /api/transcribe , then streams the chat reply back over Server-Sent Events from /api/chat . All inference runs locally through two Foundry Local models loaded into the same process. The shape of the pipeline: Microphone (browser MediaRecorder) │ WebM/Opus blob ▼ Client-side WAV encoder (16 kHz, mono, PCM-16) │ multipart/form-data ▼ FastAPI /api/transcribe │ ▼ Nemotron Speech Streaming En 0.6B (Foundry Local audio client) │ transcript text ▼ Chat LLM e.g. qwen2.5-0.5b (Foundry Local chat client) │ streamed tokens ▼ FastAPI /api/chat → SSE → browser bubble The version that bit us: foundry-local-sdk >= 1.1.0 Before any code, the single most important fact about this project: The Nemotron Speech Streaming model only appears in the Foundry Local 1.1.x catalog. Older SDKs (0.5.x / 0.6.x) cannot resolve the alias nemotron-speech-streaming-en-0.6b and fail with model not found . 
The module name also changed in 1.1.0: it is now foundry_local_sdk (with the _sdk suffix), not foundry_local . The pip wheel for foundry-local-core is bundled, so there is no separate MSI / winget install to worry about. Pin it explicitly:

    pip install --upgrade "foundry-local-sdk>=1.1.0,<2"

And verify before anything else:

    python -c "import importlib.metadata as m; print('sdk', m.version('foundry-local-sdk'))"
    # expect: sdk 1.1.0 (or a later 1.x release allowed by the pin)

Loading both models from one manager

The 1.1.x SDK exposes a single FoundryLocalManager that owns the runtime. Each loaded model gives you back a per-model OpenAI-compatible client: get_chat_client() for text models and get_audio_client() for ASR. There is no need to bring your own openai Python package; the SDK ships its own thin client. The wrapper used in the repo ( src/foundry_client.py ) does this:

    from foundry_local_sdk import Configuration, FoundryLocalManager

    FoundryLocalManager.initialize(Configuration(app_name="fl-nemotron"))
    manager = FoundryLocalManager.instance

    chat_model = manager.load_model("qwen2.5-0.5b")
    stt_model = manager.load_model("nemotron-speech-streaming-en-0.6b")

    chat_client = chat_model.get_chat_client()
    audio_client = stt_model.get_audio_client()

Both models are downloaded on first use into the Foundry Local cache and stay resident for the lifetime of the process. On a laptop with 16 GB RAM, the combined working set sits comfortably under 4 GB.

The transcription surprise

The first naive approach was the obvious one:

    with open(wav_path, "rb") as f:
        result = audio_client.transcribe(file=f, model="nemotron-speech-streaming-en-0.6b")

That call fails on Nemotron. The bundled ONNX Runtime GenAI in foundry-local-core does not register the nemotron_speech multi-modal model type that the standard AudioClient.transcribe() path tries to instantiate. The error surfaces as a cryptic model-type registration failure deep inside the native runtime.
The fix is to use the streaming session API instead: a different native entry point ( core_interop.start_audio_stream ) that the streaming model does support. The repo isolates this in src/_nemotron_live.py :

    import threading
    import wave


    def transcribe_wav_live(audio_client, wav_path, *, language="en"):
        # Read the raw PCM frames and the format details from the WAV header.
        with wave.open(str(wav_path), "rb") as w:
            sample_rate = w.getframerate()
            channels = w.getnchannels()
            sample_width = w.getsampwidth()
            pcm = w.readframes(w.getnframes())

        session = audio_client.create_live_transcription_session()
        session.settings.sample_rate = sample_rate
        session.settings.channels = channels
        session.settings.bits_per_sample = sample_width * 8
        session.settings.language = language
        session.start()

        # Feed PCM in ~100 ms chunks from a worker thread, then stop.
        bytes_per_sec = sample_rate * channels * sample_width
        chunk_bytes = max(bytes_per_sec // 10, 1024)

        def _pusher():
            try:
                for offset in range(0, len(pcm), chunk_bytes):
                    session.append(pcm[offset:offset + chunk_bytes])
            finally:
                # Always stop, so the reader below sees the end of the stream.
                session.stop()

        threading.Thread(target=_pusher, daemon=True).start()

        # Drain results on the calling thread while the worker pushes audio.
        parts = []
        for resp in session.get_stream():
            for cp in getattr(resp, "content", []) or []:
                text = getattr(cp, "text", "") or getattr(cp, "transcript", "") or ""
                if text:
                    parts.append(text)
        return " ".join(p.strip() for p in parts if p.strip()).strip()

Three things to notice:

Push from a worker thread, read from the calling thread. session.append() is a blocking write into the native stream and session.get_stream() is a blocking generator. Run one in a worker thread so the other can drain in parallel; otherwise you deadlock the session.

Chunk to ~100 ms. Smaller chunks (e.g. 10 ms) spend more time crossing the FFI boundary than transcribing; larger chunks (e.g. 1 s) hold back partial results and hurt perceived latency.

Always session.stop() . Without it the generator never terminates and the request hangs.

The other transcription surprise: browsers don't send WAV

Inside the browser, MediaRecorder defaults to audio/webm; codecs=opus .
That's great for size but bad for our STT model, which expects a 16-bit mono PCM WAV at a known sample rate. Decoding WebM/Opus server-side would require ffmpeg as a runtime dependency — which is exactly the kind of friction this project exists to remove. The cleaner solution is to encode WAV on the client. AudioContext.decodeAudioData already understands WebM/Opus, so the page can decode the recording, resample to 16 kHz, mix to mono, and emit a PCM-16 WAV blob in 30 lines of JavaScript: // Inside src/static/index.html async function webmToWav(blob) { const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 }); const buf = await ctx.decodeAudioData(await blob.arrayBuffer()); // Mix to mono const ch = buf.numberOfChannels; const mono = new Float32Array(buf.length); for (let c = 0; c < ch; c++) { const data = buf.getChannelData(c); for (let i = 0; i < data.length; i++) mono[i] += data[i] / ch; } return encodeWav(mono, 16000); } function encodeWav(samples, sampleRate) { const buffer = new ArrayBuffer(44 + samples.length * 2); const view = new DataView(buffer); // RIFF header writeStr(view, 0, "RIFF"); view.setUint32(4, 36 + samples.length * 2, true); writeStr(view, 8, "WAVE"); // fmt chunk writeStr(view, 12, "fmt "); view.setUint32(16, 16, true); // PCM chunk size view.setUint16(20, 1, true); // PCM format view.setUint16(22, 1, true); // mono view.setUint32(24, sampleRate, true); view.setUint32(28, sampleRate * 2, true); // byte rate view.setUint16(32, 2, true); // block align view.setUint16(34, 16, true); // bits per sample // data chunk writeStr(view, 36, "data"); view.setUint32(40, samples.length * 2, true); // PCM-16 samples let o = 44; for (let i = 0; i < samples.length; i++, o += 2) { const s = Math.max(-1, Math.min(1, samples[i])); view.setInt16(o, s < 0 ? 
s * 0x8000 : s * 0x7FFF, true); } return new Blob([view], { type: "audio/wav" }); } Now the server's /api/transcribe endpoint just writes the bytes to a temp file and hands them to transcribe_wav_live() — no audio decoding libraries on the Python side. Wiring it into FastAPI The server ( src/app.py ) is deliberately small. The notable detail is that the same process holds both Foundry Local model handles for its entire lifetime, so there is no warm-up cost per request: @app.post("/api/transcribe") async def transcribe(audio: UploadFile = File(...)): data = await audio.read() with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f: f.write(data); path = f.name text = _ai_client.transcribe(path) return {"text": text} @app.post("/api/chat") async def chat(req: ChatRequest): if req.stream: return StreamingResponse( _sse(_ai_client.stream_completion(req.messages)), media_type="text/event-stream", ) return {"text": _ai_client.chat_completion(req.messages)} Streaming uses Server-Sent Events because they are trivially supported in both fetch() and the FastAPI runtime, and they don't require a WebSocket upgrade through any proxy a developer might have in front of localhost . What it looks like The repo includes screenshots of the running UI: a welcome screen with both models loaded, a streamed haiku reply, an inline code block with copy-to-clipboard, and the recording state for the microphone. Performance, honestly This is a small-model, CPU-friendly stack. On an Arm64 Surface running the x64 SDK under emulation: First model load (cold cache): tens of seconds — downloads ~600 MB for Nemotron and ~400 MB for qwen2.5-0.5b . Subsequent loads (warm cache): a few seconds per model. End-to-end transcription of a 5-second utterance: well under a second after warm-up. First chat token from qwen2.5-0.5b : typically 200–500 ms; full short reply within 1–2 s. 
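Figures like the ones above depend heavily on hardware, so it is worth measuring them on your own machine. The following is a minimal sketch of one way to do that, not code from the sample: timed , first_token_latency and fake_stream are hypothetical helper names introduced here for illustration, and fake_stream only simulates a streamed chat reply so the block runs without any models loaded.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterable, Iterator, Tuple


@contextmanager
def timed(label: str, results: Dict[str, float]) -> Iterator[None]:
    """Record wall-clock seconds spent in the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def first_token_latency(stream: Iterable[str]) -> Tuple[float, float, str]:
    """Consume a token stream.

    Returns (seconds until the first token, total seconds, joined text).
    If the stream is empty, the first-token time equals the total time.
    """
    start = time.perf_counter()
    first = None
    parts = []
    for token in stream:
        if first is None:
            first = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    return (first if first is not None else total, total, "".join(parts))


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streamed chat reply (assumption, not SDK code)."""
    for token in ["Hello", ", ", "world"]:
        time.sleep(0.01)  # simulate per-token generation delay
        yield token


if __name__ == "__main__":
    results: Dict[str, float] = {}
    with timed("transcribe", results):
        time.sleep(0.02)  # replace with the real transcription call
    ttft, total, text = first_token_latency(fake_stream())
    print(f"transcribe: {results['transcribe']:.3f} s")
    print(f"first token: {ttft:.3f} s, full reply: {total:.3f} s, text: {text!r}")
```

To measure the real pipeline, wrap the transcription call in timed(...) and pass the generator returned by the streaming chat call to first_token_latency in place of fake_stream() . Run each measurement once cold and a few times warm, since the first call includes model load.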
On x64 silicon with a recent CPU the numbers improve substantially, and the SDK will pick the best execution provider it finds (CPU / DirectML / CUDA) for each model. Trade-offs to know about Model quality. qwen2.5-0.5b is a 500M-parameter model. It is fast and small enough to ship on a laptop, but it is not GPT-4. Swap in phi-4-mini or mistral-nemo-12b-instruct if you have the RAM and want better reasoning — the wrapper accepts any chat alias in the Foundry Local catalog. STT is English-only here. The current Nemotron streaming model in the catalog is ...-en-0.6b . Multilingual variants are likely to follow. Browser microphone needs a real browser. Headless / automated browsers (Playwright, Puppeteer) deny getUserMedia by default. Open the page in Edge / Chrome / Firefox to grant the permission and capture audio for real. No agent framework yet. This sample is deliberately a single-turn loop over a chat client — there is no tool calling, planning, or multi-agent orchestration. Adding the Microsoft Agent Framework on top would be a natural next step for richer behaviour. Responsible AI considerations Running locally removes the cloud-egress class of privacy concerns, but it does not remove responsibility: Disclose recording. The browser prompts for mic permission; your UI should make it obvious when capture is active. The sample shows a red ⏹ button and a "Recording…" banner for that reason. Don't log raw audio. The sample writes audio to a per-request NamedTemporaryFile and deletes it after transcription. Treat the WAV as sensitive data even when it never leaves the device. Small models hallucinate. A 0.5B chat model is great for snappy local replies, but unsuitable for high-stakes answers. Pair it with retrieval, ground it on your own data, or escalate to a larger model when accuracy matters. Try it Clone github.com/leestott/fl-nemotron. ./setup.ps1 (or ./setup.sh ) to create a virtualenv and install the pinned SDK. 
python scripts/prefetch.py nemotron-speech-streaming-en-0.6b qwen2.5-0.5b to download both models.
.venv\Scripts\uvicorn.exe app:app --app-dir src --port 8000
Open http://127.0.0.1:8000 in a real browser and click the 🎤 button.

Where to go next

Foundry Local documentation: official docs for the runtime, catalog, and SDK.
microsoft/Foundry-Local: upstream samples and issue tracker.
NVIDIA Nemotron model family: background on the speech and language models being published into the catalog.
leestott/fl-nemotron: the full source for this post.

Key takeaways

Pin foundry-local-sdk >= 1.1.0 . Earlier SDKs cannot see the Nemotron Speech Streaming model.
Use the LiveAudioTranscriptionSession API for Nemotron, not AudioClient.transcribe() .
Encode WAV in the browser. It eliminates a heavy server-side ffmpeg dependency for a few lines of JS.
Push audio chunks on a worker thread and drain the response generator on the main one to avoid deadlocks.
A small Foundry Local chat model plus Nemotron STT gives you a credible local voice loop in a single Python process: no cloud, no keys, no data egress.

Real-World Success Stories with PostgreSQL on Azure
Organizations rarely leap into cloud migrations or AI-powered systems overnight. They progress in deliberate stages, establishing a reliable data foundation, optimizing for performance, and then accelerating innovation. Across healthcare, financial services, and AI startups, companies are navigating this journey on Azure Database for PostgreSQL: a fully managed, enterprise-ready PostgreSQL environment with 58% lower total cost of ownership (TCO) compared to on-premises deployments.

This post walks through real customer stories that span the full arc, from lift-and-shift migration to production-grade AI agent development, illustrating how Azure Database for PostgreSQL supports scalability, performance, security, and AI-readiness at every stage.

Migrating with Confidence: Apollo Hospitals & August AI

Apollo Hospitals operates a network of more than 74 hospitals and needed to move beyond a legacy on-premises Oracle system that had become difficult to manage and couldn't keep pace with growing data volumes. IT teams were spending their time on maintenance rather than innovation.

Apollo migrated its core hospital information system backend to Azure Database for PostgreSQL. Working with partner Quadrant Technologies, the team lifted and shifted critical applications while using Azure DevOps to orchestrate CI/CD pipelines and Azure Application Insights for telemetry and observability.

The results:

99.95% availability across hospital systems
Database transactions executing within 5 seconds
40% reduction in deployment times via modern CI/CD pipelines
Decreased operational overhead, freeing IT staff for higher-value work

With a stable, scalable PostgreSQL backend in place, Apollo is now exploring real-time analytics and AI-enabled tools like Microsoft 365 Copilot to advance patient care.

"We saw Azure Database for PostgreSQL as the right foundation for the future. It's open, cost-effective, and capable of supporting the hospital information system we built in-house."
— Shankar Krishna A., General Manager of IT, Apollo Hospitals

Apollo's experience is not unique. August AI, a healthcare-tech startup offering an AI-driven medical companion, migrated its entire stack to Azure—with Azure Database for PostgreSQL storing mission-critical patient data while meeting strict compliance requirements such as HIPAA. The result: scaling from roughly 500,000 users to 3.5 million+ users worldwide, with zero downtime during the cutover, completed in just three months. As Founder and CEO Anuruddh Mishra noted: "We receive a log of queries that are not performing optimally, and within a couple of minutes we can optimize that query with PostgreSQL on Azure and move on."

Modernizing at Scale: Nasdaq

Migration is often the first step. Nasdaq demonstrates what becomes possible when organizations modernize their architecture on a scalable data foundation.

To improve its Nasdaq Boardvantage platform—used by corporate boards to collaborate on governance documents—Nasdaq re-architected on Azure. The team containerized services with Azure Kubernetes Service (AKS) and adopted Azure Database for PostgreSQL alongside Azure Database for MySQL as persistent data stores for governance workloads. This architecture provided the flexibility, performance, and security required for a multitenant platform handling sensitive board materials.

With the data layer in place, Nasdaq integrated Microsoft Foundry and Azure OpenAI to deliver AI-powered summarization and workflow automation. The measurable outcomes:

60% reduction in reading time through AI-powered document summarization
25% decrease in administrative preparation time across board workflows
Up to 97% accuracy in AI-generated summaries and meeting minutes
A reusable AI framework established for future extensibility

"Both Azure Database for PostgreSQL and Azure Database for MySQL gave us the right balance of performance, security, and control.
The governance workloads we handle are unique, so we needed something that could meet those isolation and encryption requirements."
— Scott Ellison, Vice President of Technology, Nasdaq

Building Intelligent Applications: SubgenAI and OpenAI

Azure Database for PostgreSQL now supports native vector search via pgvector, high-performance DiskANN indexing, semantic operators and AI model management, and integrated graph capabilities for relationship reasoning—making it a production-ready foundation for intelligent applications.

SubgenAI, a European generative AI company, built its flagship platform Serenity Star on Azure Database for PostgreSQL and Microsoft Foundry to transform AI agent development from a code-heavy, fragmented process into a streamlined, no-code experience. A core technical requirement: the platform's retrieval-augmented generation (RAG) system needs efficient vector search against embedded content while maintaining enterprise-grade reliability. After evaluating several database options, SubgenAI chose Azure Database for PostgreSQL with pgvector for its accurate and scalable vector similarity search.

Serenity Star customers can now:

Launch AI agents in as little as 15 minutes
Cut coding and development time by 50%
Resolve most AI agent queries in under 60 seconds

"With Microsoft and Azure Database for PostgreSQL we have total control and an environment that is truly dynamic and can adapt to the evolution we're looking for."
— Julia Schröder Langhaeuser, VP of Product Serenity Star, SubgenAI

At the extreme end of scale, OpenAI runs PostgreSQL on Azure to support production systems behind ChatGPT. As write scalability limits emerged on an initially unsharded single primary instance, OpenAI offloaded write-heavy operations to other systems and optimized read workloads using PgBouncer for connection pooling.
The Azure Database for PostgreSQL team responded by developing the elastic clusters feature, enabling horizontal scaling through row-based and schema-based sharding. The team reduced connection latency from approximately 50 ms to under 5 ms, scaled reads horizontally with multiple replicas, and improved reliability by prioritizing critical requests—all achieved by a small team making systematic optimizations on open-source PostgreSQL.

"After all the optimization we did, we are super happy with Postgres right now for our read-heavy workloads. It's really scalable and reliable."
— Bohan Zhang, Member of the Technical Staff, OpenAI

Meeting You Where You Are

Beyond these stories, organizations like BMW Group (cloud-native applications at global scale), Ahold Delhaize (highly available retail applications), Mott MacDonald (an AI agent accelerating onboarding and spreading best practices across 220,000 employees), and Multitude (scaling responsibly in regulated environments) all run on Azure Database for PostgreSQL.

The service offers 99.99% availability with automatic failover and SLA, independent compute and storage scaling, and intelligent performance recommendations, available across 60+ Azure regions. Developer tooling including the PostgreSQL extension for Visual Studio Code with GitHub Copilot further accelerates productivity.

Whether you are planning your first migration or building production AI agents, these stories share a clear signal: Azure Database for PostgreSQL delivers a scalable, secure, AI-ready data foundation at every stage of growth.

Explore full customer stories in depth in the eBook: Customer Success Stories with Azure Database for PostgreSQL.