performance
576 TopicsLog Insights in Minutes: A Simpler pgBadger Workflow
Sometimes the fastest way to understand a PostgreSQL workload is not another dashboard. It is a good log report. pgBadger is a PostgreSQL log analysis tool that turns raw PostgreSQL logs into an interactive HTML report. It helps summarize query activity, connection patterns, errors, temporary files, lock waits, autovacuum activity, and more. Earlier guidance for generating pgBadger reports from Azure Database for PostgreSQL Flexible Server focused on exporting logs through Diagnostic Settings, storing them in a storage account, and then using tools such as BlobFuse and jq to extract PostgreSQL log lines from JSON files. That workflow is still useful when customers centralize logs across multiple servers. However, if you are already using the Server logs feature in Azure Database for PostgreSQL Flexible Server, there is a much simpler path. In this post: You’ll learn how to generate a pgBadger HTML report from Azure Database for PostgreSQL Flexible Server by downloading native PostgreSQL .log files directly from the Azure portal. No storage account, BlobFuse mount, or JSON extraction required. Fast path Configure log_line_prefix . Enable Server logs for download. Download the PostgreSQL .log files. Run pgBadger with the matching prefix. Open pgbadger-report.html . Why use this workflow? With Server logs, you can download native PostgreSQL .log files directly from the Azure portal and run pgBadger locally. Older path Simpler path in this blog Diagnostic Settings → Storage account → BlobFuse → JSON extraction → pgBadger Server logs → Download .log files → pgBadger Area Older Diagnostic Settings workflow Server logs workflow Export path Diagnostic Settings to storage account Download .log files directly from the portal Format JSON payloads need extraction Native PostgreSQL .log files Extra tooling BlobFuse and jq JSON parsing None Best suited for Centralized or multi-server logging Quick per-server analysis Outcome Flexible, but more setup Faster path to pgBadger Recommended: Use the Server logs workflow when you want a fast, low-friction way to generate a pgBadger report from one Azure Database for PostgreSQL Flexible Server. When should you use this workflow? Use this workflow when... Use Diagnostic Settings when... You need a quick report for one Flexible Server. You centralize logs from many servers. You want to run pgBadger locally. You need long-term retention or workspace-level querying. You want to avoid JSON extraction. You already have automated log export pipelines. Before you start A machine where you can install or run pgBadger. A working Perl runtime. Git Bash on Windows, so the multi-line shell commands work as shown. Portal access to your Azure Database for PostgreSQL Flexible Server. Permission to update server parameters and enable Server logs. Important: pgBadger can only analyze what PostgreSQL logs capture. To populate query timing and slow-query sections in the report, enable log_min_duration_statement before collecting logs. Logs collected before that change will not include duration data. Workflow overview Task Type Rough effort Install or prepare pgBadger One-time setup per analysis machine 5–10 minutes Configure log_line_prefix One-time setup per server 2–3 minutes Enable Server logs One-time setup per server 2–3 minutes Download logs and run pgBadger Repeatable 2–5 minutes Install or prepare pgBadger on the machine where you will analyze logs. Configure log_line_prefix so pgBadger can parse each log line. Enable Server logs, so PostgreSQL logs are available for download. Download the logs and run pgBadger locally. 💡Pro tip: Start with a narrow log window first. Use one or two hourly log files, confirm the report looks right, and then expand the analysis window if needed. Step 1: Install pgBadger Before generating a report, you need pgBadger available on the machine where you plan to analyze the downloaded PostgreSQL log files. Run this on a Linux VM, WSL, or another Linux-based environment where you can install packages. Note: Azure Cloud Shell may work for quick testing, but package installation and build-tool availability can vary by session. For repeatable analysis, use a Linux VM, WSL, or another environment you control. Copy and run sudo apt-get update && sudo apt-get install -y git perl make gcc && \ git clone https://github.com/darold/pgbadger.git && \ cd pgbadger && \ perl Makefile.PL && \ make && \ sudo make install && \ pgbadger -V What good looks like: The install command completes successfully and pgbadger -V returns the installed pgBadger version. Step 2: Configure log_line_prefix This is a one-time server configuration step. The log_line_prefix parameter controls the beginning of each PostgreSQL log line. pgBadger uses this prefix to extract useful fields such as timestamp, user, database, and process ID. In the Azure portal, open your Flexible Server and go to Server parameters. Search for: Parameter log_line_prefix Set this value %m user=%u db=%d pid=%p: Then select Save. In Server parameters, confirm that the custom value is saved for log_line_prefix . Figure 1: Set log_line_prefix so pgBadger can correctly parse timestamp, user, database, and process ID from each log line. Prefix tokens Token Meaning %m Timestamp with milliseconds %u Username %d Database name %p Process ID After this change, log lines should look like this: Example log line 2026-06-22 19:00:00.070 UTC user=pgadmin db=highcpu pid=3805603: LOG: statement: SELECT 1 FROM pg_extension WHERE extname='pg_stat_statements' The matching pgBadger prefix for this log format is: Matching pgBadger prefix %m user=%u db=%d pid=%p: You will use this same value later in the pgBadger command. What good looks like: The server parameter is saved, and new PostgreSQL log lines begin with timestamp, user, database, and process ID fields that match the pgBadger prefix. Step 3: Enable Server logs for download This is also a one-time setup step. In the Azure portal, open your Flexible Server and go to Server logs. Enable: Portal setting Capture logs for download Set the retention period based on how long you want logs to remain available for download. For example, a 7-day retention period keeps logs available for download for 7 days. In Server logs, enable Capture logs for download and choose the retention window. Figure 2: Enable Capture logs for download and set a retention period long enough to cover the analysis window you want to inspect. What good looks like: After Server logs are enabled, hourly PostgreSQL log files appear in the Server logs blade and can be downloaded from the Azure portal. Once enabled, hourly log files appear in the Server logs blade. The files are named by date and hour, for example: Example log files postgresql_2026_06_22_19_00_00.log postgresql_2026_06_22_20_00_00.log Step 4: Download and organize the logs locally From the Server logs page, select the .log files for the time window you want to analyze and download them. For example, to analyze activity between 19:00 and 21:00 UTC, download: Example files to download postgresql_2026_06_22_19_00_00.log postgresql_2026_06_22_20_00_00.log On your local machine, create a folder for that analysis window. A simple convention is to use the Mon-DD format. Folder name Jun-22 Place the downloaded .log files inside that folder. Your local folder structure should look like this: Folder structure pgbadger-13.1/ pgbadger Jun-22/ postgresql_2026_06_22_19_00_00.log postgresql_2026_06_22_20_00_00.log Step 5: Generate the pgBadger report Open Git Bash from the folder where pgBadger is located. For example, if pgBadger is inside the pgbadger-13.1 folder, open Git Bash from that folder. # Action Command 1 Set the folder FOLDER=Jun-22 2 Confirm files ls -lh ./$FOLDER 3 Run pgBadger Use the full command below. Copy and run FOLDER=Jun-22 ls -lh ./$FOLDER perl -X ./pgbadger -f stderr \ --prefix '%m user=%u db=%d pid=%p:' \ ./$FOLDER/*.log \ -o ./$FOLDER/pgbadger-report.html Command breakdown Part of command Purpose perl -X ./pgbadger Runs pgBadger and suppresses non-critical Perl warnings. -f stderr Parses PostgreSQL stderr log files. --prefix '%m user=%u db=%d pid=%p:' Matches the log_line_prefix set on the server. ./$FOLDER/*.log Analyzes every .log file in the selected folder. -o ./$FOLDER/pgbadger-report.html Writes the HTML report into the same folder. When the command completes successfully, you should see output like this: Expected output Parsed 12134249 bytes of 12134249 (100.00%), queries: 26684, events: 83 LOG: Ok, generating html report... What good looks like: pgBadger finishes parsing the logs and creates pgbadger-report.html in the selected folder. Step 6: Open the report Open the generated report: Copy and run start ./$FOLDER/pgbadger-report.html The report opens in your default browser. The final report is created here: Generated report path Jun-22/pgbadger-report.html What the report can show The pgBadger report gives you a quick view into the workload shape for the selected log window. For example, in a sample run across two hourly log files, pgBadger summarized: Total number of queries. Number of unique normalized queries. Query traffic over time. Events such as errors and fatal messages. Session and connection patterns. Once the report opens, start with Global Stats to confirm the time range, total queries, normalized queries, and query peak. Figure 3: Start with Global Stats to validate the selected time range, total query count, normalized query count, and query peak. Query volume and normalized queries Many raw queries can often reduce to a smaller number of normalized query patterns. This helps identify whether the workload is spread across many different query shapes or dominated by a smaller set of repeated statements. Example: In this sample run, 26,684 queries reduced to 59 normalized query shapes. That suggests the workload is mostly a small set of repeated statements, which can help focus tuning effort. Traffic patterns The SQL Traffic section helps identify spikes, quiet periods, and workload changes over time. Figure 4: Use SQL Traffic to identify query spikes, quiet periods, and workload changes during the selected log window. Figure 5: Review the query breakdown to compare read vs. write volume and query-type distribution for the selected Server logs window. For example, if the report shows a steady baseline followed by a sharp spike, that spike can be correlated with application activity, batch jobs, synthetic tests, or operational events during the same time window. Query duration If query duration shows 0 ms or the slow query sections are empty, it usually means duration logging was not enabled when the logs were collected. In that case, pgBadger can still show query counts and events, but it cannot calculate the slowest queries, total execution time, average duration, or maximum duration. To unlock those timing sections, enable log_min_duration_statement , collect fresh logs, and rerun pgBadger. What pgBadger cannot infer from missing logs pgBadger reports are only as complete as the log data you provide. If PostgreSQL did not log duration, lock waits, temporary files, or autovacuum activity during the selected time window, pgBadger cannot reconstruct those details later. To analyze... Enable before collecting logs Slow queries log_min_duration_statement Lock waits log_lock_waits Temporary files log_temp_files Autovacuum activity log_autovacuum_min_duration Repeatable copy/paste block Reusable command block Change only FOLDER for each new analysis window. Copy and run FOLDER=Jun-22 ls -lh ./$FOLDER perl -X ./pgbadger -f stderr \ --prefix '%m user=%u db=%d pid=%p:' \ ./$FOLDER/*.log \ -o ./$FOLDER/pgbadger-report.html start ./$FOLDER/pgbadger-report.html For another date, change only this line: Update this value FOLDER=Jun-22 Examples: Example folder values FOLDER=Jun-23 FOLDER=Jul-01 FOLDER=Aug-15 Optional: Improve report quality pgBadger can only analyze the information captured in PostgreSQL logs. The default logs may be enough for query frequency, connection activity, and errors. For deeper performance troubleshooting, consider enabling additional logging parameters based on your scenario. Scenario Parameter Suggested value Notes Slow query analysis log_min_duration_statement 1000 Logs statements slower than 1 second. Short controlled test log_min_duration_statement 0 Logs every statement. Use carefully. Lock troubleshooting log_lock_waits on Helps identify lock waits. Temporary file analysis log_temp_files 0 Logs all temporary files. Autovacuum visibility log_autovacuum_min_duration 0 Useful during focused analysis. Useful parameters include: Recommended logging parameters log_lock_waits = on log_temp_files = 0 log_autovacuum_min_duration = 0 To capture query durations, configure: Duration logging log_min_duration_statement = 1000 This logs statements that run longer than 1000 milliseconds. For short test runs, you can temporarily use: Short test run only log_min_duration_statement = 0 Caution: Use log_min_duration_statement = 0 carefully on busy production servers. It logs every statement and can generate a large volume of logs. Duration matters: If duration logging is not enabled, pgBadger can still show query counts and events, but slowest-query, total duration, average duration, and maximum duration sections will be limited or empty. Common mistakes and quick fixes Symptom Likely cause Fix Report is empty Prefix mismatch Match --prefix with log_line_prefix . No duration data Duration logging was not enabled Set log_min_duration_statement before collecting logs. No files visible Server logs disabled or retention expired Enable capture and check retention. pgBadger command fails pgBadger is not in the current folder or path Run pgbadger -V to confirm installation. Common troubleshooting FAQs 1. Report is created but empty This usually means the pgBadger prefix did not match the actual log format. Check the first few lines: Copy and run head -5 ./$FOLDER/*.log Make sure the pgBadger --prefix matches the server’s log_line_prefix . 2. Report shows queries but no duration PostgreSQL logged statements but did not log durations. Enable one of the following, collect fresh logs, and rerun pgBadger: Parameter options log_min_duration_statement = 1000 # or temporarily for testing log_min_duration_statement = 0 3. No .log files are visible Confirm that Server logs are enabled: Portal setting Capture logs for download Also check the retention period. If the retention period has expired, older logs may no longer be available for download. 4. pgBadger command fails Confirm that pgBadger is available in the current folder or installed in your path. Copy and run pgbadger -V If you are running pgBadger from the local folder, use: Copy and run perl -X ./pgbadger Summary For customers already using Azure Database for PostgreSQL Flexible Server logs, the pgBadger workflow is straightforward: Install pgBadger. Configure log_line_prefix . Enable Server logs for download. Download the .log files. Place them in a local date-based folder. Run pgBadger with the matching prefix. Open pgbadger-report.html . Bottom line: Server logs give you the shortest path from Azure Database for PostgreSQL Flexible Server logs to a pgBadger report. Download the native .log files, run pgBadger with the matching prefix, and open the generated HTML report. References pgBadger - source and documentation GitHub pgBadger - project site Azure - Download server logs from the portal Flexible Server Azure - Logging concepts Flexible Server Azure - Configure server parameters via the portal PostgreSQL - log_line_prefix and logging parameters277Views2likes0CommentsGitHub Copilot App - Canvas Is Not a UI Builder
What if your development environment didn't just help you write code, but helped you observe, steer, and evolve a living system while it runs? That's the shift GitHub Copilot App Canvas represents. Canvas redefines how developers interact with agent-driven software: not by building traditional user interfaces, but by creating interactive environments where humans and AI co-create, test, and iterate in real time. This post walks through a real Canvas extension we built, a Multi-Agent Dev Canvas that demonstrates how Canvas becomes a runtime observability and control plane for an agent-driven system. We'll cover why Canvas exists, how it differs from traditional UI development, and how you can use it to accelerate the design-test-evolve loop for any multi-agent application. The Misconception: "Canvas Is for Building UIs" The first instinct many developers have when they see Canvas is to treat it like a UI framework, a place to build dashboards, boards, or user-facing applications. That's not what Canvas is for. Here's the distinction that matters: Traditional UIs are for using software. They serve end-users who interact with a finished product. Canvas is for shaping software while it runs. It serves developers and AI agents who are actively building, testing, and evolving a system. Canvas solves problems your final UI should never try to solve in a visible way. It's the observability layer, the control plane, the validation surface — all the things you need during development that disappear before production. Think of it this way: you wouldn't ship your debugger to users, but you absolutely need it while building. What We Built: A Multi-Agent Dev Canvas To demonstrate Canvas as a development runtime, we built a Multi-Agent Dev Canvas, a standalone GitHub Copilot Canvas extension (this repo, copilot-canvas-runtime) that treats an entire multi-agent system as a living, observable environment. The same pattern applies to any agent-driven system built on services such as Microsoft Foundry. The Multi-Agent Dev Canvas: a runtime observability and control plane where developers and AI agents collaborate to design, test, and evolve an agent-driven system in real time. The canvas provides four integrated panels: System View: See Your Agents Working Five specialised agents are displayed as live cards with real-time status indicators. Each card shows the agent's name, responsibility, current status (idle, running, done, or error), task count, and last action taken. When an agent is active, its card pulses blue. When it fails, it glows red. You see the system breathe. decompose_system — Breaks requirements into agent tasks execute_workflow — Coordinates agents to perform tasks validate_output — Runs evaluation tests and returns structured results update_system_design — Modifies architecture based on feedback track_state — Persists and updates system state over time Task Flows: Watch Work Move Through the Pipeline Below the agents, a flow graph visualises how tasks route between agents. When you decompose a system requirement like "Build an AI-powered code review agent," the canvas shows five components (pr-ingestion, code-analysis, feedback-generator, learning-loop, notification-service) flowing from the decomposer to the executor and designer agents. Each flow carries a status badge, pending, pass, or fail. Validation Panel: Continuous Testing, Not Afterthought Testing The validation panel displays structured test results with pass/fail badges and reasoning. When you run validation, each test case evaluates against specific criteria: ✅ "PR ingestion handles large diffs" — Meets criteria: process diffs over 5,000 lines without timeout ❌ "Feedback is actionable" — Failed: does not satisfy criteria that each suggestion includes a code fix ✅ "Learning loop converges" — Meets criteria: accept rate improves over 10 iterations ✅ "Notifications are non-blocking" — Meets criteria: delivery latency under 500ms This isn't a test runner you invoke separately, it's a validation surface embedded in the development loop. You see failures the moment they happen, in context, alongside the agents and flows that produced them. Live State Timeline: Every Mutation, Visible The right panel tracks every state change with timestamps. Decomposition events, workflow executions, validation runs, failure injections — all appear chronologically. This is the system's memory, visible to both the human developer and the AI agents working alongside them. Canvas as a Runtime: The Key Capabilities What makes Canvas a runtime rather than a display layer is that the agent can act through it. The canvas exposes seven agent-callable actions: Action What It Does decompose_system Accept requirements and components, generate task flows, update the system design execute_workflow Run pending tasks through the agent pipeline, produce artifacts validate_output Evaluate test cases against criteria, return structured pass/fail with reasoning update_system_design Modify the architecture description, constraints, or component list live track_state Read the full system state — agents, flows, validations, history, artifacts inject_failure Force an agent into an error state to test system adaptation pause_resume Toggle execution on and off The human developer can click Decompose, Execute, or Validate directly in the canvas. The AI agent can invoke the same actions programmatically. Both parties operate on the same surface, the same state, the same system, that's what makes Canvas collaborative in a way traditional tooling is not. Why This Matters: Canvas vs. Figma vs. Traditional UIs It helps to position Canvas against tools developers already know: Figma is Human-to-Human collaboration on design. Multiple people interact with the same visual surface, but nothing executes. It's a design tool. Traditional UIs are Human-to-System. Users interact with finished software through a polished interface. Canvas is Human-to-AI-to-System. It's a shared space where things actually execute. The developer steers, the AI acts, and the system evolves, all visible, all in real time. Canvas is collaborative in the Figma sense — it's a shared space, it's visual, multiple participants interact with the same surface. But unlike Figma, the participants include AI agents, and the surface isn't a mockup — it's a live system. How the Extension Works: Under the Hood A Canvas extension is a standard GitHub Copilot CLI extension, a single extension.mjs file that speaks JSON-RPC over stdio. The key components: 1. State Management Each canvas instance maintains its own system state: agents, task flows, validations, a state history timeline, artifacts, and the current system design. State is held in-memory per instance and pushed to the iframe via Server-Sent Events whenever it changes. function createInitialState() { return { agents: [ { id: "decomposer", name: "decompose_system", status: "idle", responsibility: "Break requirements into agent tasks" }, { id: "executor", name: "execute_workflow", status: "idle", responsibility: "Coordinate agents to perform tasks" }, // ... three more agents ], taskFlows: [], validations: [], stateHistory: [], artifacts: [], systemDesign: { description: "", constraints: [], components: [] }, execution: { paused: false, stepCount: 0 }, }; } 2. Real-Time Updates via Server-Sent Events The canvas runs a loopback HTTP server per instance. The iframe connects to an /events endpoint and receives state updates as they happen — no polling, no websocket complexity. if (req.url === "/events") { res.writeHead(200, { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" }); clients.add(res); // Push current state immediately on connect res.write(`data: ${JSON.stringify(getState(instanceId))}\n\n`); } 3. Dual Interaction Model Every action is available through two paths. The human clicks a button in the iframe, which POSTs to the local server. The AI agent calls invoke_canvas_action through the SDK. Both paths mutate the same state and trigger the same SSE broadcast. Neither is privileged over the other. 4. Canvas Declaration The canvas registers with the Copilot SDK using createCanvas , declaring its identity, description, and all agent-callable actions with JSON Schema validation on inputs: createCanvas({ id: "multi-agent-dev", displayName: "Multi-Agent Dev Canvas", description: "Runtime observability and control plane for multi-agent development", actions: [ { name: "decompose_system", description: "Break requirements into agent tasks", inputSchema: { type: "object", properties: { requirements: { type: "string" }, components: { type: "array", items: { type: "string" } } }, required: ["requirements"] }, handler: async (ctx) => { /* ... */ }, }, // ... six more actions ], open: async (ctx) => { /* start server, return URL */ }, onClose: async (ctx) => { /* clean up */ }, }); Scenarios This Enables The Multi-Agent Dev Canvas supports four development scenarios that would be impossible with traditional tooling: 1. End-to-End Feature Design Tell the agent "Build an AI-powered code review system." Watch it decompose the requirement into five components, route tasks to specialist agents, execute the workflow, and validate the outputs, all visible in real time. Iterate by modifying constraints or components and re-running. 2. Live Agent Collaboration Observation See how agents hand off work to each other. The flow graph shows which agent produced what, which tasks are pending, and where bottlenecks form. This is the kind of observability you need when debugging multi-agent orchestration but would never expose in a production UI. 3. Fault Injection and Adaptation Testing Use inject_failure to force an agent into an error state. Watch how the system responds. Does the orchestrator recover? Do downstream tasks fail gracefully? This chaos-engineering approach, applied during development, visible in real time, catches integration failures before they reach production. 4. Validation-Driven Iteration Define test criteria, run validation, see which tests fail, update the system design, re-run. The validation panel isn't a separate CI pipeline, it's embedded in the development surface, creating a continuous feedback loop between design decisions and their measurable outcomes. Getting Started: Build Your Own Canvas Extension To create a Canvas extension in your own project: Read the SDK docs — Run extensions_manage({ operation: "guide" }) in GitHub Copilot CLI to get the canonical documentation paths. Scaffold — Run extensions_manage({ operation: "scaffold", kind: "canvas", name: "my-canvas", location: "project" }) to generate the boilerplate. Implement — Edit extension.mjs with your canvas logic: state model, actions, renderer HTML, and SSE updates. Reload — Run extensions_reload to activate your changes. Drive — Open with open_canvas , invoke actions with invoke_canvas_action , and iterate. The canvas extension lives in .github/extensions/your-canvas/extension.mjs for project-scoped extensions, or in your user extensions directory for personal use. No package.json needed, the github/copilot-sdk import is auto-resolved. Key Takeaways Canvas is a development runtime, not a UI framework. You don't build Canvas instead of your UI, you use Canvas to figure out, test, and evolve the UI and system before and during building it. Canvas solves problems your final UI should never expose. Agent observability, fault injection, live state mutation, validation feedback loops, these are development concerns, not user concerns. Canvas is Human-to-AI-to-System collaboration. Both the developer and the AI agent operate on the same surface, the same state, the same running system. It's Figma-like collaboration, but with AI agents, and things actually execute. Canvas turns debugging, testing, and execution into a continuous visual feedback loop. Instead of switching between an editor, a terminal, a test runner, and a monitoring dashboard, you have one surface where the system lives and evolves. Canvas extensions are lightweight. A single extension.mjs file, no dependencies, loopback HTTP server with SSE, the infrastructure gets out of the way so you can focus on the system you're building. The Bigger Picture Canvas redefines software development by shifting from writing static code to orchestrating living systems. Developers and AI co-create, observe, and evolve solutions in real time. Instead of building UIs for users, we build interactive environments for agents, turning debugging, testing, and execution into a continuous, visual feedback loop that accelerates innovation and brings ideas to production faster than ever. The Multi-Agent Dev Canvas we built here is one example. The pattern applies anywhere you're building agent-driven systems: AI orchestration, workflow automation, data pipelines, autonomous services. Anywhere you need to see, steer, and validate a complex system as it runs, that's where Canvas belongs. Resources copilot-canvas-runtime — this repository: the Multi-Agent Dev Canvas extension, scenario, and demo prompt GitHub Copilot Documentation — Official documentation for GitHub Copilot features Microsoft Foundry Documentation — Build and deploy AI agents with Microsoft FoundryGeneric Best Practices for HikariCP with Azure Database for PostgreSQL
Author: Mohamed Baioumy Technology: Azure Database for PostgreSQL (Flexible Server & Single Server) Category: Connectivity | Performance | Application Design Introduction Connection pooling is a critical component of application performance when connecting to Azure Database for PostgreSQL. Creating a new PostgreSQL connection is an expensive operation that consumes CPU, memory, and networking resources. Reusing existing connections through a connection pool significantly reduces connection latency, improves throughput, and helps applications scale more efficiently. Many Java applications use HikariCP, one of the most popular high-performance JDBC connection pools. While HikariCP provides excellent performance out of the box, improperly configured connection pool settings can lead to issues such as: Connection pool exhaustion Stale or invalid connections Increased connection acquisition latency Excessive connection creation and destruction Database resource contention Application timeouts This article summarizes generic guidance and best practices for configuring HikariCP when working with Azure Database for PostgreSQL Flexible Server and Azure Database for PostgreSQL Single Server. Understanding Key HikariCP Parameters 1. Maximum Lifetime (maxLifetime) The maxLifetime property controls how long a connection can remain in the pool before HikariCP retires it and creates a new one. Why It Matters Connections can become stale over time due to: Network interruptions Infrastructure updates Connection state changes TCP idle behavior Recycling connections periodically helps prevent applications from using long-lived connections that may no longer be healthy. Recommended Practice Avoid configuring the value too low. When maxLifetime is set aggressively, HikariCP continuously destroys and recreates connections, resulting in: Additional authentication overhead Increased connection establishment latency Higher CPU utilization Reduced application throughput A reasonable starting point is: spring.datasource.hikari.maxLifetime=1800000 30 minutes (1,800,000 ms) is commonly used and aligns well with many production workloads. Depending on workload characteristics, values between 30 minutes and 1 hour are generally suitable Avoid maxLifetime=300000 (5 minutes) This often causes unnecessary connection churn without providing additional benefits. 2. Minimum Idle Connections (minimumIdle) The minimumIdle setting defines how many idle connections HikariCP should keep ready for immediate use. Why It Matters A pool with available idle connections can serve application requests immediately without waiting for new connections to be established. However, maintaining too many idle connections consumes unnecessary database resources. Recommended Practice For most workloads: minimumIdle = maximumPoolSize Or minimumIdle slightly lower than maximumPoolSize This ensures sufficient connections are already available during traffic spikes while avoiding excessive connection creation delays. Example maximumPoolSize=20 minimumIdle=15 Avoid maximumPoolSize=20 minimumIdle=20 only when the application experiences long periods of inactivity and conserving resources is more important than immediate responsiveness. 3. Idle Timeout (idleTimeout) The idleTimeout property determines how long an unused connection remains in the pool before being removed. Why It Matters Connections that sit idle for extended periods consume resources on both: The application server Azure Database for PostgreSQL However, removing idle connections too quickly causes the application to repeatedly establish new connections. Recommended Practice Keep the default value unless there is a specific requirement. spring.datasource.hikari.idleTimeout=600000 which equals: 10 minutes (600,000 ms) This setting provides a good balance between resource utilization and responsiveness. [Re: EXT: R...0040002947 | Outlook] The timeout should also be comfortably longer than any expected short application idle periods. Avoid idleTimeout=10000 (10 seconds) Such aggressive settings often result in unnecessary connection creation cycles. 4. Maximum Pool Size (maximumPoolSize) This parameter determines the maximum number of concurrent database connections the application can maintain. Why It Matters This is often the most important HikariCP setting. If the Pool Is Too Small Applications may experience: Connection is not available, request timed out because all available connections are already in use. Similar scenarios have been observed during customer investigations involving Hikari pool exhaustion. If the Pool Is Too Large Applications can overwhelm the database server with excessive concurrent sessions, resulting in: Connection contention Increased context switching Higher memory consumption Reduced overall performance Recommended Practice Pool size should be based on: Database compute configuration CPU core count Query execution duration Application concurrency requirements Workload characteristics There is no universal value that fits every workload. Start conservatively: maximumPoolSize=10 or maximumPoolSize=20 maximumPoolSize=20 and increase only after load testing demonstrates a need for additional concurrency. Fixed-Size Pool Recommendation For many production workloads, a fixed-size pool provides the simplest and most predictable behavior. Configure: maximumPoolSize=20 minimumIdle=20 or omit minimumIdle entirely so it defaults to maximumPoolSize. HikariCP commonly recommends maintaining a fixed-size pool for responsiveness during demand spikes. Benefits Faster connection acquisition Predictable performance Reduced connection creation latency Better handling of traffic spikes When using a small fixed-size pool, there is often little need to aggressively tune: minimumIdle idleTimeout Instead, simply recycle connections using: maxLifetime maxLifetime Additional Recommendations Enable TCP Keepalive One common cause of stale connections is network devices silently dropping inactive TCP sessions. For PostgreSQL applications, consider enabling TCP keepalive: tcpKeepAlive=true tcpKeepAlive=true The HikariCP project specifically recommends enabling TCP keepalive to prevent rare situations where pools can lose valid connections. Monitor Connection Usage Track: Active connections Idle connections Connection acquisition time Pool exhaustion events Database connection counts These metrics help identify whether pool sizing is appropriate. Investigate Long-Running Queries Connection pool problems are often symptoms rather than root causes. A frequent scenario is: A query becomes slow. Connections remain occupied longer. The pool becomes exhausted. Applications start timing out. When analyzing HikariCP issues, always review: Query performance Blocking situations Database resource utilization Application connection handling logic Sample Production Configuration spring.datasource.hikari.maximumPoolSize=20 spring.datasource.hikari.minimumIdle=15 spring.datasource.hikari.maxLifetime=1800000 spring.datasource.hikari.idleTimeout=600000 spring.datasource.hikari.connectionTimeout=30000 spring.datasource.hikari.keepaliveTime=60000 spring.datasource.hikari.maximumPoolSize=20 spring.datasource.hikari.minimumIdle=15 spring.datasource.hikari.maxLifetime=1800000 spring.datasource.hikari.idleTimeout=600000 spring.datasource.hikari.connectionTimeout=30000 spring.datasource.hikari.keepaliveTime=60000 This configuration provides a solid starting point for many Azure Database for PostgreSQL workloads and can be adjusted based on application-specific requirements. a { text-decoration: none; color: #464feb; } tr th, tr td { border: 1px solid #e6e6e6; } tr th { background-color: #f5f5f5; } Conclusion HikariCP is extremely efficient when configured appropriately. The goal is not to maximize the number of connections, but rather to maintain a healthy balance between application responsiveness and database resource consumption. As a general rule: Use a reasonable maxLifetime (30–60 minutes) Keep enough idle connections available for traffic spikes Avoid aggressive idleTimeout values Size the pool based on workload characteristics, not guesses Consider fixed-size pools for predictable performance Monitor connection usage and query performance regularly By following these practices, applications connecting to Azure Database for PostgreSQL can achieve improved scalability, lower latency, and more reliable connectivity. References Connection pooling best practices - Azure Database for PostgreSQL Performance best practices for using Azure Database for PostgreSQL – Connection Pooling HikariCP Documentation and Pool Sizing Guidance101Views0likes0CommentsLessons Learned #541:Automatic Plan Correction vs External Tables: A Practical Lesson from the Field
Automatic Plan Correction is one of the most useful capabilities in Azure SQL Database when dealing with plan regressions. It uses Query Store to identify when a query starts using a worse execution plan and, when appropriate, forces the last known good plan. However, during a recent troubleshooting scenario, I found that not all queries have the same execution characteristics. In particular, queries that reference external tables may behave differently from fully local queries because part of their execution depends on remote data access. When Query Store is configured to capture all queries, we can use it to identify queries that reference external tables and review whether those query IDs should participate in FORCE_LAST_GOOD_PLAN. From a practical perspective, external-table queries may not always be the best candidates for Automatic Plan Correction, especially when the expected benefit of automatic plan forcing is not clear. For that reason, the goal of this article is simple: identify queries that reference external tables and, when appropriate, exclude selected query IDs from Automatic Plan Correction. If we review the execution plan for the following query: DECLARE @Region nvarchar(50) = N'EMEA' SELECT CustomerId, CustomerName, Region FROM dbo.ExternalCustomers WHERE Region = @Region; We can see that the plan includes a Remote Query operator. This means that the query is not only accessing local data; part of the execution depends on remote data access through the external table. For this type of query, Automatic Plan Correction may not provide the same clear benefit as it does for fully local queries. The performance may depend not only on the local execution plan, but also on the remote database, the external data source, network latency, and the amount of data returned from the remote side. For that reason, queries referencing external tables are good candidates for review before allowing them to participate in FORCE_LAST_GOOD_PLAN. In this scenario, the first step was to identify the Query Store query_id associated with the query referencing the external table. Since the query text was available in Query Store, we searched for the external table name in sys.query_store_query_text. SELECT q.query_id, p.plan_id, p.is_forced_plan, p.plan_forcing_type_desc, p.force_failure_count, p.last_force_failure_reason_desc, p.last_execution_time, qt.query_sql_text FROM sys.query_store_query_text AS qt INNER JOIN sys.query_store_query AS q ON qt.query_text_id = q.query_text_id INNER JOIN sys.query_store_plan AS p ON q.query_id = p.query_id WHERE qt.query_sql_text LIKE N'%ExternalCustomers%' ORDER BY p.last_execution_time DESC; Once the query_id was identified, the next step was to exclude that specific query from Automatic Plan Correction by setting FORCE_LAST_GOOD_PLAN to OFF for that query_id. EXECUTE sys.sp_configure_automatic_tuning @option = 'FORCE_LAST_GOOD_PLAN', @type = 'QUERY', @type_value = N'<query_id>', @option_value = 'OFF'; For example: EXECUTE sys.sp_configure_automatic_tuning @option = 'FORCE_LAST_GOOD_PLAN', @type = 'QUERY', @type_value = N'1574', @option_value = 'OFF'; This does not disable Automatic Plan Correction for the entire database. It only tells Automatic Plan Correction to ignore this specific Query Store query ID for FORCE_LAST_GOOD_PLAN. With this approach, Automatic Plan Correction can remain enabled for the rest of the database workload, while selected queries that depend on external or remote data access can be reviewed and excluded individually when automatic plan forcing is not expected to provide a clear benefit.Scaling Write Throughput in Azure Database for MySQL Using Application-Level Sharding
This blog post walks through scaling write throughput in Azure Database for MySQL using application level sharding. It starts with the why behind sharding and then builds a complete C# implementation that spreads writes across three Azure Database for MySQL Flexible Servers. Why Shard in the First Place? This post focuses specifically on scaling write throughput. A well-tuned single primary node can take you remarkably far, and techniques such as indexing strategies, write batching, redo log optimization, and vertical compute scaling each deliver real, lasting value. For many workloads, these optimizations are all you will ever need. That said, as write volume continues to grow, a single primary eventually approaches its practical capacity, and at that point the most durable way to keep scaling is to distribute the write workload across multiple primary instances. This architecture is what we call sharding. When you reach this inflection point, there are two primary patterns for managing multiple write nodes: Proxy or Middleware Layer Sharding: A sharding aware proxy sits between the application and a pool of Azure Database for MySQL instances, routing queries based on a shard key. While this abstracts the underlying topology from the application layer, it introduces an additional, complex component to operate, secure, scale, and patch. Application Layer Sharding: The application itself resolves the destination shard key and determines which of the N Azure Database for MySQL instances should receive a write before ever opening a database connection. Each backend target remains a completely standard, independent Azure Database for MySQL instance. This post explores the second approach. The core appeal of application layer sharding is architectural simplicity: it introduces zero infrastructure overhead and eliminates an extra network hop. Every shard behaves exactly like a standalone instance, meaning your existing backup, restore, monitoring pipelines, and the Azure portal function seamlessly without modification. The explicit tradeoff is that you forgo cross shard joins and distributed transactions in exchange for absolute predictability and control over data access patterns. The Plan We will build a small order management service that distributes its data across three Azure Database for MySQL instances that already exist. The application, written in C# on .NET 8, owns the partitioning logic. The premise: the three servers are already provisioned, the firewalls are configured, the network paths are established, and each server has its own administrative credentials. We are not provisioning infrastructure in this post. we are writing the application code that consumes it. mysql-shard-0.mysql.database.azure.com user: shard0_admin pwd: <secret-0> mysql-shard-1.mysql.database.azure.com user: shard1_admin pwd: <secret-1> mysql-shard-2.mysql.database.azure.com user: shard2_admin pwd: <secret-2> Each server hosts an identical appdb database with the same schema: CREATE TABLE users ( user_id BIGINT NOT NULL PRIMARY KEY, email VARCHAR(255) NOT NULL, created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, UNIQUE KEY uq_email (email) ); CREATE TABLE orders ( order_id BIGINT NOT NULL PRIMARY KEY, user_id BIGINT NOT NULL, amount_cents INT NOT NULL, created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, KEY ix_user (user_id) ); Two design decisions in this schema warrant explanation: No AUTO_INCREMENT for user_id or order_id. Two shards would otherwise generate the same value 42 independently. Instead, we assign identifiers in the application, using a scheme such as Snowflake, ULID, or UUIDv7. orders carries user_id, and we route by it. This is the single most important rule of sharding: choose a shard key that keeps related data colocated, so that the common queries remain on a single shard. A note on UNIQUE KEY uq_email. A unique index enforces uniqueness only within a single physical shard. Because we route by user_id, two users with different IDs and the same email may land on different shards, and both inserts will succeed. If you require globally unique emails, two options exist: (a) maintain a separate email → user_id lookup table on a single "directory" server and write to it first within an idempotent flow, or (b) shard the users table by a hash of email instead. We retain user_id routing throughout this post because it is the correct choice for orders, and we treat per shard email uniqueness as a best effort guard rather than a hard global invariant. How the Partitioning Works The naive approach to sharding is shard = hash(key) % N. This works until you need to add a fourth server, at which point roughly 75% of your data must move. In any system of meaningful size, that is prohibitively expensive. The established solution is virtual buckets. You hash the key into a large, fixed bucket space (here, 1024), then map buckets to physical shards. When you add capacity, you relocate only buckets; you never rehash the entire dataset. In production, the bucket_to_shard_map typically resides in a system such as Azure App Configuration or etcd, so that you can rebalance without redeploying. For this post, we keep it as an in-memory array seeded at startup, which is straightforward to replace later. The Project ShardingDemo/ ├── ShardingDemo.csproj ├── appsettings.json ├── Models.cs ├── ShardRouter.cs ├── UserRepository.cs └── Program.cs ShardingDemo.csproj <Project Sdk="Microsoft.NET.Sdk"> <PropertyGroup> <OutputType>Exe</OutputType> <TargetFramework>net8.0</TargetFramework> <Nullable>enable</Nullable> <ImplicitUsings>enable</ImplicitUsings> </PropertyGroup> <ItemGroup> <PackageReference Include="MySqlConnector" Version="2.6.0" /> <PackageReference Include="Microsoft.Extensions.Hosting" Version="8.0.0" /> <PackageReference Include="Microsoft.Extensions.Configuration.Binder" Version="8.0.0" /> </ItemGroup> <ItemGroup> <Content Include="appsettings.json" CopyToOutputDirectory="PreserveNewest" /> </ItemGroup> </Project> appsettings.json Shards is an ordered list, and a shard's position in the array is its logical ID. { "Shards": [ { "Host": "mysql-shard-0.mysql.database.azure.com", "Database": "appdb", "User": "shard0_admin", "Password": "REPLACE_ME_0" }, { "Host": "mysql-shard-1.mysql.database.azure.com", "Database": "appdb", "User": "shard1_admin", "Password": "REPLACE_ME_1" }, { "Host": "mysql-shard-2.mysql.database.azure.com", "Database": "appdb", "User": "shard2_admin", "Password": "REPLACE_ME_2" } ] } Models.cs namespace ShardingDemo; public sealed record User(long UserId, string Email, DateTime CreatedAt); public sealed record Order(long OrderId, long UserId, int AmountCents, DateTime CreatedAt); public sealed class ShardConfig { public required string Host { get; init; } public required string Database { get; init; } public required string User { get; init; } public required string Password { get; init; } } ShardRouter.cs using System.Security.Cryptography; using System.Text; using MySqlConnector; namespace ShardingDemo; public sealed class Shard : IAsyncDisposable { public int Id { get; } public MySqlDataSource DataSource { get; } public Shard(int id, ShardConfig cfg) { Id = id; var csb = new MySqlConnectionStringBuilder { Server = cfg.Host, Port = 3306, Database = cfg.Database, UserID = cfg.User, Password = cfg.Password, SslMode = MySqlSslMode.Required, Pooling = true, MinimumPoolSize = 2, MaximumPoolSize = 100, ConnectionTimeout = 10, DefaultCommandTimeout = 30, }; DataSource = new MySqlDataSourceBuilder(csb.ConnectionString).Build(); } public ValueTask DisposeAsync() => DataSource.DisposeAsync(); } public sealed class ShardRouter : IAsyncDisposable { private const int VirtualBuckets = 1024; private readonly IReadOnlyList<Shard> _shards; private readonly int[] _bucketToShardId; public ShardRouter(IEnumerable<ShardConfig> configs) { _shards = configs.Select((c, i) => new Shard(i, c)).ToList(); // Even distribution. Replace with a map loaded from your control plane for live rebalancing. _bucketToShardId = new int[VirtualBuckets]; for (int i = 0; i < VirtualBuckets; i++) _bucketToShardId[i] = i % _shards.Count; } public IReadOnlyList<Shard> AllShards => _shards; private static int BucketFor(long shardKey) { byte[] hash = MD5.HashData(Encoding.ASCII.GetBytes(shardKey.ToString())); // Use the first byte pair as an unsigned value, then map it into the bucket space. int value = (hash[0] << 8) | hash[1]; return value % VirtualBuckets; } public Shard ShardForKey(long shardKey) { int bucket = BucketFor(shardKey); return _shards[_bucketToShardId[bucket]]; } public async ValueTask DisposeAsync() { foreach (var s in _shards) await s.DisposeAsync(); } } UserRepository.cs Observe that every per user method calls ShardForKey(userId), even when inserting an order. This is the colocation rule at work. An order and its owning user always reside on the same shard, so queries for a single user only ever reach one shard. Only the cross-shard aggregate (TotalRevenueCentsAsync) must fan out. using MySqlConnector; namespace ShardingDemo; public sealed class UserRepository { private readonly ShardRouter _router; public UserRepository(ShardRouter router) { _router = router; } public async Task CreateUserAsync(long userId, string email, CancellationToken ct = default) { var shard = _router.ShardForKey(userId); await using var conn = await shard.DataSource.OpenConnectionAsync(ct); await using var cmd = conn.CreateCommand(); cmd.CommandText = "INSERT INTO users (user_id, email) VALUES (@id, Email)"; cmd.Parameters.AddWithValue("@id", userId); cmd.Parameters.AddWithValue("@email", email); await cmd.ExecuteNonQueryAsync(ct); } public async Task<User?> GetUserAsync(long userId, CancellationToken ct = default) { var shard = _router.ShardForKey(userId); await using var conn = await shard.DataSource.OpenConnectionAsync(ct); await using var cmd = conn.CreateCommand(); cmd.CommandText = "SELECT user_id, email, created_at FROM users WHERE user_id = ID"; cmd.Parameters.AddWithValue("@id", userId); await using var reader = await cmd.ExecuteReaderAsync(ct); if (!await reader.ReadAsync(ct)) return null; return new User(reader.GetInt64(0), reader.GetString(1), reader.GetDateTime(2)); } public async Task AddOrderAsync(long orderId, long userId, int amountCents, CancellationToken ct = default) { // Routed by user_id, so orders colocate with their owning user. var shard = _router.ShardForKey(userId); await using var conn = await shard.DataSource.OpenConnectionAsync(ct); await using var cmd = conn.CreateCommand(); cmd.CommandText = """ INSERT INTO orders (order_id, user_id, amount_cents) VALUES (@oid, @uid, amt) """; cmd.Parameters.AddWithValue("@oid", orderId); cmd.Parameters.AddWithValue("@uid", userId); cmd.Parameters.AddWithValue("@amt", amountCents); await cmd.ExecuteNonQueryAsync(ct); } public async Task<IReadOnlyList<Order>> GetOrdersForUserAsync(long userId, CancellationToken ct = default) { var shard = _router.ShardForKey(userId); await using var conn = await shard.DataSource.OpenConnectionAsync(ct); await using var cmd = conn.CreateCommand(); cmd.CommandText = """ SELECT order_id, user_id, amount_cents, created_at FROM orders WHERE user_id = @uid """; cmd.Parameters.AddWithValue("@uid", userId); var list = new List<Order>(); await using var reader = await cmd.ExecuteReaderAsync(ct); while (await reader.ReadAsync(ct)) { list.Add(new Order( reader.GetInt64(0), reader.GetInt64(1), reader.GetInt32(2), reader.GetDateTime(3))); } return list; } /// <summary>Cross shard fanout.</summary> public async Task<long> TotalRevenueCentsAsync(CancellationToken ct = default) { var tasks = _router.AllShards.Select(async shard => { await using var conn = await shard.DataSource.OpenConnectionAsync(ct); await using var cmd = conn.CreateCommand(); cmd.CommandText = "SELECT COALESCE(SUM(amount_cents), 0) FROM orders"; var result = await cmd.ExecuteScalarAsync(ct); return Convert.ToInt64(result); }); var perShard = await Task.WhenAll(tasks); return perShard.Sum(); } } Program.cs using Microsoft.Extensions.Configuration; using Microsoft.Extensions.DependencyInjection; using Microsoft.Extensions.Hosting; using ShardingDemo; var builder = Host.CreateApplicationBuilder(args); // Bind Shards:[] from appsettings.json (override with user-secrets / env vars / Key Vault) var shardConfigs = builder.Configuration .GetSection("Shards") .Get<List<ShardConfig>>() ?? throw new InvalidOperationException("No 'Shards' section configured."); if (shardConfigs.Count == 0) throw new InvalidOperationException("At least one shard must be configured."); builder.Services.AddSingleton(_ => new ShardRouter(shardConfigs)); builder.Services.AddSingleton<UserRepository>(); using var host = builder.Build(); var repo = host.Services.GetRequiredService<UserRepository>(); var router = host.Services.GetRequiredService<ShardRouter>(); (long Id, string Email)[] users = { (1001, "ada@example.com"), (2002, "linus@example.com"), (3003, "grace@example.com"), (4004, "alan@example.com"), }; foreach (var (id, email) in users) { await repo.CreateUserAsync(id, email); Console.WriteLine($"user {id} -> shard {router.ShardForKey(id).Id}"); } await repo.AddOrderAsync(orderId: 9001, userId: 1001, amountCents: 4999); await repo.AddOrderAsync(orderId: 9002, userId: 1001, amountCents: 1299); await repo.AddOrderAsync(orderId: 9003, userId: 2002, amountCents: 8800); Console.WriteLine($"\nAda: {await repo.GetUserAsync(1001)}"); Console.WriteLine($"Ada's orders: {(await repo.GetOrdersForUserAsync(1001)).Count}"); Console.WriteLine($"\nTotal revenue across 3 shards: " + $"${await repo.TotalRevenueCentsAsync() / 100m:F2}"); await router.DisposeAsync(); Tracing One Request End to End Consider GetOrdersForUserAsync(1001): ShardForKey(1001) → MD5("1001") → first two bytes as a number → % 1024 → a bucket in the range 0..1023. bucket % 3 → a physical shard → for example mysql-shard-2.mysql.database.azure.com. The MySqlDataSource provides a pooled, TLS encrypted connection authenticated as shard2_admin. The query runs against shard 2's local ix_user index, with no fan out and at single server speed. Every call with userId = 1001, whether GetUser, AddOrder, or GetOrdersForUser, lands on the same shard. That is why orders JOIN users ON orders.user_id = users.user_id WHERE user_id = 1001 executes within a single shard, with no cross-shard traffic. Conclusion The essential point is this. Once a single primary can no longer absorb your write load, sharding becomes a durable answer, and implementing it at the application layer keeps every part of the system explicit and comprehensible. When write volume or dataset size outgrows a single primary, application layer sharding provides several benefits. N independent Azure Database for MySQL instances, each absorbing 1/N of the write traffic. Queries by user that remain on a single shard and behave like an ordinary, modestly sized database. A bucket map approach that allows you to add a fourth, fifth, or Nth shard later by relocating slices of data rather than rehashing the entire dataset. A failure of one shard that affects 1/N of your users rather than all of them. These benefits come at a genuine cost. You must generate identifiers in the application, global uniqueness requires a secondary lookup table, and aggregate queries fan out across shards. A cross shard write, one that must atomically update data on two different shards, can no longer rely on a single database transaction. Instead it needs an orchestrated sequence of local transactions, where each step carries a compensating action that undoes its effect if a later step fails. None of these are insurmountable. They are simply responsibilities you now assume. Sharding is a deliberate step to take only once a single primary has genuinely exhausted its write headroom. When you reach that point, the implementation in this post is a representative blueprint. Stay Connected We welcome your feedback and invite you to share your experiences or suggestions at AskAzureDBforMySQL@service.microsoft.com Thank you for choosing Azure Database for MySQL!154Views2likes0CommentsKB5089573 forced install + Lenovo network stack degradation (HTTPS latency, build 26200.8524)
KB5089573 installed automatically on a Lenovo system despite updates being paused and no preview updates enabled. After installation, the system jumped to build 26200.8524 and the network stack degraded severely. Heavy HTTPS sites (LinkedIn, Google Finance, YouTube) take 20–60 seconds to load across all browsers. Speedtest is normal, other devices on the same network are unaffected, and both Edge and Chrome show identical latency. DISM shows no package for KB5089573 and the update cannot be uninstalled. Looking for correlation data from other Lenovo users.116Views0likes3CommentsLessons Learned #540:Bulk Insert Throughput in Azure SQL Hyperscale with Partitioned Heap Tables
In this lesson learned, I would like to share an interesting scenario working on a service request where our customer was running a high-volume data load process in Azure SQL Database Hyperscale. The workload was based on a common pattern: Recreate a staging table. Load a large number of rows using bulk insert. The bulk insert showed unstable execution times and became the main area to investigate. The process was loading a very large number of rows into an Azure SQL Database Hyperscale database. The process used a staging table that was initially loaded as a heap. The main concern was the inconsistent execution time during the load process. Why Manually Adding Data Files Was Not the Right Direction In Azure SQL Database Hyperscale, the storage architecture is different from a traditional SQL Server deployment. The data layout and storage management are handled internally by the service. Because of this architecture, manually creating or pre-allocating multiple data files is not the same tuning option that we may consider in SQL Server on-premises or SQL Server running on Azure Virtual Machines. For this reason, the troubleshooting focus moved from manual file layout configuration to the actual workload pattern, waits, concurrency, batch size, and staging table design. What We Observed During the bulk insert phase, waits such as PAGELATCH_EX were observed. Since the staging table was loaded as a heap and the clustered primary key was created only after the bulk insert completed, OPTIMIZE_FOR_SEQUENTIAL_KEY was not directly applicable to the bulk insert phase. This changed the direction of the investigation. Instead of focusing on last-page insert contention on an existing clustered index, the analysis moved toward heap insert behavior, allocation contention, concurrency, batch size, and whether a different staging table design could help. First Recommendation: Start with Low-Impact Changes Before changing the table design, the first recommendation was to test the least intrusive changes: Reduce the number of concurrent bulk insert sessions. Increase the batch size, for example from 10,000 rows to 50,000 or 100,000 rows. Test TABLOCK on the dedicated heap staging table. The goal was to avoid assuming that more concurrency would always reduce the total execution time. In some high-volume load scenarios, excessive concurrency may increase contention and make the process less stable. The Interesting Design Option: Partitioned Heap Staging Table One of the most interesting design options was to evaluate a partitioned heap staging table. The idea is simple: instead of loading all rows into a non-partitioned heap staging table, the staging table can be created on the same partition scheme used by the target table, using the same partitioning column. This does not mean that a partitioned heap will always be faster. However, it can be a useful design option when: The bulk load phase is affected by allocation or latch contention. Concurrent load processes can naturally distribute rows across different partition ranges. The staging table is used only as an intermediate structure.Lessons Learned The main lessons from this scenario were: In Azure SQL Database Hyperscale, manually managing multiple data files is not the right tuning direction. PAGELATCH_EX during heap loading may point to concurrency or allocation-related contention. Reducing concurrency can sometimes improve total throughput. Larger batch sizes may provide better results than many small batches. TABLOCK on a dedicated heap staging table is a low-impact test worth evaluating. A partitioned heap staging table can be a valid second-phase design option when the load can be distributed across partition ranges. The best approach is to test small, measurable changes before introducing architectural redesigns. Final Thoughts A partitioned heap staging table can be a powerful option, but only when it is tested carefully and when the workload pattern can benefit from partition distribution.Performance Tuning Cold Starts, Scaling Delays, and Startup Latency in Azure Container Apps
Introduction There is a particular kind of frustration that comes not when your application fails to start, but when it starts too slowly. The container is running, the health probes pass, your monitoring shows green — and yet every few minutes a user somewhere in the world hits a request that takes 15 seconds to respond. Your support team starts getting tickets. Your SLA dashboard turns amber. This is the cold start problem, and it is one of the most widely discussed pain points with any serverless container platform. Azure Container Apps is no different. But what most engineers do not realize is that the cold start is only one part of the story. Scaling delays, inefficient image layers, wrong resource allocations, and misconfigured KEDA rules all compound to create latency spikes that feel indistinguishable from cold starts but have completely different root causes and fixes. In this part of the series, we break down each cause systematically and show you exactly how to address it. Understanding What "Cold Start" Actually Means in Container Apps Before we fix anything, it helps to understand what is actually happening during a cold start. When a new replica is created, Azure Container Apps needs to do several things in sequence before your application can serve a single request: The platform schedules the new replica on available infrastructure. The container runtime pulls the image layers that are not already cached on that node. The container starts and the process inside it begins executing. Your application framework initializes (the .NET DI container, Django's ORM layer, loaded ML models, etc.). The readiness probe passes, signaling that the replica can accept traffic. Every one of these steps takes time. The total duration is your cold start latency. When you have `minReplicas: 0`, this full cycle happens for every "first request after idle" scenario. With `minReplicas: 1`, steps 1 and 2 are already done, but steps 3–5 still happen whenever a new replica is created due to scaling out. Scenario 1: Requests Spike to 10+ Seconds After a Period of Inactivity What You See Everything looks fine during load testing, but the next morning after a quiet night, the first user to hit the app gets a timeout or a very slow response. You check your Application Insights or Log Analytics and you see exactly one request with a dramatically higher duration than all the others. Why This Happens You have `minReplicas` set to `0` (or it defaults to 0). When there are no replicas running and a new request arrives, the entire cold start sequence kicks off — and the request waits in the ingress queue the entire time. Depending on your image size and application initialization time, this can easily reach 15–30 seconds for a .NET application with a large DI graph, or even longer for a Python application that imports heavy libraries. The Fix Option A (Recommended for most workloads): Set `minReplicas` to 1. This ensures at least one replica is always warm and ready to handle requests. You will pay for that one replica's compute even during idle periods, but you eliminate the cold start for your users: az containerapp update --name my-dotnet-api --resource-group my-rg --min-replicas 1 --max-replicas 10 Or in your Container App YAML: scale: minReplicas: 1 maxReplicas: 10 rules: - name: http-scaling-rule http: metadata: concurrentRequests: "10" Option B: Reduce image size to speed up the pull. Every megabyte in your container image adds time to cold starts. A production .NET API should not be a 2 GB image. Use multi-stage builds to strip away the SDK, test tools, and development dependencies: # Stage 1: Build FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build WORKDIR /src COPY ["MyApp.csproj", "."] RUN dotnet restore COPY . . RUN dotnet publish -c Release -o /app/publish --no-restore # Stage 2: Runtime only - much smaller FROM mcr.microsoft.com/dotnet/aspnet:8.0 AS final WORKDIR /app COPY --from=build /app/publish . # Run as non-root for security USER app EXPOSE 8080 ENTRYPOINT ["dotnet", "MyApp.dll"] For Django, the equivalent pattern is: FROM python:3.11-slim AS base # Install only production dependencies WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt && find /usr/local -name "*.pyc" -delete && find /usr/local -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null || true COPY . . RUN SECRET_KEY=placeholder python manage.py collectstatic --noinput USER nobody EXPOSE 8000 CMD ["gunicorn", "myproject.wsgi:application", "--bind", "0.0.0.0:8000", "--workers", "2"] Option C: Use a startup probe to manage the readiness window. If your app genuinely needs 20–30 seconds to initialize (loading configuration, warming caches, establishing connection pools), configure a startup probe separately from your liveness probe. This gives the container time to start without the liveness probe killing it prematurely: probes: - type: Startup httpGet: path: /health port: 8080 initialDelaySeconds: 5 periodSeconds: 5 failureThreshold: 12 # 12 * 5s = 60 seconds total window - type: Liveness httpGet: path: /health port: 8080 periodSeconds: 10 failureThreshold: 3 - type: Readiness httpGet: path: /health/ready port: 8080 periodSeconds: 5 failureThreshold: 3 Scenario 2: New Replicas Lag Behind Traffic Spikes What You See Your application handles steady traffic just fine. Then a sudden burst arrives — a marketing email goes out, a scheduled batch job triggers API calls, or a downstream system fires webhooks — and for 30–60 seconds your error rate jumps and your latency spikes. After that window, everything recovers. The scaling logs show new replicas were created, but they came online too late. Why This Happens KEDA (the scaling engine behind Container Apps) works reactively. By default, HTTP-based scaling triggers new replicas when concurrent requests exceed the configured threshold. But there is an inherent delay between the moment traffic spikes, the moment KEDA detects it, and the moment a new replica is warm and serving traffic. This window is where your users experience the pain. Additionally, if your image pull takes a long time (large image, first pull on a new node), the new replica arrives even later. KEDA cannot compensate for slow image pulls. The Fix Step 1 — Tune your KEDA scaling rules to trigger earlier. Rather than waiting until you are already at capacity, configure scaling to trigger with a lower concurrency threshold. If your app can handle 20 concurrent requests comfortably, set the threshold to 10 so new replicas spin up before you are overwhelmed: scale: minReplicas: 1 maxReplicas: 20 rules: - name: http-rule http: metadata: concurrentRequests: "10" # Scale earlier, not at capacity For Azure Service Bus or Event Hubs-triggered scaling (common in job-style workloads), use a queue length threshold that gives you a buffer: scale: minReplicas: 0 maxReplicas: 30 rules: - name: servicebus-rule custom: type: azure-servicebus metadata: queueName: my-processing-queue namespace: my-servicebus-namespace messageCount: "5" # Scale when queue depth reaches 5, not 100 auth: - secretRef: servicebus-connection triggerParameter: connection Step 2 — Pre-warm your connection pools in .NET. One of the biggest contributors to new replica slowness is the time spent establishing database connections and other external connections. The first request that hits a new replica bears the cost of opening the connection pool. Configure your connection pool to warm up eagerly at startup: // In Program.cs, after building the app if (app.Environment.IsProduction()) { // Warm up the database connection pool before accepting traffic using var scope = app.Services.CreateScope(); var dbContext = scope.ServiceProvider.GetRequiredService<AppDbContext>(); await dbContext.Database.ExecuteSqlRawAsync("SELECT 1"); } await app.RunAsync(); Step 3 — Enable HTTP/2 keep-alive and connection reuse. In .NET applications running behind the Container Apps ingress, configure your HTTP client to use connection pooling properly: builder.Services.AddHttpClient("downstream-api", client => { client.BaseAddress = new Uri("https://my-downstream-service"); client.DefaultRequestVersion = new Version(2, 0); }) .ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler { PooledConnectionLifetime = TimeSpan.FromMinutes(5), PooledConnectionIdleTimeout = TimeSpan.FromMinutes(2), MaxConnectionsPerServer = 20 }); Scenario 3: Django Startup Is Slow Due to Import Time What You See Your Django application takes 8–12 seconds to start even on a warm node. You check the Gunicorn startup logs and see it spending most of that time in Python module imports before it ever processes a request. Why This Happens Python's import system is synchronous and single-threaded. When you import `django`, `rest_framework`, `pandas`, `numpy`, or any large library, Python reads and executes every module file in the dependency chain. A Django project with Django REST Framework, Celery, and a few third-party packages can easily spend 5–8 seconds just on imports. Multiply that by the number of Gunicorn workers (each is a separate process that imports everything independently) and startup time balloons. The Fix Step 1 — Profile import time to find the worst offenders. Add this to your Dockerfile's entrypoint or run it manually: # Run this in a container shell to see which imports take the longest python -X importtime -c "import django; django.setup()" 2>&1 | sort -k2 -rn | head -20 Step 2 — Use lazy imports for heavy dependencies that are not needed at startup. Instead of importing everything at the module level, defer imports to the functions that actually need them: # Instead of this at the top of your file: import pandas as pd import numpy as np # Do this — import only when the function is actually called: def process_data(data): import pandas as pd import numpy as np df = pd.DataFrame(data) return df.describe().to_dict() Step 3 — Reduce Gunicorn worker count for memory-constrained environments. Having too many workers means too many independent Python processes all importing everything at the same time. For Container Apps with 0.5–1 vCPU, 2 workers is usually the right starting point: CMD ["gunicorn", "myproject.wsgi:application", "--bind", "0.0.0.0:8000", "--workers", "2", "--worker-class", "gthread", "--threads", "4", "--timeout", "120", "--keep-alive", "5", "--log-level", "info"] Step 4 — Consider switching from Gunicorn to Uvicorn for async Django. If you are on Django 4.x with ASGI support, Uvicorn with async workers can handle significantly more concurrent requests per worker than synchronous Gunicorn workers: CMD ["uvicorn", "myproject.asgi:application", "--host", "0.0.0.0", "--port", "8000", "--workers", "2", "--log-level", "info"] Scenario 4: Resource Limits Are Causing Throttling and Slow Responses What You See Your application starts fine and handles light traffic well, but under moderate to heavy load — even well below your max replicas — individual requests become slow and CPU metrics show your replicas running near 100% utilization. You may also see the .NET GC (garbage collector) running very frequently, or Django showing slow database queries that are actually fast queries being delayed because the process has no CPU to run. Why This Happens Container Apps defaults to 0.25 vCPU and 0.5 Gi memory if you do not specify resource limits. For a production .NET API or a Django application serving real traffic, this is almost always too little. When a container hits its CPU limit, the container runtime throttles it — the process continues to run but gets less CPU time, making everything slower without any obvious error signal. The Fix Step 1 — Measure actual resource usage before guessing. Query Log Analytics for actual CPU and memory usage to establish a baseline: ContainerAppSystemLogs_CL | where ContainerAppName_s == "my-dotnet-api" | where TimeGenerated > ago(7d) | summarize AvgCpuUsage = avg(todouble(CpuUsageNanoCores_d)) / 1000000, MaxCpuUsage = max(todouble(CpuUsageNanoCores_d)) / 1000000, AvgMemoryMB = avg(todouble(MemoryWorkingSetBytes_d)) / 1048576, MaxMemoryMB = max(todouble(MemoryWorkingSetBytes_d)) / 1048576 by bin(TimeGenerated, 1h) | order by TimeGenerated desc Step 2 — Update resource allocations based on what you observed. az containerapp update --name my-dotnet-api --resource-group my-rg --cpu 0.5 --memory 1.0Gi Container Apps has specific valid CPU/memory combinations. The valid pairs are: `0.25 vCPU / 0.5 Gi`, `0.5 vCPU / 1.0 Gi`, `0.75 vCPU / 1.5 Gi`, `1.0 vCPU / 2.0 Gi`, and up to `4.0 vCPU / 8.0 Gi`. You cannot mix arbitrary values. Step 3 — Configure .NET GC for server workloads. By default, .NET uses the workstation GC mode which is tuned for interactive applications. For server containers, use server GC mode and configure the heap size appropriately: // In runtimeconfig.template.json or via environment variables { "configProperties": { "System.GC.Server": true, "System.GC.HeapHardLimit": 805306368, "System.GC.HighMemoryPercent": 75 } } Or as environment variables in your Container App: az containerapp update --name my-dotnet-api --resource-group my-rg --set-env-vars "DOTNET_GCConserveMemory=5" "DOTNET_GCHeapHardLimit=805306368" Measuring the Impact of Your Changes After making changes, use this Log Analytics query to track your startup times over the past 24 hours and confirm the improvements: ContainerAppConsoleLogs_CL | where ContainerAppName_s == "my-dotnet-api" | where Log_s contains "Application started" or Log_s contains "Now listening on" | project TimeGenerated, Log_s, ContainerName_s | order by TimeGenerated desc And check request duration percentiles in Application Insights: requests | where cloud_RoleName == "my-dotnet-api" | where timestamp > ago(24h) | summarize p50 = percentile(duration, 50), p90 = percentile(duration, 90), p99 = percentile(duration, 99), count = count() by bin(timestamp, 1h) | order by timestamp desc Summary: Your Performance Tuning Quick Reference Here is a quick decision guide based on what you are seeing: Symptom Most Likely Cause First Fix to Try First request after idle is very slow `minReplicas: 0` Set `minReplicas: 1` Spike period has errors, then recovers KEDA scaling too slow Lower concurrentRequests threshold New replicas start slowly Large image size Multi-stage Docker build High CPU at moderate traffic Under-allocated resources Increase CPU/memory allocation Django startup is slow Heavy Python imports Profile and defer imports .NET app slow under load Workstation GC mode Enable server GC References and Sample Resources Use these links to tune startup performance, scaling behavior, and runtime efficiency. Azure Container Apps docs (core) Scale applications in Azure Container Apps Workload profiles overview Health probes in Azure Container Apps Revisions in Azure Container Apps Monitoring and logging in Azure Container Apps Runtime and framework performance references Docker multi-stage builds .NET performance best practices for ASP.NET Core .NET runtime GC configuration Django performance optimization Uvicorn deployment guide Scaling engine references and samples KEDA concepts and documentation KEDA scaler samples Azure Samples: .NET on Azure Container Apps Azure Samples: Python on Azure Container Apps What's Next In Part 3, we go deeper into the most specialized and complex scenario in this series: troubleshooting AI workloads in Azure Container Apps. Loading large ML models, managing GPU and CPU resource constraints, and dealing with memory pressure from inference workloads all require techniques that go beyond standard web application troubleshooting. Part of the series: Troubleshooting Azure Container Apps in Production Next: Part 3 — Troubleshooting ML Model Loading, GPU Issues, and Memory Pressure in Azure Container AppsYour PostgreSQL workflow just found its new home in Cursor
TL; DR: Our Visual Studio Code extension for PostgreSQL is now available on the Open VSX registry: Cursor users get first-class database tooling without leaving the editor that already understands their code. The context switch problem If you use Cursor, you know the feeling. You’re deep in an agentic flow. Composer is scaffolding a feature across multiple files. Tab is anticipating your next move. Then you need to check a table's schema or run a quick query, so you switch to a different tool, and then you lose your flow state, and spend 30 seconds remembering which connection goes to which environment. That context switch is expensive. Not in minutes, but in momentum. Why we built for Cursor (and Open VSX) Cursor is built on the VS Code ecosystem, which means it supports VS Code extensions natively. It uses the Open VSX registry: an open, vendor-neutral extension marketplace where database tooling options have been limited. We saw an opportunity: bring a modern PostgreSQL extension directly to where developers do their most productive work. By publishing to Open VSX, we make sure that developers across the entire VS Code-compatible ecosystem, including Cursor, Windsurf, AWS Kiro, Theia, and Ona all have access without workarounds. Where AI-powered editing meets database awareness Here’s what gets interesting. Cursor indexes your entire codebase semantically. It knows your Drizzle schemas, your raw SQL files, and your migration directories. Our extension completes the picture by giving the editor a live connection to the actual database. Here’s where they intersect: Schema explorer in your sidebar: browse tables, columns, indexes, and functions without leaving the editor. When Cursor’s agent asks “what columns does the users table have?”, the answer is already visible. Screenshot: Object Explorer sidebar showing tables, columns, and indexes expanded Connection-aware IntelliSense: autocomplete table names, column names, and functions based on your live database schema. This pairs naturally with Cursor’s Tab completions: the AI writes the application logic, and IntelliSense validates the SQL. Inline EXPLAIN diagnostics: catch performance issues before they ship. Write a query and see whether it uses an index or triggers a sequential scan, all without running a separate tool. Zero-config connection discovery: we detect .env files, docker-compose.yml, and ORM connection strings in your project. Your database connection follows your workspace, not a global settings file buried three menus deep. Result export and inline execution: select SQL, run it, and see results in a clean panel. Export to JSON or CSV when you need to share findings with your team. Features that make Cursor + PostgreSQL even better Beyond the basics, the extension includes capabilities that pair especially well with AI-powered workflows: MCP server for AI assistants: the extension registers a Model Context Protocol (MCP) server, so Cursor’s agent can discover and interact with your PostgreSQL databases directly through a standardized tool interface. Ask your AI assistant to inspect a table, run a diagnostic query, or analyze a plan: it has the tools to do it. Agent Mode database tools: dedicated DBAgent MCP tools give AI assistants richer database-analysis capabilities, from schema introspection to performance diagnostics and instruction management. Query plan visualization: explore EXPLAIN output in four synchronized views: an interactive node graph, icicle chart, sortable table, and raw source. Color-coded severity groups expose bottlenecks at a glance, and AI-assisted analysis provides optimization suggestions. Performance dashboard: investigate database performance with DB load charts, query activity, wait-event analysis, session health, and blocking chains. Use natural language to inspect trends, identify bottlenecks, and generate diagnostic SQL. Object Explorer search: find database objects by name without expanding the tree. Search across connections, databases, and schemas. Filter by object type or schema name and navigate directly to any result. Schema-aware “New Query”: right-click a schema in Object Explorer to open a new query with the appropriate search_path already set. No more manual SET search_path before writing queries. Multi-source connection profiles: save connection profiles to your user settings, workspace, or folder. Check workspace profiles into source control so every team member gets the right database connections when they open the project. SSH tunneling built in: connect to databases on private networks through SSH tunnels configured directly in the connection dialog, with ssh-agent support for private key authentication. Built for how you actually work Modern development means ephemeral environments, branch-specific databases, and containers everywhere. The extension is designed around this reality: Automatic detection of PostgreSQL instances running in Docker Project-scoped connections that travel with your workspace Support for standard PostgreSQL connections Integration with both local and cloud versions of PostgreSQL from multiple vendors, and first-class support for Azure Database for PostgreSQL and Azure HorizonDB, with provisioning, backup management, and network configuration: all without leaving the editor. Status bar indicator showing your active database at a glance Get started Install from the Open VSX Registry: search for it in Cursor’s extension panel or install the .vsix directly. Your existing VS Code workflow carries over unchanged. If you’re already using Cursor for its agentic capabilities, adding database awareness to the editor means fewer tabs, fewer context switches, and a tighter feedback loop between your application code and the data layer underneath it. Available now on Open VSX. Works with Cursor, Antigravity, and all the VS Code compatible editors.2.2KViews4likes0Comments