Introduction
In the world of database management, performance is key. One of the common challenges faced by database administrators and developers is tuning SQL queries that use functions. This blog will demonstrate this through an example, detailing the steps taken to improve the performance of SQL queries that use functions in PostgreSQL. The queries in the example mentioned are executed on a 4 vCore general purpose SKU on Azure Database for PostgreSQL flexible server
PostgreSQL offers different volatility categories for functions: VOLATILE, STABLE, and IMMUTABLE. Understanding these categories and using them appropriately can significantly impact the performance of your queries:
- VOLATILE: Use for functions that can modify the database or return different results on successive calls with the same arguments.
- STABLE: Use for functions that do not modify the database and return the same result for the same input parameters within a single statement.
- IMMUTABLE: Use for functions that always return the same result for the same input parameters, regardless of any database changes.
This blog will focus on STABLE functions and how they can be leveraged to optimize SQL queries.
Example Setup
Creating Tables and Loading Data
First, the necessary tables are created, and data is loaded:
CREATE TABLE order_header (
order_id bigint NOT NULL,
request_id bigint NOT NULL,
CONSTRAINT order_header_pk PRIMARY KEY (order_id)
);
CREATE TABLE order_detail (
order_detail_id bigint NOT NULL,
order_id bigint NOT NULL,
amount numeric,
CONSTRAINT order_detail_pk PRIMARY KEY (order_detail_id),
CONSTRAINT order_detail_id_fk FOREIGN KEY (order_id) REFERENCES order_header(order_id)
);
ALTER TABLE order_header ADD CONSTRAINT order_detail_uk UNIQUE (order_id);
CREATE INDEX order_id_ix ON order_detail(order_id);
INSERT INTO order_header(order_id, request_id)
SELECT generate_series(1, 1000000), generate_series(1, 1000000);
INSERT INTO order_detail(order_id, order_detail_id, amount)
SELECT gs.id, gs.id1, round((random() * 50)::numeric, 2)
FROM (
SELECT generate_series(1, 1000000) id, generate_series(1, 1000000) id1
UNION ALL
SELECT generate_series(1, 1000000) id, generate_series(1000001, 2000000) id1
UNION ALL
SELECT generate_series(1, 1000000) id, generate_series(2000001, 3000000) id1
) AS gs;
Creating and Tuning Function
Creating Initial Function
A function is created to return the maximum order detail ID for a given order ID:
CREATE OR REPLACE FUNCTION getmaxid(in_orderid bigint)
RETURNS bigint
LANGUAGE plpgsql
AS $function$
DECLARE
v_max_order_detail_id bigint;
BEGIN
SET search_path TO public;
SELECT max(od.order_detail_id) INTO STRICT v_max_order_detail_id
FROM order_detail od
WHERE od.order_id = in_orderid;
RETURN v_max_order_detail_id;
EXCEPTION
WHEN others THEN RETURN 0;
END;
$function$;
Analyzing the performance of a query using this function:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT od.order_id, getmaxid(order_id) AS max_order_detail_id
FROM order_header od
ORDER BY order_id ASC
LIMIT 500000;
The initial query plan shows an execution time of approximately ~11.8 Seconds:
QUERY PLAN |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Limit (cost=0.42..135364.92 rows=500000 width=16) (actual time=0.137..11787.350 rows=500000 loops=1) |
Output: order_id, (getmaxid(order_id)) |
Buffers: shared hit=3001521 |
-> Index Only Scan using order_detail_uk on public.order_header od (cost=0.42..270729.42 rows=1000000 width=16) (actual time=0.136..11717.097 rows=500000 loops=1)|
Output: order_id, getmaxid(order_id) |
Heap Fetches: 27935 |
Buffers: shared hit=3001521 |
Query Identifier: 8878135702534082568 |
Planning Time: 0.091 ms |
Execution Time: 11841.208 ms |
Adding the STABLE Keyword
To improve performance, the STABLE keyword is added to the function:
CREATE OR REPLACE FUNCTION public.getmaxid_updated(in_orderid bigint)
RETURNS bigint
LANGUAGE plpgsql
AS $function$
DECLARE
v_max_order_detail_id bigint;
BEGIN
SELECT max(od.order_detail_id) INTO STRICT v_max_order_detail_id
FROM public.order_detail od
WHERE od.order_id = in_orderid;
RETURN v_max_order_detail_id;
EXCEPTION
WHEN others THEN RETURN 0;
END;
$function$ STABLE;
Note: Set Keyword cannot be used in a function using Stable keyword. For that reason, you can see function is complied with search path added with the function name itself.
Re-running the query with the updated function reduces the execution time to approximately ~9.1 seconds:
QUERY PLAN |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Limit (cost=0.42..135364.92 rows=500000 width=16) (actual time=0.147..9073.129 rows=500000 loops=1) |
Output: order_id, (getmaxid_updated(order_id)) |
Buffers: shared hit=3001521 |
-> Index Only Scan using order_detail_uk on public.order_header od (cost=0.42..270729.42 rows=1000000 width=16) (actual time=0.146..8997.843 rows=500000 loops=1)|
Output: order_id, getmaxid_updated(order_id) |
Heap Fetches: 27935 |
Buffers: shared hit=3001521 |
Query Identifier: 1508106681629434046 |
Planning Time: 0.084 ms |
Execution Time: 9106.193 ms |
Changing to a SQL Function
Further optimization is achieved by changing the function to a SQL function:
CREATE OR REPLACE FUNCTION public.getmaxid_updated_sql(in_orderid bigint)
RETURNS TABLE(max_order_detail_id bigint)
LANGUAGE sql
AS $function$
SELECT max(od.order_detail_id) AS max_order_detail_id
FROM public.order_detail od
WHERE od.order_id = in_orderid;
$function$ STABLE;
The query plan with the SQL function shows an execution time of approximately ~7.4 seconds:
QUERY PLAN |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Limit (cost=0.42..2638.29 rows=500000 width=16) (actual time=0.208..7461.859 rows=500000 loops=1) |
Output: order_id, (getmaxid_updated_sql(order_id)) |
Buffers: shared hit=3001521 |
-> ProjectSet (cost=0.42..5275729.42 rows=1000000000 width=16) (actual time=0.207..7397.261 rows=500000 loops=1) |
Output: order_id, getmaxid_updated_sql(order_id) |
Buffers: shared hit=3001521 |
-> Index Only Scan using order_detail_uk on public.order_header od (cost=0.42..20729.42 rows=1000000 width=8) (actual time=0.033..106.412 rows=500000 loops=1)|
Output: order_id |
Heap Fetches: 27935 |
Buffers: shared hit=1521 |
Query Identifier: -7313490988887048957 |
Planning Time: 0.087 ms |
Execution Time: 7489.221 ms |
The query execution time improved significantly from approximately 11.8 seconds to 7.4 seconds by the end of the third EXPLAIN ANALYZE output.
Conclusion
To tune SQL queries that use functions in PostgreSQL, the following steps can be taken:
- Add the STABLE Keyword: This informs PostgreSQL that the function will return the same result for the same input parameters, allowing for optimizations. For more information on function volatility in PostgreSQL, refer to the official PostgreSQL documentation.
- Use SQL Functions: Converting from PL/pgSQL to SQL functions wherever it is possible, can help improve performance.
- For best optimization results, you should label your functions with the strictest volatility category that is valid for what the code of the function needs to do.
By applying these techniques and following PostgreSQL's recommendations, significant improvements in query execution times can be observed.