Discover how speeding up insert operations can elevate your PostgreSQL database efficiency and keep your systems running smoothly.
Inserts are the backbone of any database and speeding them up is a fantastic way to kickstart your database performance. Optimizing insert operations can significantly enhance the overall performance of a PostgreSQL database. By implementing techniques such as using unlogged tables, leveraging the COPY command for bulk data loading, batching inserts, and disabling constraints, triggers, and indexes before inserts, you can achieve faster data insertion rates. These optimizations are particularly beneficial for daily imports and transactional inserts, as they reduce the time and resources required to process large volumes of data. Additionally, these methods help in minimizing the impact on the database's overall performance, ensuring that other operations can continue to run smoothly.
Furthermore, tuning server parameters plays a crucial role in optimizing insert operations. Adjusting settings such as wal_buffers, shared_buffers, max_wal_size, and checkpoint_timeout can significantly improve the efficiency of bulk data loads. For instance, increasing wal_buffers and max_wal_size during bulk uploads can help in managing the write-ahead log more effectively, while setting shared_buffers to 40% of the available RAM ensures that the database has enough memory to handle large transactions. Disabling autovacuum during data loads and performing a manual VACUUM ANALYZE post-load can also enhance performance by reducing the overhead associated with automatic maintenance tasks. By carefully tuning these parameters, you can ensure that your PostgreSQL database can handle large-scale data loads with minimal latency and maximum throughput.
In the upcoming sections of this blog, we will delve into the various tips, tricks, and tweaks mentioned above. I will provide you with practical code snippets to demonstrate their effectiveness and discuss key scenarios where each of these techniques can significantly boost performance. By the end of this blog, you will have a comprehensive understanding of how to optimize insert operations in PostgreSQL and enhance your database's overall efficiency. Stay tuned for detailed insights and actionable advice!
Creating Unlogged Tables
Unlogged tables do not generate WAL records, which speeds up inserts significantly. Because the database does not have to write changes to the write-ahead log before applying them to the table, I/O overhead drops and data can be inserted faster. However, this comes with a trade-off: unlogged tables are not crash-safe. After a crash, their contents are truncated, because the changes were never logged and cannot be recovered. This makes unlogged tables particularly useful where performance is critical and durability is less of a concern, such as temporary data storage or staging tables for bulk data loads.
The tests below were performed on a 4 vCore server with 512 GB of storage.
The following snippet shows how an insert into an unlogged table compares with the same insert into a logged table.
CREATE UNLOGGED TABLE IF NOT EXISTS public.unlogged_table
( a1 integer, b1 integer, c1 integer, d1 integer, e1 text COLLATE pg_catalog."default", f1 text COLLATE pg_catalog."default", g1 date);

CREATE TABLE IF NOT EXISTS public.logged_table
( a1 integer, b1 integer, c1 integer, d1 integer, e1 text COLLATE pg_catalog."default", f1 text COLLATE pg_catalog."default", g1 date);
INSERT INTO unlogged_table select generate_series(1,20000000),
generate_series(1,20000000),
generate_series(1,20000000),
generate_series(1,20000000),
'abcdefghijklmnopqrstuvwxyz',
'abcdefghijklmnopqrstuvwxyz',
now();
INSERT 0 20000000
Time: 12870.964 ms (00:12.871)
INSERT INTO logged_table select generate_series(1,20000000),
generate_series(1,20000000),
generate_series(1,20000000),
generate_series(1,20000000),
'abcdefghijklmnopqrstuvwxyz',
'abcdefghijklmnopqrstuvwxyz',
now();
INSERT 0 20000000
Time: 41248.696 ms (00:41.249)
In this test, loading the data into the unlogged table (about 12.9 seconds) is roughly 3 times faster than loading the same data into the logged table (about 41.2 seconds).
Command:
CREATE UNLOGGED TABLE <tablename>;
ALTER TABLE <tablename> SET UNLOGGED;
Pros: Significantly improves insert performance (roughly 3X in the test above).
Cons: Unlogged tables are not crash-safe (they are truncated after a crash or unclean shutdown), and because their changes are not written to the WAL they are not replicated to read replicas or HA standbys. For applications requiring High Availability and data redundancy, unlogged tables are therefore unsuitable as the final destination for the data.
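If an unlogged table is used as a staging area, a common pattern is to load the data while the table is unlogged and then convert it to a logged table once the load has succeeded. A minimal sketch, using a placeholder staging table name:
-- Create the staging table without WAL logging (placeholder name)
CREATE UNLOGGED TABLE staging_orders (id integer, payload text);

-- ... bulk load into staging_orders here (COPY / batched INSERTs) ...

-- Once the load succeeds, make the table crash-safe and replicable.
-- Note: SET LOGGED rewrites the table contents into the WAL, so it takes
-- time proportional to the table size.
ALTER TABLE staging_orders SET LOGGED;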
Using COPY Command
The COPY command in PostgreSQL is highly efficient for bulk loading because it avoids the per-statement parsing, planning, and commit overhead of individual INSERT statements and streams data from a file (or from the client, with \copy) directly into the table. This significantly reduces server overhead, making it an ideal choice for bulk data loading operations. By minimizing the overhead associated with row-by-row inserts, COPY allows for faster and more efficient data insertion, which is particularly beneficial for large-scale data imports. This lets the database handle substantial data loads with lower latency and higher throughput.
psql -h xxx.postgres.database.azure.com -U xxx -d <dbname> -f copy_in.sql
The copy_in.sql script contains:
\copy example_table FROM '/path/to/your/file.csv' DELIMITER ',' CSV HEADER;
Command:
\copy <tablename> FROM '/path/to/your/file.csv' DELIMITER ',' CSV HEADER;
Pros: Faster and more efficient than row-by-row inserts, with no per-row round trips between client and server.
Cons: The entire file is loaded in a single transaction, so a failure part-way through rolls back the whole load and it must be restarted.
Batch Inserts vs. Single Commits
Batching multiple inserts in PostgreSQL can significantly enhance the performance of your database operations compared to executing single row commits. This is because batching reduces the overhead associated with each individual insert operation, such as transaction management and logging. When you batch inserts, you minimize the number of round trips between the application and the database, which can lead to substantial performance gains, especially when dealing with large volumes of data.
For example, consider a scenario where you need to insert 10,000 rows into a table. If you insert each row individually, the database must handle 10,000 separate transactions, each transaction with its own overhead. However, if you batch these inserts into groups of 1,000 rows, the database only needs to handle 10 transactions, significantly reducing the overhead and improving performance.
Here is a script that demonstrates how to batch multiple inserts in PostgreSQL:
-- Create a sample table; the examples below insert 3 rows into it in different ways
CREATE TABLE example_table (
id SERIAL PRIMARY KEY,
data TEXT);
1. Single Row Inserts with Individual Commits
INSERT INTO example_table (data) VALUES ('Row 1');
INSERT INTO example_table (data) VALUES ('Row 2');
INSERT INTO example_table (data) VALUES ('Row 3');
In this approach, each insert command is executed separately, and PostgreSQL performs a commit operation after each insert. This means that for every row inserted, the database has to go through the process of committing the transaction. This can add significant latency because the commit process involves writing to the transaction log, ensuring data durability, and updating the database state. As a result, this method can be less efficient and slower, especially when inserting a large number of rows.
2. Multiple Row Inserts in a Single Command
INSERT INTO example_table (data) VALUES ('Row 1'), ('Row 2'), ('Row 3');
In this approach, multiple rows are inserted in a single insert command, and PostgreSQL performs the commit operation only once for all the rows. This reduces the overall latency because the commit process is executed fewer times. By batching the inserts into a single command, the database can optimize the operation, reducing the overhead associated with multiple commits. This method is more efficient and faster compared to inserting single rows with individual commits.
3. Multiple Row Inserts within a Transaction
BEGIN;
INSERT INTO example_table (data) VALUES ('Row 1');
INSERT INTO example_table (data) VALUES ('Row 2');
INSERT INTO example_table (data) VALUES ('Row 3');
COMMIT;
In this approach, multiple insert commands are executed within a single transaction. The transaction begins with the BEGIN statement and ends with the COMMIT statement. All the insert commands are executed as part of the same transaction, and PostgreSQL performs the commit operation only once at the end of the transaction. This method combines the benefits of batching inserts and reducing the number of commits. It ensures data consistency and durability while minimizing the latency associated with multiple commits. This approach is also efficient and can be used to group multiple operations into a single transaction, improving overall performance.
For larger datasets, you can automate the batching process in your application code. Here is an example in Python using the psycopg library:
import psycopg

# Connect to the PostgreSQL database
conn = psycopg.connect(dbname="xxxx", user="xxxx", password="xxxx",
                       host="xxxx", port="5432")

# Create a cursor object
cur = conn.cursor()

# Sample data to insert
data = [("Row " + str(i),) for i in range(1, 10001)]

# Insert data in batches of 1,000 rows, committing once per batch
insert_query = "INSERT INTO example_table (data) VALUES (%s)"
batch_size = 1000
for start in range(0, len(data), batch_size):
    cur.executemany(insert_query, data[start:start + batch_size])
    conn.commit()  # one commit per 1,000-row batch instead of one per row

# Close the cursor and connection
cur.close()
conn.close()
In this Python script, we connect to the PostgreSQL database, prepare the data to be inserted, and then insert the data in batches of 1,000 rows. By committing each batch separately, we achieve better performance compared to inserting each row individually.
By batching multiple inserts, you can significantly improve the performance of your PostgreSQL database, making it more efficient and capable of handling large-scale data loads with ease.
Pros:
- Multi-row inserts are faster than single-row inserts.
- Batch inserts improve performance by 100X compared to single-row inserts.
Other factors that impact inserts and make them faster
Removing/Disabling Constraints/Triggers/Indexes
- Drop indexes, constraints, and triggers before the data load.
- Recreate them after the load completes (a sketch follows below).
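A minimal sketch of this pattern, assuming a hypothetical example_table with one index and user-defined triggers (object names are placeholders):
-- Before the bulk load: drop the index and disable user-defined triggers
-- (DISABLE TRIGGER ALL would also cover FK triggers but requires superuser rights)
DROP INDEX IF EXISTS idx_example_data;
ALTER TABLE example_table DISABLE TRIGGER USER;

-- ... perform the bulk load here (COPY / batched INSERTs) ...

-- After the load: re-enable the triggers and rebuild the index
ALTER TABLE example_table ENABLE TRIGGER USER;
CREATE INDEX idx_example_data ON example_table (data);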
Partitioning Table
- Improves insert speed by enabling parallel inserts into different partitions.
- Helpful for incremental/daily imports (see the sketch after this list).
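As a sketch, a range-partitioned table for daily imports might look like the following (table and column names are placeholders); separate sessions can then load different days in parallel:
-- Parent table partitioned by load date (placeholder names)
CREATE TABLE daily_imports (
    id      bigint,
    payload text,
    load_dt date NOT NULL
) PARTITION BY RANGE (load_dt);

-- One partition per day; each daily import targets its own partition
CREATE TABLE daily_imports_2024_01_01 PARTITION OF daily_imports
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');
CREATE TABLE daily_imports_2024_01_02 PARTITION OF daily_imports
    FOR VALUES FROM ('2024-01-02') TO ('2024-01-03');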
HA/Read Replica
- Disable HA during initial loads and enable it after completion.
- Create read replicas post initial load.
Tuning Server Parameters
- wal_buffers: Set to a higher value during bulk uploads.
- shared_buffers: Set to 40% of available RAM.
- max_wal_size: Increase during bulk uploads.
- autovacuum: Turn off during the load; run a manual VACUUM ANALYZE after the load.
- checkpoint_timeout: Increase during bulk uploads (an example of applying these settings follows below).
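As an illustration of how these settings might be applied on a self-managed server (the values below are placeholders, not recommendations; shared_buffers and wal_buffers only take effect after a restart, and managed services such as Azure Database for PostgreSQL expose them as server parameters rather than via ALTER SYSTEM):
-- Illustrative values only; size them for your own hardware and workload
ALTER SYSTEM SET wal_buffers = '64MB';          -- takes effect after a restart
ALTER SYSTEM SET shared_buffers = '16GB';       -- ~40% of RAM on a 40 GB server; restart required
ALTER SYSTEM SET max_wal_size = '16GB';         -- a reload is enough: SELECT pg_reload_conf();
ALTER SYSTEM SET checkpoint_timeout = '30min';  -- a reload is enough

-- Disable autovacuum only for the table being loaded, then analyze it afterwards
ALTER TABLE example_table SET (autovacuum_enabled = false);
-- ... bulk load ...
ALTER TABLE example_table RESET (autovacuum_enabled);
VACUUM ANALYZE example_table;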
Conclusion
By implementing these optimization techniques, you can significantly enhance the performance of your PostgreSQL database. Whether it's through the use of unlogged tables, the COPY command, batch inserts, or tuning server parameters, each method offers unique benefits that contribute to faster data insertion rates and overall database efficiency.