From: | <bill(dot)poole(at)ymail(dot)com> |
---|---|
To: | "'Renan Alves Fonseca'" <renanfonseca(at)gmail(dot)com> |
Cc: | <pgsql-performance(at)postgresql(dot)org> |
Subject: | RE: Bulk DML performance |
Date: | 2025-03-14 01:20:05 |
Message-ID: | 00cb01db947f$39bc49c0$ad34dd40$@ymail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Thanks Renan! Reducing the fill factor has improved my update performance and I am now seeing the same time for updates as inserts.
I look forward to any advancements PostgreSQL may make in the future to improve the performance of bulk DML operations. It would be amazing if they could be parallelized in the future.
Best,
Bill
From: Renan Alves Fonseca <renanfonseca(at)gmail(dot)com>
Sent: Friday, 14 March 2025 5:25 AM
To: bill(dot)poole(at)ymail(dot)com
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Bulk DML performance
Hello,
Regarding the additional time for UPDATE, you can try the following:
CREATE TABLE test3 (
id bigint PRIMARY KEY,
text1 text
) WITH (fillfactor=30);
See: https://www.postgresql.org/docs/17/storage-hot.html
My local test gives me almost the same time for INSERT (first insert) and UPDATES (following upserts).
Regarding the overall problem, there is always room for improvement. I did a quick test with partitions, and I found out that Postgres will not parallelize the upserts for us. One solution could be to partition the records at the application level, creating one connection per partition. On the DB side, the partitions can be implemented as standard tables (using a union view on top of them) or actual partitions of a main table. However, this solution does not strictly respect the "one single transaction'"constraint...
Regards,
Renan Fonseca
Em qui., 13 de mar. de 2025 às 08:40, <bill(dot)poole(at)ymail(dot)com <mailto:bill(dot)poole(at)ymail(dot)com> > escreveu:
Hello! I’m building a system that needs to insert/update batches of millions of rows (using INSERT .. ON CONFLICT (…) DO UPDATE) in a single database transaction, where each row is about 1.5 kB. The system produces about 3 million rows (about 4.5 GB) of data in about 5 seconds, but PostgreSQL takes about 35 seconds to insert that data and about 55 seconds to update that data. This is both on my local dev machine as well as on a large AWS Aurora PostgreSQL instance (db.r8g.16xlarge with 64 vCPUs, 512 GB RAM and 30 Gbps).
The following INSERT .. ON CONFLICT (…) DO UPDATE statement inserts/updates 3 million rows with only 9 bytes per row and takes about 8 seconds on first run (to insert the rows) and about 14 seconds on subsequent runs (to update the rows), but is only inserting 27 MB of data (3 million rows with 9 bytes per row); although after the first run, SELECT pg_size_pretty(pg_total_relation_size('test')) reports the table size as 191 MB and after the second run reports the table size as 382 MB (adding another 191 MB).
CREATE TABLE test (
id bigint PRIMARY KEY,
text1 text
);
INSERT INTO test (id, text1)
SELECT generate_series, 'x'
FROM generate_series(1, 3000000)
ON CONFLICT (id) DO UPDATE
SET text1 = 'x';
If PostgreSQL is writing 191 MB on the first run and 382 MB on each subsequent run, then PostgreSQL is only writing about 28 MB/s. Although PostgreSQL is also able to write about 4.5 GB in about 35 seconds (as stated above), which is about 128 MB/s, so it seems the performance constraint depends on the number of rows inserted more than the size of each row.
Furthermore, deleting the rows takes about 18 seconds to perform (about 4 seconds longer than the time taken to update the rows):
DELETE FROM test
WHERE id in (
SELECT * FROM generate_series(1, 3000000)
)
It seems like it should be possible to do better than this on modern hardware, but I don’t have enough knowledge of the inner workings of PostgreSQL to know whether my instinct is correct on this, so I thought I’d raise the question with the experts.
Thanks!
Bill
From | Date | Subject | |
---|---|---|---|
Next Message | Álvaro Herrera | 2025-03-17 10:53:37 | Re: Bulk DML performance |
Previous Message | Renan Alves Fonseca | 2025-03-13 21:25:10 | Re: Bulk DML performance |