RE: Bulk DML performance

From: <bill(dot)poole(at)ymail(dot)com>
To: "'Laurenz Albe'" <laurenz(dot)albe(at)cybertec(dot)at>, <pgsql-performance(at)postgresql(dot)org>
Subject: RE: Bulk DML performance
Date: 2025-03-13 10:13:08
Message-ID: 00a401db9400$86c240f0$9446c2d0$@ymail.com
Lists: pgsql-performance

Thanks Laurenz for taking the time to look at this.

> That is unavoidable, because PostgreSQL adds a new version for each row to the table. To avoid that kind of bloat, you'd have to update in smaller batches and run VACUUM between those to free the "dead" row versions.

Yes, I would expect the on-disk size to roughly double after executing an update, but it is noteworthy that inserting 27 MB of data into a newly created table produces 191 MB on disk including the index and 127 MB excluding the index.
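
For reference, here is a minimal sketch of the batching approach you describe, using the test table from below; the batch size of 500,000 is an arbitrary choice:

-- Upsert in smaller batches, vacuuming between them so dead row
-- versions can be reused instead of growing the table.
INSERT INTO test (id, text1)
SELECT generate_series, 'x'
FROM generate_series(1, 500000)
ON CONFLICT (id) DO UPDATE SET text1 = 'x';

VACUUM test;

INSERT INTO test (id, text1)
SELECT generate_series, 'x'
FROM generate_series(500001, 1000000)
ON CONFLICT (id) DO UPDATE SET text1 = 'x';

VACUUM test;

-- ... and so on for the remaining batches.

Note that VACUUM cannot run inside a transaction block, so this approach is at odds with the single-transaction requirement I mention below.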

> It is the index maintenance that is killing you.

Yes, but as you say I need the index to perform the INSERT ... ON CONFLICT (...) DO UPDATE. Can you help me understand why performing 3 million lookups on a b-tree index with all pages cached in memory takes so long?
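
One way I can try to see where the time goes is to run the statement under EXPLAIN and look at the buffer and WAL activity caused by index maintenance (the WAL option assumes PostgreSQL 13 or later):

EXPLAIN (ANALYZE, BUFFERS, WAL)
INSERT INTO test (id, text1)
SELECT generate_series, 'x'
FROM generate_series(1, 3000000)
ON CONFLICT (id) DO UPDATE
SET text1 = 'x';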

> If you don't need crash safety, you could use UNLOGGED tables and be somewhat faster (5.8 seconds for the initial INSERT here).

Sadly, that is not an option for me. I'm building a production system.

> Well, that is not a great statement.

Understood, but I was highlighting the cost of deleting 3 million rows identified individually by ID, as opposed to deleting a contiguous range of IDs or deleting the whole table. It seems like deleting 3 million rows identified by their IDs should be faster than updating the same 3 million rows (also identified by their IDs).
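
To make concrete what I mean by deleting rows identified individually by ID: in the real workload the IDs would arrive from the application as a parameter rather than from generate_series, so the statement would look something like the following sketch (with $1 standing in for a bigint array of IDs):

DELETE FROM test
WHERE id = ANY ($1::bigint[]);

-- or equivalently, joining against the unnested array:
DELETE FROM test AS t
USING unnest($1::bigint[]) AS ids(id)
WHERE t.id = ids.id;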

> With the table as it is you won't get better performance if you want the features that a relational database provides.

Sorry to hear that. I had hoped there was room to improve this performance.

> To get better performance, the best I can think of is to parallelize loading the data until you saturate CPU, disk or hit internal contention in the database.

Sadly, I cannot do that because I need all rows to be inserted in a single database transaction, which I cannot do over multiple database connections.
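
The closest I can get while staying in a single transaction is to bulk-load each batch into a staging table and upsert from it in one statement. This is only a sketch (the staging table and the COPY source are placeholders), and it still pays the same index-maintenance cost on test; it mainly cuts down client round trips:

BEGIN;

CREATE TEMPORARY TABLE staging (
    id    bigint,
    text1 text
) ON COMMIT DROP;

-- bulk-load the batch from the client
COPY staging (id, text1) FROM STDIN;

INSERT INTO test (id, text1)
SELECT id, text1
FROM staging
ON CONFLICT (id) DO UPDATE
SET text1 = EXCLUDED.text1;

COMMIT;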

Regards,
Bill

-----Original Message-----
From: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Sent: Thursday, 13 March 2025 5:21 PM
To: bill(dot)poole(at)ymail(dot)com; pgsql-performance(at)postgresql(dot)org
Subject: Re: Bulk DML performance

On Thu, 2025-03-13 at 12:05 +0800, bill(dot)poole(at)ymail(dot)com wrote:

> The following INSERT .. ON CONFLICT (…) DO UPDATE statement
> inserts/updates
> 3 million rows with only 9 bytes per row and takes about 8 seconds on
> first run (to insert the rows) and about 14 seconds on subsequent runs
> (to update the rows),

I can confirm these times on my not so new laptop with local NVME.
That's the time it takes if you have an index on the table and want to be crash safe.

> but is only inserting 27 MB of data (3 million rows with 9
> bytes per row); although after the first run, SELECT
> pg_size_pretty(pg_total_relation_size('test')) reports the table size
> as 191 MB and after the second run reports the table size as 382 MB
> (adding another 191 MB).

That is unavoidable, because PostgreSQL adds a new version for each row to the table. To avoid that kind of bloat, you'd have to update in smaller batches and run VACUUM between those to free the "dead" row versions.

> CREATE TABLE test (
>     id bigint PRIMARY KEY,
>     text1 text
> );
>
> INSERT INTO test (id, text1)
> SELECT generate_series, 'x'
> FROM generate_series(1, 3000000)
> ON CONFLICT (id) DO UPDATE
> SET text1 = 'x';
>
> If PostgreSQL is writing 191 MB on the first run and 382 MB on each
> subsequent run, then PostgreSQL is only writing about 28 MB/s.
> Although PostgreSQL is also able to write about 4.5 GB in about 35
> seconds (as stated above), which is about 128 MB/s, so it seems the
> performance constraint depends on the number of rows inserted more than the size of each row.

It is the index maintenance that is killing you.
Without the primary key, the first insert takes under 1.8 seconds here.
But you need the primary key index if you want to use INSERT ... ON CONFLICT.

The update has to do even more work, so it is slower.

If you don't need crash safety, you could use UNLOGGED tables and be somewhat faster (5.8 seconds for the initial INSERT here).

Essentially, the moderate performance is the price you are paying for data integrity (crash safety) and consistency (primary key).

> Furthermore, deleting the rows takes about 18 seconds to perform
> (about 4 seconds longer than the time taken to update the rows):
>
> DELETE FROM test
> WHERE id IN (
>     SELECT * FROM generate_series(1, 3000000)
> );

Well, that is not a great statement.

The following takes only 1.6 seconds here:

DELETE FROM test WHERE id BETWEEN 1 AND 3000000;

And if you want to delete all rows, TRUNCATE is very, very fast.

> It seems like it should be possible to do better than this on modern
> hardware, but I don’t have enough knowledge of the inner workings of
> PostgreSQL to know whether my instinct is correct on this, so I
> thought I’d raise the question with the experts.

With the table as it is you won't get better performance if you want the features that a relational database provides.

To get better performance, the best I can think of is to parallelize loading the data until you saturate CPU, disk or hit internal contention in the database.

Yours,
Laurenz Albe
