From: <bill(dot)poole(at)ymail(dot)com>
To: "'Laurenz Albe'" <laurenz(dot)albe(at)cybertec(dot)at>, <pgsql-performance(at)postgresql(dot)org>
Subject: RE: Bulk DML performance
Date: 2025-03-13 13:04:53
Message-ID: 00b001db9418$85c7c770$91575650$@ymail.com
Lists: pgsql-performance
> PostgreSQL has a lot of overhead per row.
Okay, thanks. I'm not actually too worried about this since in my scenario, each row is about 1.5 kB, so the % overhead is negligible.
> It is probably not the lookup, but the *modification* of the index that is slow.
Yes, that makes sense for the original 3 million inserts, but when I perform the update of the 3 million rows, the index doesn't change - they are all HOT updates.
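One way to double-check that claim is the HOT-update counter in the statistics views; a minimal query, assuming the table here is simply named "test":

SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'test';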
> Then the best you can do is to use COPY rather than INSERT. It will perform better (but [not] vastly better).
I need to perform a merge (INSERT ... ON CONFLICT ... DO UPDATE) on the data, so sadly I cannot use COPY.
I have discovered that, for some reason, performing the original insert without the ON CONFLICT clause is twice as fast as performing it with an ON CONFLICT ... DO UPDATE clause, completing in 4 seconds instead of 8. That seems strange to me: the primary key carries a unique constraint, so each inserted value has to be checked against the index in either case, and I wouldn't have thought ON CONFLICT adds any extra work.
INSERT INTO test (id, text1)
SELECT generate_series, 'x'
FROM generate_series(1, 3000000)
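For reference, the ON CONFLICT form being timed here is along these lines (the exact DO UPDATE assignment is illustrative, not taken from the original statement):

INSERT INTO test (id, text1)
SELECT generate_series, 'x'
FROM generate_series(1, 3000000)
ON CONFLICT (id) DO UPDATE
SET text1 = EXCLUDED.text1;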
The plain insert still takes 4 seconds even when adding a clause to avoid inserting duplicates.
INSERT INTO test (id, text1)
SELECT generate_series, 'x'
FROM generate_series(1, 3000000)
WHERE NOT EXISTS (
SELECT 1
FROM test
WHERE id = generate_series
)
Furthermore, I have found that performing an UPDATE rather than an INSERT ... ON CONFLICT ... DO UPDATE is twice as slow, completing in 16 seconds instead of 14 seconds.
UPDATE test
SET text1 = 'x'
FROM generate_series(1, 3000000)
WHERE test.id = generate_series
This also means that updating 3 million rows takes 4x longer than inserting those rows. Do we expect updates to be 4x slower than inserts?
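Should it help pin down where that time goes, the buffer-level breakdown suggested below can be captured with something like this (same assumptions as the statements above):

EXPLAIN (ANALYZE, BUFFERS)
UPDATE test
SET text1 = 'x'
FROM generate_series(1, 3000000)
WHERE test.id = generate_series;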
Regards,
Bill
-----Original Message-----
From: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Sent: Thursday, 13 March 2025 7:28 PM
To: bill(dot)poole(at)ymail(dot)com; pgsql-performance(at)postgresql(dot)org
Subject: Re: Bulk DML performance
On Thu, 2025-03-13 at 18:13 +0800, bill(dot)poole(at)ymail(dot)com wrote:
>
> it is noteworthy that inserting 27 MB of data into a newly created table creates
> 191 MB of data including the index and 127 MB of data excluding the index.
PostgreSQL has a lot of overhead per row.
>
> Can you help me understand why performing 3 million lookups on a
> b-tree index with all pages cached in memory takes so long?
It is probably not the lookup, but the *modification* of the index that is slow.
>
> It seems like deleting 3 million rows identified by 3 million IDs
> should be faster than updating 3 million rows (also identified by 3 million IDs).
It should be, yes.
To find out where the time is spent, use EXPLAIN (ANALYZE, BUFFERS) on the statement.
> > To get better performance, the best I can think of is to parallelize
> > loading the data until you saturate CPU, disk or hit internal contention in the database.
>
> Sadly, I cannot do that because I need all rows to be inserted in a
> single database transaction, which I cannot do over multiple database connections.
Then the best you can do is to use COPY rather than INSERT.
It will perform better (but now vastly better).
Yours,
Laurenz Albe