Re: Bulk DML performance

From: Greg Sabino Mullane <htamfids(at)gmail(dot)com>
To: bill(dot)poole(at)ymail(dot)com
Cc: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Bulk DML performance
Date: 2025-03-17 14:14:14
Message-ID: CAKAnmmKvxMG7t8pv8FZPkrgU9Sh8V=6Y8HaK4QymK-ZA+i+mcA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Mon, Mar 17, 2025 at 4:19 AM <bill(dot)poole(at)ymail(dot)com> wrote:

Can you help me understand why performing 3 million lookups on a b-tree
> index with all pages cached in memory takes so long?

It's not the lookup, it's writing the 3 million rows (and in this
particular upsert case, deleting 3 million, then inserting 3 million)

> Well, that is not a great statement.
>
> Understood, but I was highlighting the performance of deleting 3 million
> rows identified by 3 million IDs, as opposed to deleting rows in a given
> range of IDs or deleting the whole table. It seems like deleting 3 million
> rows identified by 3 million IDs should be faster than updating 3 million
> rows (also identified by 3 million IDs).
>

It should indeed be faster. But keep in mind a delete immediately after
that upsert now has twice as many rows to walk through as the upsert did.
Also, a subselect like your original query can lead to a large nested loop.
Try another variant such as this one:

with ids as (select x from generate_series(1, 3_000_000) x) delete from
test using ids where id=x;

> With the table as it is you won't get better performance if you want the
> features that a relational database provides.
>
> Sorry to hear that. I had hoped there was room to improve this performance.
>

If pure upsert performance is the goal, remove the unique index and store a
timestamp along with your inserted data. Back to pure inserts again! (and a
few new downsides). When querying, only use the version of the row with the
highest timestamp.

Other random ideas:

* remove or consolidate columns you don't need, or can store in another
table
* pre-filter the rows in the app, so you can do a pure-insert (or COPY) of
known-to-be-new rows, then upsert the remaining rows
* use the smallest data types possible
* avoid or minimize toasted values
* pack your columns efficiently (e.g. reorder for 8 byte blocks)
* put the indexes on a ram-based tablespace
* boost your work_mem (for things like giant deletes which build hashes)
* revisit unlogged tables and partitioning

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message kimaidou 2025-03-17 14:41:07 Bad perf when using DECLARE CURSOR on big table
Previous Message Álvaro Herrera 2025-03-17 10:53:37 Re: Bulk DML performance