Re: Bulk DML performance

From: Renan Alves Fonseca <renanfonseca(at)gmail(dot)com>
To: bill(dot)poole(at)ymail(dot)com, pgsql-performance(at)postgresql(dot)org
Subject: Re: Bulk DML performance
Date: 2025-03-17 17:45:50
Message-ID: CAN_p2QjrVE1bUW5qoewmX-5jckJiUK=gCigkgdorgZpJrY-f8g@mail.gmail.com
Lists: pgsql-performance

Hi,
Here are some observations.

On Mon, Mar 17, 2025 at 09:19, <bill(dot)poole(at)ymail(dot)com> wrote:

> > PostgreSQL has a lot of overhead per row.
>
> Okay, thanks. I'm not actually too worried about this since in my
> scenario, each row is about 1.5 kB, so the % overhead is negligible.
>
> > It is probably not the lookup, but the *modification* of the index that
> is slow.
>
> Yes that makes sense for the original 3 million inserts, but when I
> perform the update of the 3 million rows, the index doesn't change - they
> are all HOT updates.
>

Using "perf" I can see that the overhead is indeed due to index lookup when
we do HOT updates.
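
If you want to confirm that the updates really are HOT, the statistics view
can tell you. A small sketch, assuming the table from the examples below is
named "test":

    -- n_tup_hot_upd should track n_tup_upd if the updates are HOT
    SELECT n_tup_upd, n_tup_hot_upd
    FROM pg_stat_user_tables
    WHERE relname = 'test';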

> > Then the best you can do is to use COPY rather than INSERT. It will
> perform better (but [not] vastly better).
>
> I need to perform a merge (INSERT ... ON CONFLICT ... DO UPDATE) on the
> data, so sadly I cannot use COPY.
>
> I have discovered that for some reason, performing the original insert
> without the ON CONFLICT statement is twice as fast as performing the
> original insert with an ON CONFLICT ... DO UPDATE clause, completing in 4
> seconds instead of 8. That seems strange to me because I wouldn't have
> thought the ON CONFLICT clause would be doing any additional work: a unique
> constraint is on the primary key, so each inserted value would need to be
> checked in either case.
>
>
In the plain INSERT case, there is no conflict handling for each row; if a
duplicate is inserted we simply run into an error and the whole operation is
aborted.
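
For reference, the upsert variant being compared was presumably something
like this (a sketch only; the exact query is not shown in the thread, and id
is assumed to be the primary key of test):

    -- sketch of the INSERT ... ON CONFLICT ... DO UPDATE variant discussed above
    INSERT INTO test (id, text1)
    SELECT generate_series, 'x'
    FROM generate_series(1, 3000000)
    ON CONFLICT (id) DO UPDATE
    SET text1 = EXCLUDED.text1;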

> INSERT INTO test (id, text1)
> SELECT generate_series, 'x'
> FROM generate_series(1, 3000000)
>
> It remains at 4 seconds even when adding a clause to not insert duplicates.
>
> INSERT INTO test (id, text1)
> SELECT generate_series, 'x'
> FROM generate_series(1, 3000000)
> WHERE NOT EXISTS (
> SELECT 1
> FROM test
> WHERE id = generate_series
> )
>
>
In this case, we are not checking for duplicates inside the input dataset
itself. If you can guarantee at the application level that there are none,
this seems like a good speedup. Perhaps the MERGE command is also worth a
try here.
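
Something along these lines, only a sketch, assuming PostgreSQL 15 or later
and the same test table as above:

    -- sketch of a MERGE equivalent of the upsert; MERGE requires PostgreSQL 15+
    MERGE INTO test AS t
    USING (SELECT generate_series AS id
           FROM generate_series(1, 3000000)) AS s
    ON t.id = s.id
    WHEN MATCHED THEN
        UPDATE SET text1 = 'x'
    WHEN NOT MATCHED THEN
        INSERT (id, text1) VALUES (s.id, 'x');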

> Furthermore, I have found that performing an UPDATE rather than an INSERT
> ... ON CONFLICT ... DO UPDATE is twice as slow, completing in 16 seconds
> instead of 14 seconds.
>
> UPDATE test
> SET text1 = 'x'
> FROM generate_series(1, 3000000)
> WHERE test.id = generate_series
>
> This also now means that updating 3 million rows takes 4x longer than
> inserting those rows. Do we expect updates to be 4x slower than inserts?
>
>
It is not the UPDATE itself that is slower; it is the join against
generate_series in the WHERE clause that makes it slower. Try:

UPDATE test SET text1 = 'x';

In my tests, the update of non-indexed columns is slightly faster than an
insert.
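
To see where the time goes, it may help to compare the plans of the two
forms. A rough sketch, reusing the table and column names from the examples
above:

    -- note: EXPLAIN ANALYZE actually executes the update;
    -- wrap in BEGIN/ROLLBACK if you want to discard the changes
    EXPLAIN (ANALYZE, BUFFERS)
    UPDATE test SET text1 = 'x';

    EXPLAIN (ANALYZE, BUFFERS)
    UPDATE test
    SET text1 = 'x'
    FROM generate_series(1, 3000000) AS s(id)
    WHERE test.id = s.id;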

Regards,
Renan Fonseca
