From: Renan Alves Fonseca <renanfonseca(at)gmail(dot)com>
To: bill(dot)poole(at)ymail(dot)com, pgsql-performance(at)postgresql(dot)org
Subject: Re: Bulk DML performance
Date: 2025-03-17 17:45:50
Message-ID: CAN_p2QjrVE1bUW5qoewmX-5jckJiUK=gCigkgdorgZpJrY-f8g@mail.gmail.com
Lists: pgsql-performance
Hi,
Here are some observations.
On Mon, Mar 17, 2025 at 09:19, <bill(dot)poole(at)ymail(dot)com> wrote:
> > PostgreSQL has a lot of overhead per row.
>
> Okay, thanks. I'm not actually too worried about this since in my
> scenario, each row is about 1.5 kB, so the % overhead is negligible.
>
> > It is probably not the lookup, but the *modification* of the index that
> is slow.
>
> Yes that makes sense for the original 3 million inserts, but when I
> perform the update of the 3 million rows, the index doesn't change - they
> are all HOT updates.
>
Using "perf" I can see that the overhead is indeed due to index lookup when
we do HOT updates.
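As a side check, pg_stat_user_tables shows whether the updates really are HOT;
a minimal sketch, assuming the table is named "test":

SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'test';
-- n_tup_hot_upd should grow in step with n_tup_upd if the updates are HOT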
> > Then the best you can do is to use COPY rather than INSERT. It will
> perform better (but [not] vastly better).
>
> I need to perform a merge (INSERT ... ON CONFLICT ... DO UPDATE) on the
> data, so sadly I cannot use COPY.
>
> I have discovered that for some reason, performing the original insert
> without the ON CONFLICT statement is twice as fast as performing the
> original insert with an ON CONFLICT ... DO UPDATE clause, completing in 4
> seconds instead of 8. That seems strange to me because I wouldn't have
> thought it would be doing any additional work since a unique constraint is
> on the primary key, so each inserted value would need to be checked in
> either case, and there is no extra work to be done in either case.
>
>
In the plain INSERT case, there is no per-row conflict handling: inserting a
duplicate simply raises an error that aborts the whole operation.
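For illustration, a minimal sketch of the difference in behavior, assuming the
"test" table from this thread with a primary key on "id":

INSERT INTO test (id, text1) VALUES (1, 'x');
-- raises a unique_violation and aborts the whole statement if id 1 already exists

INSERT INTO test (id, text1) VALUES (1, 'x')
ON CONFLICT (id) DO UPDATE SET text1 = EXCLUDED.text1;
-- resolves the conflict per row instead of failing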
> INSERT INTO test (id, text1)
> SELECT generate_series, 'x'
> FROM generate_series(1, 3000000)
>
> It remains 4 seconds even when adding a clause to not insert duplicates.
>
> INSERT INTO test (id, text1)
> SELECT generate_series, 'x'
> FROM generate_series(1, 3000000)
> WHERE NOT EXISTS (
> SELECT 1
> FROM test
> WHERE id = generate_series
> )
>
>
In this case, we are not checking duplicates inside the input dataset. If
you can guarantee, at the application level, that there are no duplicates,
this seems like a good speedup. Perhaps the MERGE command would be worth
trying here as well; a sketch follows.
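For reference, a minimal sketch of the same upsert written with MERGE
(PostgreSQL 15 or later), again assuming the "test" table with primary key
"id"; note that the source query must not contain duplicate ids:

MERGE INTO test AS t
USING (
    SELECT generate_series AS id, 'x' AS text1
    FROM generate_series(1, 3000000)
) AS s
ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET text1 = s.text1
WHEN NOT MATCHED THEN
    INSERT (id, text1) VALUES (s.id, s.text1);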
> Furthermore, I have found that performing an UPDATE rather than an INSERT
> ... ON CONFLICT ... DO UPDATE is twice as slow, completing in 16 seconds
> instead of 14 seconds.
>
> UPDATE test
> SET text1 = 'x'
> FROM generate_series(1, 3000000)
> WHERE test.id = generate_series
>
> This also now means that updating 3 million rows takes 4x longer than
> inserting those rows. Do we expect updates to be 4x slower than inserts?
>
>
It is not the update itself that is slower; it is the attached WHERE clause
(the join against generate_series) that makes it slower. Try:
UPDATE test SET text1='x';
In my tests, the update of non-indexed columns is slightly faster than an
insert.
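To see where the time goes, a sketch comparing the two forms with EXPLAIN
(note that EXPLAIN ANALYZE really executes the UPDATE, hence the ROLLBACK):

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE test SET text1 = 'x';
ROLLBACK;

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE test
SET text1 = 'x'
FROM generate_series(1, 3000000) AS g(n)
WHERE test.id = g.n;
ROLLBACK;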
Regards,
Renan Fonseca