Re: Batch update query performance

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: Hans Drexler *EXTERN* <Hans(dot)Drexler(at)HumanInference(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Batch update query performance
Date: 2014-04-07 12:53:33
Message-ID: 53429FCD.1030109@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On 04/07/2014 03:06 PM, Albe Laurenz wrote:
> Hans Drexler wrote:
>> Postgres needs close to 50
>> minutes to process the same query on the same data. Sometimes, Postgres
>> needs more than 2 hours.
>>
>> The application performs an update query on every row
>> of the table. The exact SQL of this query is:
>>
>> update t67cdi_nl_cmp_descr set is_grc_002='Y'
>
> [...]
>
>> We tried removing all indexes. That reduces the runtime to ~3 minutes.
>> When we start to put indexes back, the run time of the query increases
>> again with each index added.
>
> Do I read that right that the duration of the update is reduced from
> 50 or 120 minutes to 3 when you drop all the indexes?

If that's true, you might be able to drop and re-create the indexes as
part of the same transaction, and come out ahead. DROP/CREATE INDEX is
transactional in PostgreSQL, so you can do:

BEGIN;
DROP INDEX index1;
...
DROP INDEX index15;
UPDATE t67cdi_nl_cmp_descr SET is_grc_002='Y'
CREATE INDEX index1 ...;
...
CREATE INDEX index15 ...;
COMMIT;

This will take an AccessExclusiveLock on the table, though, so the table
will be inaccessible to concurrent queries while it's running.

Actually, since you are effectively rewriting the table anyway, you
could create a new table with same structure, insert all rows from the
old table, with is_grc_002 set to 'Y', drop the old table, and rename
the new table into its place.

Do all the rows really need to be updated? If some of the rows already
have is_grc_002='Y', you can avoid rewriting those rows by adding a
WHERE-clause: WHERE NOT is_grc_002='Y' OR is_grc_002 IS NULL.

You could also play tricks with partitioning. Don't store the is_grc_002
row in the table at all. Instead, create two tables, one for the rows
that implicitly have is_grc_002='Y' and another for all the other rows.
Then create a view on the union of the two tables, which adds the
is_grc_002 column. Instead of doing a full-table update, you can just
alter the view to display is_grc_002='Y' for both tables (and add a new
table to hold new rows with is_grc_002<>'Y').

>> Hypothesis
>> we have tried many things to solve this problem ourselves, but to no
>> avail so far. Our hypothesis is that
>> the Postgres creates new records for all rows and then needs to update
>> all 15 indexes to make them point to the new rows. There does not seem
>> to be a way to avoid that.
>>
>> Question:
>> - Is our hypothesis correct?
>> - Can the forum please advise us on possible ways to make the query
>> faster?
>
> Your hypothesis may be correct.

Yeah, sounds about right. A full-table UPDATE like that is pretty much
the worst-case scenario for PostgreSQL's MVCC system, unfortunately.

- Heikki

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Graeme B. Bell 2014-04-07 12:59:25 Re: PGSQL 9.3 - Materialized View - multithreading
Previous Message Nicolas Paris 2014-04-07 12:49:12 Re: PGSQL 9.3 - Materialized View - multithreading