Re: Optimizing bulk update performance

From: Yang Zhang <yanghatespam(at)gmail(dot)com>
To: Misa Simic <misa(dot)simic(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Optimizing bulk update performance
Date: 2013-04-27 09:54:33
Message-ID: CAKxBDU8-MbW-ObseLc=ELGnXziu2-a3=jEog3xmWicCP66bupw@mail.gmail.com
Lists: pgsql-general

On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa(dot)simic(at)gmail(dot)com> wrote:
> Hi,
>
> If the dataset for the update is large...
>
> Maybe the best approach would be:
>
> From the client machine, instead of sending UPDATE statements with data,
> export the data to a file ready for the COPY command
> Transfer the file to the server where PG is running
> Make a pl/pgsql function which:
>
> Creates a temp table
> Copies into the temp table from the file
>
> Updates the original table with the values in temp
>
> UPDATE foo
> SET col1 = bar.col1
> FROM bar
> WHERE foo.id = bar.id
>
> You don't need to do delete/insert - if you have just update commands...
>
> From the client, once the file is transferred, call your import function on
> the server
>
> Optionally you can run VACUUM ANALYZE after the bulk operation...
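
For concreteness, the server-side function in the quoted recipe might look roughly like this (a sketch only - the function name, file path handling, and column names are made up, and assume only `col1`/`col2` are being overwritten):

```sql
-- Hypothetical import function; COPY here runs server-side, so the file
-- must be readable by the postgres server process.
CREATE OR REPLACE FUNCTION import_updates(path text) RETURNS void AS $$
BEGIN
    -- Staging table holding just the key plus the columns being updated.
    CREATE TEMP TABLE bar (id text PRIMARY KEY, col1 text, col2 text)
        ON COMMIT DROP;

    EXECUTE format('COPY bar FROM %L', path);

    ANALYZE bar;  -- give the planner statistics on the staged rows

    UPDATE foo
    SET col1 = bar.col1,
        col2 = bar.col2
    FROM bar
    WHERE foo.id = bar.id;
END;
$$ LANGUAGE plpgsql;
```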

But wouldn't a bulk UPDATE still touch many existing pages (say, 20% of
them, scattered around the table) to mark the old row versions dead (per
MVCC)? I guess it comes down to: will PG be smart enough to mark dead
rows in largely sequential scans (rather than, say, jumping around in
whatever order rows from foo are yielded by the above join)?

In other words, when considering the alternative of:

CREATE TABLE newfoo AS
SELECT * FROM bar
UNION
SELECT * FROM foo
WHERE id NOT IN (SELECT id FROM bar);

Wouldn't this alternative be faster?
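
If the rewrite route does win, the swap itself would need a few more steps, since CREATE TABLE AS copies no indexes or constraints. A sketch (one deliberate change from the query above: UNION ALL plus NOT EXISTS instead of UNION plus NOT IN - the result is the same here because id is the primary key, but it avoids the deduplicating sort and the NOT IN null pitfall):

```sql
-- Sketch: build the merged table, then swap it in (names hypothetical).
BEGIN;

CREATE TABLE newfoo AS
SELECT * FROM bar
UNION ALL
SELECT * FROM foo
WHERE NOT EXISTS (SELECT 1 FROM bar WHERE bar.id = foo.id);

ALTER TABLE foo RENAME TO oldfoo;
ALTER TABLE newfoo RENAME TO foo;

-- CREATE TABLE AS copies no indexes or constraints; rebuild them.
ALTER TABLE foo ADD PRIMARY KEY (id);
-- ...plus any secondary indexes the original foo had.

DROP TABLE oldfoo;
COMMIT;
```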

>
> Kind regards,
>
> Misa
>
>
>
> On Saturday, April 27, 2013, Yang Zhang wrote:
>>
>> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> > Yang Zhang <yanghatespam(at)gmail(dot)com> writes:
>> >> It currently takes up to 24h for us to run a large set of UPDATE
>> >> statements on a database, which are of the form:
>> >
>> >> UPDATE table SET field1 = constant1, field2 = constant2, ... WHERE
>> >> id = constid
>> >
>> >> (We're just overwriting fields of objects identified by ID.)
>> >
>> > Forgive the obvious question, but you do have an index on "id", right?
>> > Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?
>>
>> Totally valid question. That is the primary key with its own index.
>> Yes, we verified that EXPLAIN shows it just uses a simple index scan.
>> Each individual query runs reasonably quickly (we can run several
>> dozen such statements per second).
>>
>> >
>> >> The tables have handfuls of indices each and no foreign key
>> >> constraints.
>> >
>> > How much is a "handful"?
>>
>> The table with the largest volume of updates (our bottleneck) has four
>> indexes:
>>
>> "account_pkey" PRIMARY KEY, btree (id)
>> "account_createddate" btree (createddate)
>> "account_id_prefix" btree (id text_pattern_ops)
>> "account_recordtypeid" btree (recordtypeid)
>>
>> >
>> >> It takes 2h to import a `pg_dump` of the entire DB. This seems like a
>> >> baseline we should reasonably target.
>> >
>> > Well, maybe. You didn't say what percentage of the DB you're updating.
>>
>> It can be 10-50% of rows changed - a large portion.
>>
>> >
>> > But the thing that comes to mind here is that you're probably incurring
>> > a network round trip for each row, and maybe a query-planning round as
>> > well, so you really can't expect that this is going to be anywhere near
>> > as efficient as a bulk load operation. You could presumably get rid of
>> > the planner overhead by using a prepared statement. Cutting the network
>> > overhead is going to require a bit more ingenuity --- could you move
>> > some logic into a stored procedure, perhaps, so that one command from
>> > the client is sufficient to update multiple rows?
>>
>> You're right, we're only sequentially issuing (unprepared) UPDATEs.
>>
>> If we ship many UPDATE statements per call to our DB API's execution
>> function (we're using Python's psycopg2 if that matters, but I think
>> that just binds libpq), would that avoid the network round trip per
>> statement?
>>
>> If not, what if we use anonymous procedures (DO) to run multiple
>> UPDATE statements?
>>
>> Finally, we could use the technique highlighted in my third bullet and
>> use COPY (or at least a multi-valued INSERT), then merge the new
>> data with the old. Would that be the most direct route to maximum
>> performance?
>>
>> In any case, I assume deleting and rebuilding indexes is important
>> here, yes? But what about raising checkpoint_segments - does this
>> actually help sustained throughput?
>>
>> >
>> > regards, tom lane
>>
>>
>> --
>> Yang Zhang
>> http://yz.mit.edu/
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general

--
Yang Zhang
http://yz.mit.edu/
