From: Yang Zhang <yanghatespam(at)gmail(dot)com>
To: Misa Simic <misa(dot)simic(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Optimizing bulk update performance
Date: 2013-04-27 10:05:18
Message-ID: CAKxBDU94Z5qowUBAkbxbb2N4GCfJn7+TGsq6fm5e4fgj8QRkeA@mail.gmail.com
Lists: pgsql-general
On Sat, Apr 27, 2013 at 2:54 AM, Yang Zhang <yanghatespam(at)gmail(dot)com> wrote:
> On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa(dot)simic(at)gmail(dot)com> wrote:
>> Hi,
>>
>> If the dataset for the update is large...
>>
>> Maybe the best approach would be:
>>
>> From the client machine, instead of sending update statements with data, export
>> the data to a file ready for the COPY command
>> Transfer the file to the server where pg is running
>> Make a pgsql function which will:
>>
>> Create a temp table
>> Copy into the temp table from the file
>>
>> Update the original table with the values in temp
>>
>> UPDATE foo
>> SET col1 = bar.col1
>> FROM bar
>> WHERE foo.id = bar.id
>>
>> You don't need to do delete/insert if you have just update commands.
>>
>> From the client, when the file is transferred, call your import function on the
>> server
>>
>> Optionally you can run VACUUM ANALYZE after the bulk operation.
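(Sketching that sequence out for the archives - table name, columns, and the
file path below are just placeholders, and the server-side COPY needs a path
the backend process can read:)

BEGIN;
CREATE TEMP TABLE foo_updates (LIKE foo) ON COMMIT DROP;
COPY foo_updates FROM '/path/readable/by/server/updates.csv' WITH (FORMAT csv);

UPDATE foo
SET col1 = u.col1
FROM foo_updates u
WHERE foo.id = u.id;
COMMIT;

VACUUM ANALYZE foo;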
>
> But wouldn't a bulk UPDATE touch many existing pages (say, 20%
> scattered around) to mark rows as dead (per MVCC)? I guess it comes
> down to: will PG be smart enough to mark dead rows in largely
> sequential scans (rather than, say, jumping around in whatever order
> rows from foo are yielded by the above join)?
(This then raises the question: how might I see this seemingly
substantial performance implication, one way or the other, in, say,
EXPLAIN output or something like that?)
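(For instance, I'm guessing something like this on a single statement would
show how many buffers get hit, read, and dirtied - wrapped in a transaction so
the test update can be rolled back:)

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE foo
SET col1 = bar.col1
FROM bar
WHERE foo.id = bar.id;
ROLLBACK;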
>
> In other words, when considering the alternative of:
>
> CREATE TABLE newfoo AS
> SELECT * FROM bar
> UNION
> SELECT * FROM foo
> WHERE id NOT IN (SELECT id FROM bar);
>
> Wouldn't this alternative be faster?
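(Presumably with UNION ALL, since the NOT IN already excludes the overlap, and
followed by swapping the tables and rebuilding indexes - names here are just
illustrative:)

BEGIN;
DROP TABLE foo;  -- assumes nothing else (views, FKs) depends on foo
ALTER TABLE newfoo RENAME TO foo;
ALTER TABLE foo ADD PRIMARY KEY (id);
-- plus recreating the remaining indexes from the original table
COMMIT;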
>
>>
>> Kind regards,
>>
>> Misa
>>
>>
>>
>> On Saturday, April 27, 2013, Yang Zhang wrote:
>>>
>>> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> > Yang Zhang <yanghatespam(at)gmail(dot)com> writes:
>>> >> It currently takes up to 24h for us to run a large set of UPDATE
>>> >> statements on a database, which are of the form:
>>> >
>>> >> UPDATE table SET field1 = constant1, field2 = constant2, ... WHERE
>>> >> id = constid
>>> >
>>> >> (We're just overwriting fields of objects identified by ID.)
>>> >
>>> > Forgive the obvious question, but you do have an index on "id", right?
>>> > Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?
>>>
>>> Totally valid question. That is the primary key with its own index.
>>> Yes, we verified that EXPLAIN shows it just uses a simple index scan.
>>> Each individual query runs reasonably quickly (we can run several
>>> dozen such statements per second).
>>>
>>> >
>>> >> The tables have handfuls of indices each and no foreign key
>>> >> constraints.
>>> >
>>> > How much is a "handful"?
>>>
>>> The table with the largest volume of updates (our bottleneck) has four
>>> indexes:
>>>
>>> "account_pkey" PRIMARY KEY, btree (id)
>>> "account_createddate" btree (createddate)
>>> "account_id_prefix" btree (id text_pattern_ops)
>>> "account_recordtypeid" btree (recordtypeid)
>>>
>>> >
>>> >> It takes 2h to import a `pg_dump` of the entire DB. This seems like a
>>> >> baseline we should reasonably target.
>>> >
>>> > Well, maybe. You didn't say what percentage of the DB you're updating.
>>>
>>> It can be 10-50% of rows changed - a large portion.
>>>
>>> >
>>> > But the thing that comes to mind here is that you're probably incurring
>>> > a network round trip for each row, and maybe a query-planning round as
>>> > well, so you really can't expect that this is going to be anywhere near
>>> > as efficient as a bulk load operation. You could presumably get rid of
>>> > the planner overhead by using a prepared statement. Cutting the network
>>> > overhead is going to require a bit more ingenuity --- could you move
>>> > some logic into a stored procedure, perhaps, so that one command from
>>> > the client is sufficient to update multiple rows?
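(A minimal sketch of the prepared-statement idea - parameter types, column
names, and values below are made up for illustration:)

PREPARE upd (text, text, text) AS
  UPDATE account SET field1 = $2, field2 = $3 WHERE id = $1;

EXECUTE upd('some-id', 'new value 1', 'new value 2');
DEALLOCATE upd;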
>>>
>>> You're right, we're only sequentially issuing (unprepared) UPDATEs.
>>>
>>> If we ship many UPDATE statements per call to our DB API's execution
>>> function (we're using Python's psycopg2 if that matters, but I think
>>> that just binds libpq), would that avoid the network round trip per
>>> statement?
>>>
>>> If not, what if we use anonymous code blocks (DO) to run multiple
>>> UPDATE statements?
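(i.e. wrapping a batch in a single anonymous block, something like the
following, with placeholder values:)

DO $$
BEGIN
  UPDATE account SET field1 = 'v1' WHERE id = 'id1';
  UPDATE account SET field1 = 'v2' WHERE id = 'id2';
  -- more generated per-row updates, all sent in one round trip
END
$$;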
>>>
>>> Finally, we could use the technique highlighted in my third bullet and
>>> use COPY (or at least a multi-valued INSERT), then merge the new
>>> data with the old. Would that be the most direct route to maximum
>>> performance?
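(Or, for the multi-row variant, perhaps one statement per batch along these
lines - again with placeholder values:)

UPDATE account AS a
SET field1 = v.field1, field2 = v.field2
FROM (VALUES
  ('id1', 'x1', 'y1'),
  ('id2', 'x2', 'y2')
) AS v(id, field1, field2)
WHERE a.id = v.id;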
>>>
>>> In any case, I assume dropping and rebuilding the indexes is important
>>> here, yes? But what about raising checkpoint_segments - does this
>>> actually help sustained throughput?
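(For the index part, I imagine that means dropping the secondary indexes
listed above before the bulk update and recreating them afterwards, while
keeping the primary key so the WHERE id = ... lookups stay indexed:)

DROP INDEX account_createddate;
DROP INDEX account_id_prefix;
DROP INDEX account_recordtypeid;

-- bulk update runs here

CREATE INDEX account_createddate ON account (createddate);
CREATE INDEX account_id_prefix ON account (id text_pattern_ops);
CREATE INDEX account_recordtypeid ON account (recordtypeid);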
>>>
>>> >
>>> > regards, tom lane
>>>
>>>
>>> --
>>> Yang Zhang
>>> http://yz.mit.edu/
>>>
>>>
>>> --
>>> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-general
>
>
>
> --
> Yang Zhang
> http://yz.mit.edu/
--
Yang Zhang
http://yz.mit.edu/