From: Yang Zhang <yanghatespam(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Optimizing bulk update performance
Date: 2013-04-27 08:05:06
Message-ID: CAKxBDU9vEbUoDYN7Pwe2SMrkJiifd5w3rtHXLqTuBkRiyJ=KZw@mail.gmail.com
Lists: pgsql-general

On Sat, Apr 27, 2013 at 12:24 AM, Yang Zhang <yanghatespam(at)gmail(dot)com> wrote:
> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Yang Zhang <yanghatespam(at)gmail(dot)com> writes:
>>> It currently takes up to 24h for us to run a large set of UPDATE
>>> statements on a database, which are of the form:
>>
>>> UPDATE table SET field1 = constant1, field2 = constant2, ... WHERE
>>> id = constid
>>
>>> (We're just overwriting fields of objects identified by ID.)
>>
>> Forgive the obvious question, but you do have an index on "id", right?
>> Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?
>
> Totally valid question. That is the primary key with its own index.
> Yes, we verified with EXPLAIN that it just uses a simple index scan.
> Each individual query runs reasonably quickly (we can run several
> dozen such statements per second).
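
For reference, the check we ran looks roughly like this from psycopg2 (a
minimal sketch; the id value, the sample field value, and the connection
string are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
    cur = conn.cursor()
    # EXPLAIN ANALYZE actually executes the UPDATE, hence the rollback below.
    cur.execute(
        "EXPLAIN ANALYZE UPDATE account SET field1 = %s WHERE id = %s",
        ("value-a", "001x000000abcdef"),
    )
    for (line,) in cur.fetchall():
        print(line)  # expect an Index Scan using account_pkey
    conn.rollback()  # discard the test update
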
>
>>
>>> The tables have handfuls of indices each and no foreign key constraints.
>>
>> How much is a "handful"?
>
> The table with the largest volume of updates (our bottleneck) has four indexes:
>
> "account_pkey" PRIMARY KEY, btree (id)
> "account_createddate" btree (createddate)
> "account_id_prefix" btree (id text_pattern_ops)
> "account_recordtypeid" btree (recordtypeid)
>
>>
>>> It takes 2h to import a `pg_dump` of the entire DB. This seems like a
>>> baseline we should reasonably target.
>>
>> Well, maybe. You didn't say what percentage of the DB you're updating.
>
> It can be 10-50% of rows changed - a large portion.
>
>>
>> But the thing that comes to mind here is that you're probably incurring
>> a network round trip for each row, and maybe a query-planning round as
>> well, so you really can't expect that this is going to be anywhere near
>> as efficient as a bulk load operation. You could presumably get rid of
>> the planner overhead by using a prepared statement. Cutting the network
>> overhead is going to require a bit more ingenuity --- could you move
>> some logic into a stored procedure, perhaps, so that one command from
>> the client is sufficient to update multiple rows?
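
For what it's worth, one way to act on the prepared-statement idea from
psycopg2 (which doesn't prepare statements on its own) would be to issue
PREPARE/EXECUTE as plain SQL. A rough sketch, with placeholder values and
the column types assumed to be text:

    import psycopg2

    rows = [
        ("value-a", "value-b", "001x000000abcdef"),  # (field1, field2, id), placeholders
        ("value-c", "value-d", "001x000000ghijkl"),
    ]

    conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
    cur = conn.cursor()
    # Plan the UPDATE once on the server; each EXECUTE then skips planning,
    # though it still costs one round trip per row.
    cur.execute("""
        PREPARE upd (text, text, text) AS
        UPDATE account SET field1 = $1, field2 = $2 WHERE id = $3
    """)
    for row in rows:
        cur.execute("EXECUTE upd (%s, %s, %s)", row)
    conn.commit()
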
>
> You're right, we're only sequentially issuing (unprepared) UPDATEs.
>
> If we ship many UPDATE statements per call to our DB API's execution
> function (we're using Python's psycopg2 if that matters, but I think
> that just binds libpq), would that avoid the network round trip per
> statement?
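
To make the question concrete, here is a rough sketch of what we have in
mind, assuming the account table and the field1/field2 columns from the
example above (the id values, data, and connection string are placeholders):

    import psycopg2

    updates = [
        ("value-a", "value-b", "001x000000abcdef"),  # (field1, field2, id), placeholders
        ("value-c", "value-d", "001x000000ghijkl"),
    ]

    conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
    cur = conn.cursor()
    # mogrify() does the client-side escaping; joining the statements with ";"
    # lets one execute() call carry the whole batch in a single round trip.
    stmts = [
        cur.mogrify(
            "UPDATE account SET field1 = %s, field2 = %s WHERE id = %s",
            row,
        ).decode("utf-8")
        for row in updates
    ]
    cur.execute("; ".join(stmts))
    conn.commit()
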
>
> If not, what if we use anonymous procedures (DO) to run multiple
> UPDATE statements?
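
Something like this is what we mean by an anonymous DO block (PostgreSQL
9.0+, PL/pgSQL by default); the row values are placeholders:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
    cur = conn.cursor()
    # One client command; the server runs every UPDATE inside the block.
    cur.execute("""
        DO $$
        BEGIN
            UPDATE account SET field1 = 'value-a' WHERE id = '001x000000abcdef';
            UPDATE account SET field1 = 'value-c' WHERE id = '001x000000ghijkl';
            -- ... more rows ...
        END
        $$;
    """)
    conn.commit()
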
>
> Finally, we could use the technique highlighted in my third bullet:
> use COPY (or at least a multi-valued INSERT) to load the new data,
> then merge it with the old. Would that be the most direct route to
> maximum performance?
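
Roughly what we have in mind: COPY the new values into a temporary staging
table, then fold them in with one set-based UPDATE ... FROM. A sketch only;
the account_staging name, the column types, the sample rows, and the DSN are
placeholders, while field1/field2 come from the example above:

    import io
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
    cur = conn.cursor()

    cur.execute("""
        CREATE TEMP TABLE account_staging (
            id     text PRIMARY KEY,
            field1 text,
            field2 text
        ) ON COMMIT DROP
    """)

    # Tab-separated rows; in practice these would stream from the real source.
    buf = io.StringIO("001x000000abcdef\tvalue-a\tvalue-b\n"
                      "001x000000ghijkl\tvalue-c\tvalue-d\n")
    cur.copy_expert("COPY account_staging FROM STDIN", buf)

    # One set-based statement merges everything into the real table.
    cur.execute("""
        UPDATE account AS a
        SET    field1 = s.field1,
               field2 = s.field2
        FROM   account_staging AS s
        WHERE  a.id = s.id
    """)
    conn.commit()
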
>
> In any case, I assume deleting and rebuilding indexes is important
> here, yes? But what about raising checkpoint_segments - does this
> actually help sustained throughput?
(I ask because I'm wondering if raising checkpoint_segments simply
postpones inevitable work, or if collecting a larger amount of changes
really does dramatically improve throughput somehow.)
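
For concreteness, the index shuffle we have in mind looks like this, keeping
account_pkey since the WHERE id = ... lookups need it (index definitions are
the ones listed above; the DSN is a placeholder):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
    cur = conn.cursor()

    # Drop the secondary indexes, but keep account_pkey for the id lookups.
    for idx in ("account_createddate", "account_id_prefix", "account_recordtypeid"):
        cur.execute("DROP INDEX IF EXISTS " + idx)
    conn.commit()

    # ... run the bulk update here ...

    cur.execute("CREATE INDEX account_createddate  ON account (createddate)")
    cur.execute("CREATE INDEX account_id_prefix    ON account (id text_pattern_ops)")
    cur.execute("CREATE INDEX account_recordtypeid ON account (recordtypeid)")
    conn.commit()

    # The checkpoint_segments question refers to postgresql.conf settings like:
    #   checkpoint_segments = 32
    #   checkpoint_completion_target = 0.9
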
>
>>
>> regards, tom lane
>
--
Yang Zhang
http://yz.mit.edu/