From: | Yang Zhang <yanghatespam(at)gmail(dot)com> |
---|---|
To: | Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Optimizing bulk update performance |
Date: | 2013-04-27 02:30:05 |
Message-ID: | CAKxBDU-G9q+8Ne2Tn5QJEy8m+L2DjdfNH1i4KA0NtbVXbiV5rA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On Fri, Apr 26, 2013 at 7:01 PM, Gavin Flower
<GavinFlower(at)archidevsys(dot)co(dot)nz> wrote:
> Please do not top post, the convention in these list are to add stuff at the
> end, apart from comments interspersed to make use of appropriate context!
Noted, thanks.
Anyway, any performance hints are greatly appreciated.
>
>
> On 27/04/13 13:35, Yang Zhang wrote:
>>
>> We're using Postgresql 9.1.9 on Ubuntu 12.04 on EBS volumes on
>> m1.xlarge instances, which have:
>>
>> 15 GiB memory
>> 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
>> 64-bit platform
>>
>> (Yes, we're moving to EBS Optimized instances + Provisioned IOPS
>> volumes, but prelim. benchmarks suggest this won't get us enough of a
>> boost as much as possibly refactoring the way we're executing these
>> bulk updates in our application.)
>>
>> On Fri, Apr 26, 2013 at 5:27 PM, Gavin Flower
>> <GavinFlower(at)archidevsys(dot)co(dot)nz> wrote:
>>>
>>> On 27/04/13 12:14, Yang Zhang wrote:
>>>
>>> It currently takes up to 24h for us to run a large set of UPDATE
>>> statements on a database, which are of the form:
>>>
>>> UPDATE table SET field1 = constant1, field2 = constant2, ... WHERE
>>> id = constid
>>>
>>> (We're just overwriting fields of objects identified by ID.)
>>>
>>> The tables have handfuls of indices each and no foreign key constraints.
>>> No COMMIT is made till the end.
>>>
>>> It takes 2h to import a `pg_dump` of the entire DB. This seems like a
>>> baseline we should reasonably target.
>>>
>>> Short of producing a custom program that somehow reconstructs a dataset
>>> for Postgresql to re-import, is there anything we can do to bring the
>>> bulk UPDATE performance closer to that of the import? (This is an area
>>> that we believe log-structured merge trees handle well, but we're
>>> wondering if there's anything we can do within Postgresql.)
>>>
>>> Some ideas:
>>>
>>> - dropping all non-ID indices and rebuilding afterward?
>>> - increasing checkpoint_segments, but does this actually help sustained
>>> long-term throughput?
>>> - using the techniques mentioned here? (Load new data as table, then
>>> "merge in" old data where ID is not found in new data)
>>>
>>>
>>> <http://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9668@mail.gmail.com>
>>>
>>> Basically there's a bunch of things to try and we're not sure what the
>>> most effective are or if we're overlooking other things. We'll be
>>> spending the next few days experimenting, but we thought we'd ask here
>>> as well.
>>>
>>> Thanks.
>>>
>>>
>>> People will need to know your version of Postgres & which Operating
>>> System
>>> etc. plus details of CPU RAM, and Disks... AS well as what changes you
>>> have
>>> made to postgresql.conf...
>>>
>>> I would be inclined to DROP all indexes and reCREATE them later.
>>>
>>> Updating a row might lead to new row being added in a new disk page, so I
>>> suspect that updates will hit every index associated with the table with
>>> the
>>> (possible exception of partial indexes).
>>>
>>> Running too many updates in one transaction, may mean that Postgres may
>>> need
>>> to use disk work files.
>>>
>>> Depending on RAM etc, it may pay to increase some variables tat affect
>>> how
>>> Postgres uses RAM, some of these are per session.
>>>
>>>
>>> Cheers,
>>> Gavin
>>
>>
>>
>> --
>> Yang Zhang
>> http://yz.mit.edu/
>
>
--
Yang Zhang
http://yz.mit.edu/
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2013-04-27 04:41:37 | Re: Optimizing bulk update performance |
Previous Message | Gavin Flower | 2013-04-27 02:01:07 | Re: Optimizing bulk update performance |