From: Yang Zhang <yanghatespam(at)gmail(dot)com>
To: Misa Simic <misa(dot)simic(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Optimizing bulk update performance
Date: 2013-04-27 10:16:40
Message-ID: CAKxBDU-BjOxpoOXBjQpbHn9axS1DwCtfYUm3BCZXrE2tOH3F7g@mail.gmail.com
Lists: pgsql-general
On Sat, Apr 27, 2013 at 3:06 AM, Misa Simic <misa(dot)simic(at)gmail(dot)com> wrote:
> I dont know - u can test :)
I probably will, but I have a large backlog of such experiments to run
by now, and it always takes care to get benchmarks right (avoiding
disk cache effects, etc.). Certainly I think it would be helpful (or
at least hopefully not harmful) to ask here in case anyone already
knows. That's what brought me to this list. :)
>
> In whole solution it is just one command different - so easy to test and
> compare...
>
> To me it doesn't sound faster... It sounds like more operations are
> needed than should be done...
>
> And it produces more problems... i.e. what about table foo? What if
> another table references foo, etc...
Yep, I guess more specifically I was just thinking of dumping to a temp table:
CREATE TEMP TABLE tmp AS
SELECT * FROM foo;
TRUNCATE foo;
INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);
The remaining question is whether the bulk UPDATE can update many rows
efficiently (i.e. order the row visits so they are largely sequential
scans) - if so, I imagine it would be faster than the above.
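
For what it's worth, here's a rough sketch of how I might check that
before committing to either approach (bar is the staging table of new
values, and col1 is a placeholder column name):

-- Give the planner fresh statistics on the staged rows.
ANALYZE bar;

-- Dry-run the bulk update and look at the join strategy it picks.
-- A hash or merge join over a seq scan of foo should visit foo's heap
-- pages roughly in order; a nested loop of per-row index probes would
-- jump around instead.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE foo
SET col1 = bar.col1    -- repeat for each column being overwritten
FROM bar
WHERE foo.id = bar.id;
ROLLBACK;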
>
> On Saturday, April 27, 2013, Yang Zhang wrote:
>>
>> On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa(dot)simic(at)gmail(dot)com> wrote:
>> > Hi,
>> >
>> > If dataset for update is large...
>> >
>> > Maybe best would be:
>> >
>> > From client machine, instead of sending update statements with data -
>> > export data to file ready for copy command
>> > Transfer file to the server where pg is running
>> > Make pgsql function which
>> >
>> > Create temp table
>> > Copy to temp from the file
>> >
>> > Update original table with values in temp
>> >
>> > UPDATE foo
>> > SET col1 = bar.col1
>> > FROM bar
>> > WHERE foo.id = bar.id
>> >
>> > You don't need to do delete/insert - if you have just update commands....
>> >
>> > From the client, when the file is transferred - call your import
>> > function on the server
>> >
>> > Optionally you can run vacuum analyze after the bulk operation...
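
If I'm following, the server-side part of that would look roughly like
this (the file path, column names, and types below are made up):

-- Hypothetical staging table holding just the columns being updated.
CREATE TEMP TABLE foo_updates (id text PRIMARY KEY, col1 text);

-- Server-side COPY from the transferred file (made-up path; the role
-- needs permission to read files on the server).
COPY foo_updates FROM '/tmp/foo_updates.csv' WITH (FORMAT csv);

ANALYZE foo_updates;

-- Apply all the changes in one statement.
UPDATE foo
SET col1 = u.col1
FROM foo_updates u
WHERE foo.id = u.id;

-- Optionally, as suggested:
VACUUM ANALYZE foo;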
>>
>> But wouldn't a bulk UPDATE touch many existing pages (say, 20%
>> scattered around) to mark rows as dead (per MVCC)? I guess it comes
>> down to: will PG be smart enough to mark dead rows in largely
>> sequential scans (rather than, say, jumping around in whatever order
>> rows from foo are yielded by the above join)?
>>
>> In other words, when considering the alternative of:
>>
>> CREATE TABLE newfoo AS
>> SELECT * FROM bar
>> UNION
>> SELECT * FROM foo
>> WHERE id NOT IN (SELECT id FROM bar);
>>
>> Wouldn't this alternative be faster?
>>
>> >
>> > Kind regards,
>> >
>> > Misa
>> >
>> >
>> >
>> > On Saturday, April 27, 2013, Yang Zhang wrote:
>> >>
>> >> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> >> > Yang Zhang <yanghatespam(at)gmail(dot)com> writes:
>> >> >> It currently takes up to 24h for us to run a large set of UPDATE
>> >> >> statements on a database, which are of the form:
>> >> >
>> >> >> UPDATE table SET field1 = constant1, field2 = constant2, ...
>> >> >> WHERE id = constid
>> >> >
>> >> >> (We're just overwriting fields of objects identified by ID.)
>> >> >
>> >> > Forgive the obvious question, but you do have an index on "id",
>> >> > right? Have you checked it's being used (ie EXPLAIN ANALYZE on one
>> >> > of these)?
>> >>
>> >> Totally valid question. That is the primary key with its own index.
>> >> Yes, we verified that explain says it just uses a simple index scan.
>> >> Each individual query runs reasonably quickly (we can run several
>> >> dozen such statements per second).
>> >>
>> >> >
>> >> >> The tables have handfuls of indices each and no foreign key
>> >> >> constraints.
>> >> >
>> >> > How much is a "handful"?
>> >>
>> >> The table with the largest volume of updates (our bottleneck) has four
>> >> indexes:
>> >>
>> >> "account_pkey" PRIMARY KEY, btree (id)
>> >> "account_createddate" btree (createddate)
>> >> "account_id_prefix" btree (id text_pattern_ops)
>> >> "account_recordtypeid" btree (recordtypeid)
>> >>
>> >> >
>> >> >> It takes 2h to import a `pg_dump` of the entire DB. This seems
>> >> >> like a baseline we should reasonably target.
>> >> >
>> >> > Well, maybe. You didn't say what percentage of the DB you're
>> >> > updating.
>> >>
>> >> It can be 10-50% of rows changed - a large portion.
>> >>
>> >> >
>> >> > But the thing that comes to mind here is that you're probably
>> >> > incurring a network round trip for each row, and maybe a
>> >> > query-planning round as well, so you really can't expect that
>> >> > this is going to be anywhere near as efficient as a bulk load
>> >> > operation. You could presumably get rid of the planner overhead
>> >> > by using a prepared statement. Cutting the network overhead is
>> >> > going to require a bit more ingenuity --- could you move some
>> >> > logic into a stored procedure, perhaps, so that one command from
>> >> > the client is sufficient to update multiple rows?
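
One single-command shape I could see for that (just a sketch - the ids
and values below are made up):

-- The client sends the new values inline, so many rows are updated in
-- one round trip with a single statement.
UPDATE foo
SET col1 = v.col1
FROM (VALUES ('id1', 'newval1'),
             ('id2', 'newval2')) AS v(id, col1)
WHERE foo.id = v.id;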
>> >>
>> >> You're right, we're only sequentially issuing (unprepared) UPDATEs.
>> >>
>> >> If we ship many UPDATE statements per call to our DB API's execution
>> >> function (we're using Python's psycopg2 if that matters, but I think
>> >> that just binds libpq), would that avoid the network round trip per
>> >> statement?
>> >>
>> >> If not, what if we use anonymous procedures (DO) to run multiple
>> >> UPDATE statements?
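
Something like this is what I have in mind (a sketch - the ids and
values are made up):

DO $$
BEGIN
  -- All of these travel to the server as a single command, avoiding a
  -- round trip per statement, though each UPDATE still runs row by row.
  UPDATE foo SET col1 = 'newval1' WHERE id = 'id1';
  UPDATE foo SET col1 = 'newval2' WHERE id = 'id2';
END
$$;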
>> >>
>> >> Finally, we could use the technique highlighted in my third bullet:
>> >> use COPY (or at least a multi-row INSERT), then merge the new data
>> >> with the old. Would that be the most direct route to maximum
>> >> performance?
>> >>
>> >> In any case, I assume deleting and rebuilding indexes is important
>> >> here, yes? But what about raising checkpoint_segments - does this
>> >> actually help sustained throughput?
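
Concretely, for the account table above I'm thinking of something like
this around the bulk operation (a sketch, assuming the table is named
account and keeping the primary key in place for the id lookups):

-- Drop the secondary indexes before the bulk update...
DROP INDEX account_createddate;
DROP INDEX account_id_prefix;
DROP INDEX account_recordtypeid;

-- ... run the bulk update here ...

-- ...then rebuild them and refresh statistics afterwards.
CREATE INDEX account_createddate ON account (createddate);
CREATE INDEX account_id_prefix ON account (id text_pattern_ops);
CREATE INDEX account_recordtypeid ON account (recordtypeid);
ANALYZE account;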
>> >>
>> >> >
>> >> > regards, tom lane
>> >>
>> >>
>> >> --
>> >> Yang Zhang
>> >> http://yz.mit.edu/
>> >>
>> >>
>> >> --
>> >> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>> >> To make changes to your subscription:
>> >> http://www.postgresql.org/mailpref/pgsql-general
>>
>>
>
--
Yang Zhang
http://yz.mit.edu/