Re: Optimizing bulk update performance

From: Misa Simic <misa(dot)simic(at)gmail(dot)com>
To: Yang Zhang <yanghatespam(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Optimizing bulk update performance
Date: 2013-04-27 10:37:39
Message-ID: CAH3i69=oqgZSZoeBpKcDZZWyGr8hXGWJGd188wWZH43o-erAVg@mail.gmail.com
Lists: pgsql-general

Well,

About the best approach with large datasets - there is rarely an "always
true" best principle...

There are always a few ways to do it, and only a test confirms which one is
best - it depends on many things like hardware, OS, etc... Sometimes it even
depends on the dataset being updated...
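
For example, one quick way I would time a candidate statement on a test copy
of the data, without keeping its effects (foo, tmp and col1 below are just
placeholder names):

-- EXPLAIN ANALYZE actually executes the statement, so wrap it in a
-- transaction and roll it back afterwards
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE foo
SET col1 = tmp.col1
FROM tmp
WHERE foo.id = tmp.id;
ROLLBACK;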

"
CREATE TEMP TABLE tmp AS
SELECT * FROM foo;

TRUNCATE foo;

INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);
"

The above doesn't make sense to me...

I would do:

CREATE TEMP TABLE tmp AS
SELECT * FROM foo WHERE 1=2;

COPY tmp FROM 'pathtofile';

UPDATE foo
SET col1 = tmp.col1,
    ...
    col15 = tmp.col15
FROM tmp
WHERE foo.id = tmp.id;
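
One small addition before running that UPDATE - temp tables are not analyzed
by autovacuum, so collecting statistics on tmp (and, if it helps, indexing the
join key) usually lets the planner pick a sensible join plan:

-- collect statistics on the freshly loaded temp table
ANALYZE tmp;
-- optional: an index on the join key, in case an indexed join plan wins
-- CREATE INDEX tmp_id_idx ON tmp (id);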

The UPDATE above covers the case where I know I need just updates... If my
dataset has a mix of updates and potentially new rows,

then instead of the UPDATE command I would do:

DELETE FROM foo WHERE EXISTS (SELECT 1 FROM tmp WHERE tmp.id = foo.id);
INSERT INTO foo
SELECT * FROM tmp;
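
To tie it together with what I suggested in my previous message (export a file
on the client, COPY it on the server), the whole server-side import could be
wrapped in a pgsql function - just a sketch, where the function name, the file
handling and the single id key column are assumptions:

CREATE OR REPLACE FUNCTION import_foo(p_file text)
RETURNS void AS $$
BEGIN
    -- staging table with the same structure as foo, dropped at commit
    CREATE TEMP TABLE tmp ON COMMIT DROP AS
    SELECT * FROM foo WHERE 1 = 2;

    -- server-side COPY: the path is read by the server process,
    -- so it normally needs superuser rights
    EXECUTE format('COPY tmp FROM %L', p_file);

    -- mix of updates and new rows: replace matching rows, then insert all
    DELETE FROM foo WHERE EXISTS (SELECT 1 FROM tmp WHERE tmp.id = foo.id);
    INSERT INTO foo SELECT * FROM tmp;
END;
$$ LANGUAGE plpgsql;

-- from the client, once the file has been transferred:
-- SELECT import_foo('/path/to/file');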

On Saturday, April 27, 2013, Yang Zhang wrote:

> On Sat, Apr 27, 2013 at 3:06 AM, Misa Simic <misa(dot)simic(at)gmail(dot)com>
> wrote:
> > I don't know - u can test :)
>
> I probably will, but I do have a huge stack of such experiments to run
> by now, and it's always tricky / takes care to get benchmarks right,
> avoid disk caches, etc. Certainly I think it would be helpful (or at
> least hopefully not harmful) to ask here to see if anyone might just
> know. That's what brought me to this list. :)
>
> >
> > In whole solution it is just one command different - so easy to test and
> > compare...
> >
> > To me it doesn't sound faster... It sounds like more operations would
> > need to be done...
> >
> > And it produces more problems... i.e. what about table foo? What if
> > another table references foo, etc...
>
> Yep, I guess more specifically I was just thinking of dumping to a temp
> table:
>
> CREATE TEMP TABLE tmp AS
> SELECT * FROM foo;
>
> TRUNCATE foo;
>
> INSERT INTO foo
> SELECT * FROM bar
> UNION
> SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);
>
> The question I have remaining is whether the bulk UPDATE will be able
> to update many rows efficiently (smartly order them to do largely
> sequential scans) - if so, I imagine it would be faster than the
> above.
>
> >
> > On Saturday, April 27, 2013, Yang Zhang wrote:
> >>
> >> On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa(dot)simic(at)gmail(dot)com>
> >> wrote:
> >> > Hi,
> >> >
> >> > If dataset for update is large...
> >> >
> >> > Maybe best would be:
> >> >
> >> > From the client machine, instead of sending update statements with data -
> >> > export the data to a file ready for the COPY command
> >> > Transfer the file to the server where pg is running
> >> > Make a pgsql function which:
> >> >
> >> > Create temp table
> >> > Copy to temp from the file
> >> >
> >> > Update original table with values in temp
> >> >
> >> > UPDATE foo
> >> > SET col1 = bar.col1
> >> > FROM bar
> >> > WHERE foo.id = bar.id
> >> >
> >> > You don't need to do delete/insert - if you have just update
> >> > commands....
> >> >
> >> > From the client, when the file is transferred - call your import
> >> > function on the server
> >> >
> >> > Optionally you can run VACUUM ANALYZE after the bulk operation...
> >>
> >> But wouldn't a bulk UPDATE touch many existing pages (say, 20%
> >> scattered around) to mark rows as dead (per MVCC)? I guess it comes
> >> down to: will PG be smart enough to mark dead rows in largely
> >> sequential scans (rather than, say, jumping around in whatever order
> >> rows from foo are yielded by the above join)?
> >>
> >> In other words, when considering the alternative of:
> >>
> >> CREATE TABLE newfoo AS
> >> SELECT * FROM bar
> >> UNION
> >> SELECT * FROM foo
> >> WHERE id NOT IN (SELECT id FROM bar);
> >>
> >> Wouldn't this alternative be faster?
> >>
> >> >
> >> > Kind regards,
> >> >
> >> > Misa
> >> >
> >> >
> >> >
> >> > On Saturday, April 27, 2013, Yang Zhang wrote:
> >> >>
> >> >> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> >> > Yang Zhang <yanghatespam(at)gmail(dot)com> writes:
> >> >> >> It currently takes up to 24h for us to run a large set of UPDATE
> >> >> >> statements on a database, which are of the form:
> >> >> >
> >> >> >> UPDATE table SET field1 = constant1, field2 = constant2, ...
> >> >> >> WHERE
> >> >> >> id = constid
> >> >> >
> >> >> >> (We're just overwriting fields of objects identified by ID.)
> >> >> >
> >> >> > Forgive the obvious question, but you do have an index on "id",
> >> >> > right?
> >> >> > Have you checked it's being used (ie EXPLAIN ANALYZE on one of
> >> >> > these)?
> >> >>
> >> >> Totally valid question. That is the primary key with its own index.
> >> >> Yes, we verified that EXPLAIN says it just uses a simple index scan.
> >> >> Each individual query runs reasonably quickly (we can run several
> >> >> dozen such statements per second).
> >> >>
> >> >> >
> >> >> >> The tables have handfuls of indices each and no foreign key
> >> >> >> constraints.
> >> >> >
> >> >> > How much is a "handful"?
> >> >>
> >> >> The table with the largest volume of updates (our bottleneck) has four
> >> >> indexes:
> >> >>
> >> >> "account_pkey" PRIMARY KEY, btree (id)
> >> >> "account_createddate" btree (createddate)
> >> >> "account_id_prefix" btree (id text_pattern_ops)
> >> >> "account_recordtypeid" btree (recordtypeid)
> >> >>
> >> >> >
> >> >> >> It takes 2h to import a `pg_dump` of the entire DB. This seems like
> >> >> >> a
