Re: Fastest way to insert/update many rows

From: Joe Abbate <jma(at)freedomcircle(dot)com>
To: psycopg(at)postgresql(dot)org
Subject: Re: Fastest way to insert/update many rows
Date: 2014-08-12 19:52:54
Message-ID: 53EA7096.6050801@freedomcircle.com
Lists: psycopg

Pascal,

On 12/08/14 09:46, pascal(at)ensieve(dot)org wrote:
> I'd like to use psycopg2 to fetch a large number of rows (hundreds of millions), perform some computations, and put the results back into the database.
>
> I can fetch about 130k rows/sec with
> cur.execute('select * from stuff')
> keyvals = list(cur)
> and 100k/sec with
> f = io.StringIO()
> cur.copy_to(f, 'stuff')
> f.seek(0)
> keyvals = list(tuple(map(int, l.split('\t'))) for l in f)
>
> but inserting using
> cur.executemany('insert into stuff values (%s, %s)', keyvals)
> only has a throughput of 23k/sec with ca. 20% CPU used by Python, 80% by Postgres, while
> cur.copy_from(io.StringIO('\n'.join('{}\t{}'.format(*r) for r in keyvals)), 'stuff')
> manages to insert 1.8M/sec.
>
> I can't quite believe that generating a string should be the fastest method, am I missing something?
>
>
> What I'd really like to do is
> cur.executemany('update stuff set value = %s where key = %s', ...)
> but that was orders of magnitude slower still; probably because the order is random, so it performs an index lookup for each key.
> Populating a temporary table and using 'update stuff ... from temptable ...' is quicker.
>
> I have to set one column in each row; is there a way to update through cursors, like PL/pgSQL's
> update <table> set ... where current of <cursor>
> i.e. iterate through the rows in the most efficient way for the database.
>
>
> Or would it be wiser to use PL/Python for this kind of task instead?

In any relational database, the fastest way to do something is to have
the server do the work on a *set* of rows at a time. So, ideally, I'd
try to have the client issue a single statement like:

UPDATE stuff SET value = value_expression WHERE some_expression;

If you're going to update all the rows, then eliminate the WHERE clause
and let the server scan and update in whichever way it finds is most
efficient. If you need a WHERE, then try to use some_expression that
will take advantage of sequential scanning as much as possible. If
value_expression is too complicated, in PG you can use functions to deal
with the complexity (functions can also be used for some_expression).
You can use this technique even if you have to update more than one
column value.
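
For example, a rough, untested sketch of that with psycopg2, building on the
temp-table-plus-COPY approach you already found to be quicker (the connection
string is a placeholder, and the table and column names are just the ones from
your message):

    import io
    import psycopg2

    conn = psycopg2.connect("dbname=test")      # placeholder DSN
    cur = conn.cursor()

    keyvals = [(1, 10), (2, 20)]                # the (key, value) pairs to apply

    # Stage the new values in a temp table via COPY, then apply them to
    # the target table with a single set-based UPDATE ... FROM.
    cur.execute("CREATE TEMP TABLE stuff_new (key int, value int) ON COMMIT DROP")
    buf = io.StringIO('\n'.join('{}\t{}'.format(k, v) for k, v in keyvals))
    cur.copy_from(buf, 'stuff_new')
    cur.execute("""
        UPDATE stuff
           SET value = stuff_new.value
          FROM stuff_new
         WHERE stuff.key = stuff_new.key
    """)
    conn.commit()

That lets the server join the staged rows against stuff in one pass instead of
doing one round trip and index lookup per row, as executemany() does.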

If the decision logic is too complex, in PG you can use

SELECT * FROM some_func(arg1, arg2);

where some_func implements the logic for deciding which rows to update and
what values to set.
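
As a rough sketch of what that could look like (the function name, its
arguments and the update rule are invented purely for illustration, reusing
the connection and cursor from the snippet above):

    cur.execute("""
        CREATE OR REPLACE FUNCTION update_stuff(threshold int, factor int)
        RETURNS integer AS $$
        DECLARE
            n integer;
        BEGIN
            -- decide which rows to touch and what to set them to,
            -- entirely inside the server
            UPDATE stuff
               SET value = value * factor
             WHERE value > threshold;
            GET DIAGNOSTICS n = ROW_COUNT;
            RETURN n;
        END;
        $$ LANGUAGE plpgsql;
    """)
    cur.execute("SELECT * FROM update_stuff(%s, %s)", (100, 2))
    print(cur.fetchone()[0], "rows updated")
    conn.commit()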

Cheers,

Joe
