From: Joe Abbate <jma(at)freedomcircle(dot)com>
To: psycopg(at)postgresql(dot)org
Subject: Re: Fastest way to insert/update many rows
Date: 2014-08-12 19:52:54
Message-ID: 53EA7096.6050801@freedomcircle.com
Lists: psycopg
Pascal,
On 12/08/14 09:46, pascal(at)ensieve(dot)org wrote:
> I'd like psycopg2 to fetch a large number of rows (hundreds of millions), perform some computations and put them back into the database.
>
> I can fetch about 130k rows/sec with
> cur.execute('select * from stuff')
> keyvals = list(cur)
> and 100k/sec with
> f = io.StringIO()
> cur.copy_to(f, 'stuff')
> f.seek(0)
> keyvals = list(tuple(map(int, l.split('\t'))) for l in f)
>
> but inserting using
> cur.executemany('insert into stuff values (%s, %s)', keyvals)
> only has a throughput of 23k/sec with ca. 20% CPU used by Python, 80% by Postgres, while
> cur.copy_from(io.StringIO('\n'.join('{}\t{}'.format(*r) for r in keyvals)), 'stuff')
> manages to insert 1.8M/sec.
>
> I can't quite believe that generating a string should be the fastest method, am I missing something?
>
>
> What I'd really like to do is
> cur.executemany('update stuff set value = %s where key = %s', ...)
> but that was orders of magnitude slower still; probably because the order is random, so it performs an index lookup for each key.
> Populating a temporary table and using 'update stuff ... from temptable ...' is quicker.
>
> I have to set one column in each row, is there a way to update cursors like in PL/pgSQL's
> update <table> set ... where current of <cursor>
> i.e. iterate through the rows in the most efficient way for the database.
>
>
> Or would it be wiser to use PL/Python for this kind of task instead?
In any relational database, the fastest way to do something is to have
the server do the work on a *set* of rows at a time. So, ideally, I'd
try to have the client issue a single statement like:
UPDATE stuff SET value = value_expression WHERE some_expression;
If you're going to update all the rows, then eliminate the WHERE clause
and let the server scan and update in whichever way it finds most
efficient. If you need a WHERE, then try to use some_expression that
will take advantage of sequential scanning as much as possible. If
value_expression is too complicated, in PG you can use functions to deal
with the complexity (functions can also be used for some_expression).
You can use this technique even if you have to update more than one
column value.
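As a minimal sketch (the table and the WHERE condition here are assumptions for illustration, not from the thread), the set-based approach amounts to one execute() instead of one statement per row via executemany():

```python
# Minimal sketch of the set-based approach: a single UPDATE sent to the
# server, which then touches the whole set of rows in one pass.
# The table stuff(key, value) and the doubling expression are assumed.

def set_based_update(cur, threshold):
    """Update every qualifying row with one server-side statement."""
    cur.execute(
        "UPDATE stuff SET value = value * 2 WHERE value > %s",
        (threshold,),
    )
```

With a real psycopg2 cursor this would be followed by conn.commit(); the point is that the per-row loop happens inside the server, not in Python.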
If the decision logic is too complex, in PG you can use
SELECT * FROM some_func(arg1, arg2);
where some_func will implement the logic for deciding what rows to
update and what values to update.
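A sketch of that function-based route, where some_func is a hypothetical PL/pgSQL function (its name, arguments and body are assumptions for illustration) that performs the update server-side and returns how many rows it touched:

```python
# Hedged sketch: some_func is hypothetical, meant only to show the shape
# of pushing the decision logic into the server. The DDL is run once;
# the client then invokes the function with a plain SELECT.

SOME_FUNC_DDL = """
CREATE FUNCTION some_func(threshold int, factor int)
RETURNS integer LANGUAGE plpgsql AS $$
DECLARE
    n integer;
BEGIN
    UPDATE stuff SET value = value * factor WHERE value > threshold;
    GET DIAGNOSTICS n = ROW_COUNT;
    RETURN n;  -- report how many rows were updated
END
$$;
"""

def update_via_function(cur, threshold, factor):
    """Call the server-side function and return the reported row count."""
    cur.execute("SELECT * FROM some_func(%s, %s)", (threshold, factor))
    return cur.fetchone()[0]
```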
Cheers,
Joe
Previous Message: Daniele Varrazzo, 2014-08-12 19:17:10, Re: Fastest way to insert/update many rows
Next Message: Daniele Varrazzo, 2014-08-12 21:23:36, Re: JSONB marshalling