Re: is single row update improved with function

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: is single row update improved with function
Date: 2018-01-01 03:07:25
Message-ID: E9E1FEFB-D29B-4B1D-B574-A05877ABC13D@gmail.com


> On Dec 31, 2017, at 4:31 PM, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
>
> On 1 January 2018 at 12:06, Rob Sargent <robjsargent(at)gmail(dot)com> wrote:
>> I must update 3M of 100M records, with tuple-specific modifications. I can generate the necessary SQL, but I’m wondering whether files of simple update statements, each affecting a single row, are more effective than files of function calls doing the same update given the necessary values, including the where-clause restrictions. The query plan established by the first statement should be decent for the remainder.
>>
>> Alternatively, would a bulk load into a table of replacement values and join info be the fastest way?
>
> It's probably also worth thinking about this table's usage pattern. If
> this table is an otherwise static table, then you may wish to think
> about the little bit of bloat that doing the UPDATEs in a single
> transaction would cause.
>
Sorry, I didn’t address the question about the table's usage. Currently we’re in a data-loading phase and this table is almost complete. Thereafter (post vacuum analyze) it will be 99.99% read-only. The remainder will be updated in much the same fashion as described earlier (set events_x = events_x + increment).

This table will be analyzed in a couple of ways, mainly to determine significance thresholds across various slices, each of which examines roughly one tenth of the records.
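
For concreteness, each generated statement looks roughly like the following (the table and column names here are placeholders, not our real schema):

    -- one generated statement per affected row; the key column is indexed
    UPDATE events_t
       SET events_x = events_x + 3      -- tuple-specific increment
     WHERE event_id = 123456;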

>> Either way I can break the updates into roughly 393 transactions (7500 rows affected per tx) or 8646 transactions (350 rows per tx) if less is more in this world.
>
> If you were to perform the UPDATEs in batches it would allow you to
> run a VACUUM between the UPDATEs. However, it might not be so
> important as 3 million rows in 100 million is just 3%, so assuming all
> your rows are the same size, then even doing this as a single
> transaction would only cause 3% churn on the table. Possibly some of
> the UPDATEs would reuse existing free space within the table, but if
> they don't then it would only mean an extra 3% bloat.
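
If I go the batched route, I picture driving each batch with a psql script along these lines (a sketch only; table and column names are invented):

    BEGIN;
    -- ~7500 generated single-row updates per batch
    UPDATE events_t SET events_x = events_x + 2 WHERE event_id = 1001;
    UPDATE events_t SET events_x = events_x + 5 WHERE event_id = 1002;
    -- ...
    COMMIT;

    -- VACUUM cannot run inside a transaction block, so it goes between batches
    VACUUM events_t;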
>
> As for which is faster. It's most likely going to depend on the query
> plan for the UPDATE statements. If you need to perform 3 million seq
> scans on the table, by doing 3 million individual statements, that's
> likely not going to perform well. 3 million statements is likely not a
> good option in any case as it means parsing and planning 3 million
> UPDATE statements. Even your 393 statements might not be very good if
> each of those UPDATEs must perform a seq scan on the 100 million row
> table, but if each of those 393 statements can make use of an index to
> easily get those 7500 rows, then that might be a better option than
> doing the single UPDATE join method you mentioned.
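
The single UPDATE join method I had in mind is roughly the following sketch (again with invented names; the key column on the big table is indexed):

    -- stage the 3M replacement increments
    CREATE TEMP TABLE pending_updates (event_id bigint PRIMARY KEY, increment int);
    \copy pending_updates FROM 'updates.csv' WITH (FORMAT csv)

    -- one UPDATE join driven by the staging table
    UPDATE events_t e
       SET events_x = e.events_x + u.increment
      FROM pending_updates u
     WHERE e.event_id = u.event_id;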
>
> It does sound like something you could take offline and benchmark if
> performance is that critical. It's not really possible for us to tell
> which is faster without seeing the schema, UPDATE statements and query
> plans chosen.
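
Agreed. At minimum I can check the plan of a representative statement before committing to an approach, e.g. (reusing the invented names from the sketch above):

    BEGIN;
    EXPLAIN (ANALYZE, BUFFERS)
    UPDATE events_t e
       SET events_x = e.events_x + u.increment
      FROM pending_updates u
     WHERE e.event_id = u.event_id;
    ROLLBACK;  -- EXPLAIN ANALYZE actually executes the update, so roll back while testing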
>
> --
> David Rowley http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
