From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Israel Brewster <ijbrewster(at)alaska(dot)edu>
Cc: "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: UPDATE many records
Date: 2020-01-06 20:38:02
Message-ID: CAFNqd5U52H_e__mv9Lw5jS5PMLhjQuR7PM6SAyDXxhjSLP1Dog@mail.gmail.com
Lists: pgsql-general
On Mon, Jan 6, 2020, 3:15 PM Israel Brewster <ijbrewster(at)alaska(dot)edu> wrote:
> On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbrowne(at)gmail(dot)com>
> wrote:
>
> On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrewster(at)alaska(dot)edu>
> wrote:
>
>> Thanks to a change in historical data, I have a need to update a large
>> number of records (around 50 million). The update itself is
>> straightforward, as I can just issue an "UPDATE table_name SET
>> changed_field=new_value();" (yes, new_value is the result of a stored
>> procedure, if that makes a difference) command via psql, and it should
>> work. However, due to the large number of records, this command will
>> obviously take a while, and if anything goes wrong during the update (one
>> bad value in row 45 million, a lost connection, etc.), all the work that
>> has been done already will be lost due to the transactional nature of
>> such commands (unless I am missing something).
>>
>> Given that each row update is completely independent of any other row, I
>> have the following questions:
>>
>> 1) Is there any way to set the command such that each row change is
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better
>> utilize multiple processor cores, other than manually breaking the data
>> into chunks and running a separate psql/update process for each chunk?
>> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
>> logical segregations I can apply), I’m just wondering if there is a more
>> automatic option.
>>
>
> Yeah, I'd be inclined to do this in batches.
>
> If, for instance, the table has a nice primary key, then I'd capture the
> primary keys into a side table, and grab tuples from the side table to
> process in more bite-sized batches of, say, a few thousand tuples each.
>
> create table just_keys as select pk_column from big_historical_table;
> alter table just_keys add column processed boolean;
> create index jkpk on just_keys(pk_column) where (processed is null);
> Then loop repeatedly along the lines of:
>
> create temp table iteration as select pk_column from just_keys where
> processed is null limit 1000;
> [do update on big_historical_table where pk_column in (select pk_column
> from iteration)]
> update iteration set processed='true' where pk_column in (select pk_column
> from iteration);
> drop table iteration;
>
>
> Parallelization is absolutely an interesting idea; if you want to use 8
> processes, then use a cycling sequence on the side table to spread tuples
> across the 8 processes, so that they can grab their own tuples and not
> block one another.
>
> In that case, more like...
> create temp sequence seq_procs start with 1 maxvalue 8 cycle;
> create temp table just_keys as select pk_column, false::boolean as
> processed, nextval('seq_procs') as batch_id from big_historical_table;
>
> The individual iterations then look for values in just_keys corresponding
> to their assigned batch number.
>
>
> Sounds like a reasonable approach. As Justin pointed out, it is actually
> likely that the process will be I/O bound rather than CPU bound, so my
> parallel idea may not have much merit after all, but the batching procedure
> makes sense. I assume you meant "update just_keys" in your sample rather
> than "update iteration" on the line just before "drop table iteration".
> Thanks for the info.
>
As for parallelism, if you have a really powerful disk subsystem (lots of
disks in a disk array), it may help. Or not, as already commented.
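If you do try the parallel route, each worker grabs only the keys tagged
with its own batch number, so the workers never fight over the same rows. A
minimal untested sketch for worker 3 of 8, using the batch_id column from
the sketch above:

    -- worker 3's slice: only keys assigned batch_id = 3
    create temp table iteration as
      select pk_column from just_keys
      where not processed and batch_id = 3
      limit 1000;

and then it runs the update/mark-processed steps against that temp table,
exactly as in the corrected loop just below.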
I didn't test my wee bit of code, so yep, I meant to update just_keys :-).
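Since that bug snuck in, here is the whole loop written out as a server-side
procedure. This is a minimal untested sketch, assuming PostgreSQL 11 or later
(for COMMIT inside a procedure), a bigint pk_column, and your new_value()
function doing the real work:

    create or replace procedure update_in_batches()
    language plpgsql
    as $$
    declare
      n bigint;
    begin
      create temp table if not exists iteration (pk_column bigint);
      loop
        truncate iteration;
        -- grab the next batch of unprocessed keys
        insert into iteration
          select pk_column from just_keys
          where processed is null
          limit 1000;
        get diagnostics n = row_count;
        exit when n = 0;

        update big_historical_table
          set changed_field = new_value()
          where pk_column in (select pk_column from iteration);

        -- mark them done in the side table (just_keys, not iteration!)
        update just_keys
          set processed = true
          where pk_column in (select pk_column from iteration);

        commit;  -- each batch is durable on its own
      end loop;
      drop table iteration;
    end;
    $$;

Then it's just CALL update_in_batches(); from psql, and if anything blows up
mid-run, you only lose the batch that was in flight.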
You won't find anything terribly much more automatic than that.
Oh, yah, there's a possible further complication: does the application need
to be stopped to do this update? Is the newest version of the app still
generating data that needs rewriting? Sure hope not...