From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Israel Brewster <ijbrewster(at)alaska(dot)edu>
Cc: "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: UPDATE many records
Date: 2020-01-06 20:38:02
Message-ID: CAFNqd5U52H_e__mv9Lw5jS5PMLhjQuR7PM6SAyDXxhjSLP1Dog@mail.gmail.com
Lists: pgsql-general
On Mon, Jan 6, 2020, 3:15 PM Israel Brewster <ijbrewster(at)alaska(dot)edu> wrote:
> On Jan 6, 2020, at 10:08 AM, Christopher Browne <cbbrowne(at)gmail(dot)com>
> wrote:
>
> On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrewster(at)alaska(dot)edu>
> wrote:
>
>> Thanks to a change in historical data, I have a need to update a large
>> number of records (around 50 million). The update itself is
>> straightforward, as I can just issue an "UPDATE table_name SET
>> changed_field=new_value();" (yes, new_value is the result of a stored
>> procedure, if that makes a difference) command via psql, and it should
>> work. However, due to the large number of records, this command will
>> obviously take a while, and if anything goes wrong during the update (one
>> bad value in row 45 million, a lost connection, etc.), all the work that
>> has been done already will be lost due to the transactional nature of
>> such commands (unless I am missing something).
>>
>> Given that each row update is completely independent of any other row, I
>> have the following questions:
>>
>> 1) Is there any way to set the command such that each row change is
>> committed as it is calculated?
>> 2) Is there some way to run this command in parallel in order to better
>> utilize multiple processor cores, other than manually breaking the data
>> into chunks and running a separate psql/update process for each chunk?
>> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
>> logical segregations I can apply), I’m just wondering if there is a more
>> automatic option.
>>
>
> Yeah, I'd be inclined to do this in batches.
>
> If, for instance, the table has a nice primary key, then I'd capture the
> primary keys into a side table, and grab tuples from the side table to
> process in more bite-sized batches of, say, a few thousand tuples each.
>
> create table just_keys as select pk_column from big_historical_table;
> alter table just_keys add column processed boolean;
> create index jkpk on just_keys(pk_column) where (processed is null);
> Then loop repeatedly along the lines of:
>
> create temp table iteration as select pk_column from just_keys where
> processed is null limit 1000;
> [do update on big_historical_table where pk_column in (select pk_column
> from iteration)]
> update iteration set processed='true' where pk_column in (select pk_column
> from iteration);
> drop table iteration;
>
>
> Parallelization is absolutely an interesting idea; if you want to use 8
> processes, then use a cycling sequence on the side table to spread tuples
> across the 8 processes, so that they can grab their own tuples and not
> block one another.
>
> In that case, more like...
> create temp sequence seq_procs start with 1 maxvalue 8 cycle;
> create temp table just_keys as select pk_column, false::boolean as
> processed, nextval('seq_procs') as batch_id from big_historical_table;
>
> The individual iterations then look for values in just_keys corresponding
> to their assigned batch number.
>
>
> Sounds like a reasonable approach. As Justin pointed out, it is actually
> likely that the process will be I/O bound rather than CPU bound, so my
> parallel idea may not have much merit after all, but the batching procedure
> makes sense. I assume you meant "update just_keys" in your sample rather
> than "update iteration" on the line just before "drop table iteration".
> Thanks for the info.
>
As for parallelism, if you have a really powerful disk subsystem (lots of
disks in a disk array), it may help. Or not, as already commented.
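If you do try the parallel route, each worker grabs only the keys tagged
with its own batch number, so the workers never fight over the same rows. A
minimal untested sketch for worker 3 of 8, using the batch_id column from
the sketch above:

    -- worker 3's slice: only keys assigned batch_id = 3
    create temp table iteration as
      select pk_column from just_keys
      where not processed and batch_id = 3
      limit 1000;

and then it runs the update/mark-processed steps against that temp table,
exactly as in the corrected loop just below.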
I didn't test my wee bit of code, so yep, I meant to update just_keys :-).
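Since that bug snuck in, here is the whole loop written out as a server-side
procedure. This is a minimal untested sketch, assuming PostgreSQL 11 or later
(for COMMIT inside a procedure), a bigint pk_column, and your new_value()
function doing the real work:

    create or replace procedure update_in_batches()
    language plpgsql
    as $$
    declare
      n bigint;
    begin
      create temp table if not exists iteration (pk_column bigint);
      loop
        truncate iteration;
        -- grab the next batch of unprocessed keys
        insert into iteration
          select pk_column from just_keys
          where processed is null
          limit 1000;
        get diagnostics n = row_count;
        exit when n = 0;

        update big_historical_table
          set changed_field = new_value()
          where pk_column in (select pk_column from iteration);

        -- mark them done in the side table (just_keys, not iteration!)
        update just_keys
          set processed = true
          where pk_column in (select pk_column from iteration);

        commit;  -- each batch is durable on its own
      end loop;
      drop table iteration;
    end;
    $$;

Then it's just CALL update_in_batches(); from psql, and if anything blows up
mid-run, you only lose the batch that was in flight.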
You won't find anything terribly much more automatic than that.
Oh, yah, there's a possible further complication: does the application need
to be stopped to do this update? Is the newest version of the app still
generating data that needs rewriting? Sure hope not...