From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Israel Brewster <ijbrewster(at)alaska(dot)edu>
Cc: "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: UPDATE many records
Date: 2020-01-06 19:08:09
Message-ID: CAFNqd5X5FXzFL+vhx=mg40gv0nPV9cJ6yJZU9f5hFYCUbkg31Q@mail.gmail.com
Lists: pgsql-general
On Mon, 6 Jan 2020 at 13:36, Israel Brewster <ijbrewster(at)alaska(dot)edu> wrote:
> Thanks to a change in historical data, I have a need to update a large
> number of records (around 50 million). The update itself is straight
> forward, as I can just issue an "UPDATE table_name SET
> changed_field=new_value();" (yes, new_value is the result of a stored
> procedure, if that makes a difference) command via psql, and it should
> work. However, due to the large number of records this command will
> obviously take a while, and if anything goes wrong during the update (one
> bad value in row 45 million, lost connection, etc), all the work that has
> been done already will be lost due to the transactional nature of such
> commands (unless I am missing something).
>
> Given that each row update is completely independent of any other row, I
> have the following questions:
>
> 1) Is there any way to set the command such that each row change is
> committed as it is calculated?
> 2) Is there some way to run this command in parallel in order to better
> utilize multiple processor cores, other than manually breaking the data
> into chunks and running a separate psql/update process for each chunk?
> Honestly, manual parallelizing wouldn’t be too bad (there are a number of
> logical segregations I can apply), I’m just wondering if there is a more
> automatic option.
>
Yeah, I'd be inclined to do this in batches.
If, for instance, the table has a nice primary key, then I'd capture the
primary keys into a side table, and grab tuples from the side table to
process in more bite-sized batches, say, of a few thousand tuples per batch.
create table just_keys as select pk_column from big_historical_table;
alter table just_keys add column processed boolean;
create index jkpk on just_keys(pk_column) where (processed is null);
Then loop repeatedly along these lines:
create temp table iteration as select pk_column from just_keys where
processed is null limit 1000;
[do update on big_historical_table where pk_column in (select pk_column
from iteration)]
update just_keys set processed = true where pk_column in (select pk_column
from iteration);
drop table iteration;
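As a minimal sketch, that loop could be wrapped in a single DO block (this assumes PostgreSQL 11+, which allows COMMIT inside DO when it is not run inside an outer transaction; table and column names are the placeholders used above, and new_value() stands in for the original poster's stored procedure):

```sql
do $$
declare
  n bigint;
begin
  loop
    -- grab the next batch of unprocessed keys
    create temp table iteration as
      select pk_column
        from just_keys
       where processed is null
       limit 1000;

    select count(*) into n from iteration;
    exit when n = 0;

    -- apply the expensive update only to this batch
    update big_historical_table
       set changed_field = new_value()
     where pk_column in (select pk_column from iteration);

    -- mark the batch done in the side table
    update just_keys
       set processed = true
     where pk_column in (select pk_column from iteration);

    drop table iteration;
    commit;  -- each batch is durable even if a later batch fails
  end loop;

  drop table iteration;  -- clean up the final, empty batch table
end
$$;
```

If anything goes wrong partway through, only the current batch rolls back; re-running the block picks up where it left off, since finished keys are already marked processed.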
Parallelization is absolutely an interesting idea; if you want to use 8
processes, then use a cycling sequence on the side table to spread tuples
across the 8 processes, so that they can grab their own tuples and not
block one another.
In that case, more like...
create temp sequence seq_procs start with 1 maxvalue 8 cycle;
create table just_keys as  -- a regular table, so the worker sessions can see it
  select pk_column, false::boolean as processed,
         nextval('seq_procs') as batch_id
    from big_historical_table;
The individual iterations then look for values in just_keys corresponding
to their assigned batch number.
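Concretely, each worker's iteration might look like the following sketch, with its batch number passed in via psql's -v option (the :batch_id variable name and the limit of 1000 are illustrative, not from the original post):

```sql
-- Run as e.g.: psql -d mydb -v batch_id=3 -f worker.sql
-- Each worker touches only rows whose batch_id matches its own,
-- so the 8 processes never contend for the same tuples.
create temp table iteration as
  select pk_column
    from just_keys
   where processed = false
     and batch_id = :batch_id
   limit 1000;

update big_historical_table
   set changed_field = new_value()
 where pk_column in (select pk_column from iteration);

update just_keys
   set processed = true
 where pk_column in (select pk_column from iteration);

drop table iteration;
```

Because the cycling sequence dealt the keys out round-robin, the 8 slices are nearly equal in size, so the workers should finish at roughly the same time.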
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"