Re: update faster way

From: "Peter J(dot) Holzer" <hjp-pgsql(at)hjp(dot)at>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: update faster way
Date: 2024-09-15 22:18:33
Message-ID: 20240915221833.l2zhege3o3okfhgq@hjp.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 2024-09-14 20:26:32 +0530, yudhi s wrote:
>
>
> On Sat, Sep 14, 2024 at 4:55 PM Peter J. Holzer <hjp-pgsql(at)hjp(dot)at> wrote:
>
>
> Which in turn means that you want as little overhead as possible per
> batch which means finding those 5000 rows should be quick. Which brings
> us back to Igor's question: Do you have any indexes in place which speed
> up finding those 5000 rows (the primary key almost certainly won't help
> with that). EXPLAIN (ANALYZE) (as suggested by Laurenz) will certainly
> help answering that question.
>
> > And also those rows will not collide with each other. So do you think
> > that approach can anyway cause locking issues?
>
> No, I don't think so. With a batch size that small I wouldn't expect
> problems even on the live partition. But of course many busy parallel
> sessions will put additional load on the system which may or may not be
> noticeable by users (you might saturate the disks writing WAL entries
> for example, which would slow down other sessions trying to commit).
>
>
> > Regarding batch update with batch size of 1000, do we have any method
> exists in
> > postgres (say like forall statement in Oracle) which will do the batch
> dml. Can
> > you please guide me here, how we can do it in postgres.
>
> Postgres offers several server side languages. As an Oracle admin you
> will probably find PL/pgSQL most familiar. But you could also use Perl
> or Python or several others. And of course you could use any
> programming/scripting language you like on the client side.
>
>
>
>  When you said "(the primary key almost certainly won't help with that)", I am
> trying to understand why it is so ? 

I was thinking that you would do something like

begin;

update with a as (
select id from the_table
where :part_lower <= id and id < :part_upper and col_x = :old
limit 5000
)
update the_table set col_x = :new
from a where the_table.id = a.id;

commit;

in a loop until you you update 0 rows and then switch to the next
partition. That pretty much requires an index on col_x or you will need
a sequential scan to find the next 5000 rows to update.

Even if you return the ids and leed the last updated id back into the
loop like this:

update with a as (
select id from the_table
where id > :n and col_x = :old
order by id
limit 5000
)
update the_table set col_x = :new
from a where the_table.id = a.id;

that may lead to a lot of extra reads from the heap or the optimizer
might even decide it's better to go for a sequential scan.

The latter is pretty unlikely if you restrict the range of ids:

update the_table set col_x = :new
where :n <= id and id < :n + 5000 and col_x = :old;

but that will possible result in a lot of queries which don't update
anything at all but still need to read 5000 rows each.

> I was thinking of using that column as an incrementing filter and driving the
> eligible rows based on that filter. And if it would have been a sequence. I
> think it would have helped but in this case it's UUID , so I may not be able to
> do the batch DML using that as filter criteria.

You can order by uuid or compare them to other uuids. So my first two
approaches above would still work.

> but in that case will it be fine to drive the update based on ctid
> something as below? Each session will have the range of 5 days of data
> or five partition data and will execute a query something as below
> which will update in the batches of 10K and then commit. Is this fine?
> Or is there some better way of doing the batch DML in postgres
> plpgsql?
>
> DO $$
> DECLARE
>     l_rowid_array ctid[];
>     l_ctid ctid;
>     l_array_size INT := 10000;
>     l_processed INT := 0;
> BEGIN
>  
>     FOR l_cnt IN 0..(SELECT COUNT(*) FROM part_tab WHERE   part_date >
> '1-sep-2024' and part_date < '5-sep-2024'
> ) / l_array_size LOOP
>         l_rowid_array := ARRAY(
>             SELECT ctid
>             FROM part_tab
>             WHERE part_date   > '1-sep-2024' and part_date < '5-sep-2024'
>             LIMIT l_array_size OFFSET l_cnt * l_array_size

Never use LIMIT and OFFSET without an ORDER BY, especially not when you
are updating the table. You may get some rows twice and some never.
ALso OFFSET means you are reading all those rows and then ignoring
them. I expect this to be O(n²).

>         );
>  
>         FOREACH l_ctid IN ARRAY l_rowid_array LOOP
>             update  part_tab
>             SET column1 = reftab.code
>            FROM reference_tab reftab

I see you are using a reference table and I think you mentioned that you
will be updating most rows. So that alleviates my concerns that you may
read lots of rows without updating them. But you still need an efficient
way to get at the next rows to update.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp(at)hjp(dot)at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Andy Hartman 2024-09-16 15:35:35 load fom csv
Previous Message Peter J. Holzer 2024-09-15 21:22:48 Re: Manual query vs trigger during data load