Re: How to do faster DML

From: Lok P <loknath(dot)73(at)gmail(dot)com>
To: Francisco Olarte <folarte(at)peoplecall(dot)com>
Cc: Greg Sabino Mullane <htamfids(at)gmail(dot)com>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: How to do faster DML
Date: 2024-02-03 20:44:20
Message-ID: CAKna9VbwxbRquy2+xzpdug1M4MntKg9uaPx5KzwJtpYt0BQz-A@mail.gmail.com
Lists: pgsql-general

On Sun, Feb 4, 2024 at 12:50 AM Francisco Olarte <folarte(at)peoplecall(dot)com>
wrote:

> On Sat, 3 Feb 2024 at 19:29, Greg Sabino Mullane <htamfids(at)gmail(dot)com>
> wrote:
> ...
> > Given the size of your table, you probably want to divide that up.
> > As long as nothing is changing the original table, you could do:
> >
> > insert into mytable2 select * from mytable1 order by ctid limit
> 10_000_000 offset 0;
> > insert into mytable2 select * from mytable1 order by ctid limit
> 10_000_000 offset 10_000_000;
> > insert into mytable2 select * from mytable1 order by ctid limit
> 10_000_000 offset 20_000_000;
>
> Is it me or does this have the faint smell of quadratic behaviour? I
> mean, you need to read and discard the first 10M to do offset 10M (
> also I believe ctid order is implied in sequential scan, but no so
> sure, if not it would need a full table sort on each pass ).
>
> When doing things like this, I normally have some kind of unique index
> and do it by selecting with limit above the last read value (stored
> when doing it via code, as I normally load first, index later, so I
> cannot select max() fast on the target). Or, with some kind of
> "sparse" index (meaning, groups much less than the batch size) and a
> little code you can select where index_col > last order by index_col
> limit 10M, remember the last received index_col and reselect discarding
> missing (or just reselect it when doing your on conflict do nothing,
> which also has a fast select max(id) due to the PK; it will work if it
> has an index on the id column on the original even if not unique) to
> avoid that.
>
> Also, I'm not sure if ctid is ordered and you can select where
> ctid>last ordered, if that works it probably is faster for immutable
> origins.
>
> Francisco Olarte.

Thank you.

Yes, I think the "on conflict do nothing" option looks promising, as it
will skip the duplicates on a PK violation and keep the load going for the
subsequent rows.
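A minimal sketch of that option (table names mytable1/mytable2 follow the example earlier in the thread; it assumes the PK is on an id column):

```sql
-- Rows whose id already exists in mytable2 hit the PK and are
-- silently skipped; the remaining rows keep loading.
INSERT INTO mytable2
SELECT * FROM mytable1
ON CONFLICT (id) DO NOTHING;
```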

However, we have ~5 billion rows in the base table, and of those we expect
almost half, i.e. ~2 billion, to be duplicates. And as you said, doing the
inserts using the "order by ctid offset" approach would cause one full
sequential scan of the whole table for each chunk of 10M rows loaded, which
I believe would take a long time.

I am still trying to understand the other approach which you suggested. I
am not able to follow "you can select where index_col > last order by
index_col limit 10M".
However, to get the max ID value of the last 10M loaded rows in the target,
are you saying that having a PK index on that target table column (ID)
will help, and that we can save the max(ID) value in another table after
each batch, then keep loading from the source table with ID > Max_ID
stored in that temp table?
OR
Would it be better to do it in one shot, but with higher values for
parameters like "maintenance_work_mem" or "max_parallel_workers"?
