Re: How to do faster DML

From: "Peter J(dot) Holzer" <hjp-pgsql(at)hjp(dot)at>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: How to do faster DML
Date: 2024-02-04 11:05:40
Message-ID: 20240204110540.veakdia4z62pyuap@hjp.at
Lists: pgsql-general

On 2024-02-04 02:14:20 +0530, Lok P wrote:
> However, as we have ~5 billion rows in the base table and out of that, we were
> expecting almost half, i.e. ~2 billion, would be duplicates.

That's way more than I expected from your original description. And it
of course raises the question whether it's a good idea to just throw
away all that data or if you need to keep that in a normalized way.

> And you said, doing the inserts using the "order by CTID Offset"
> approach must cause one full sequential scan of the whole table for
> loading each chunk/10M of rows and that would take a long time I
> believe.
>
> I am still trying to understand the other approach which you suggested. Not
> able to understand "you can select where index_col > last order by index_col
> limit 10M".
> However, to get the max ID value of the last 10M loaded rows in target, do you
> say that having a PK index created on that target table column (ID) will
> help,

Yes. Getting the maximum value from an index is a very fast operation.
You just have to traverse down the right edge of the tree (or you may
even be able to access the right-most leaf page directly).
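
For example, something like this (a quick, untested sketch in Python with
psycopg2; the connection string, "target_table" and the "id" column are just
placeholders for your setup) should come back almost instantly even on a
multi-billion-row table:

    import psycopg2

    # Placeholder connection string - adjust for your environment.
    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        # max() on an indexed column is answered from the index itself,
        # not by scanning the table.
        cur.execute("SELECT max(id) FROM target_table")
        last_id = cur.fetchone()[0]   # None if the target is still empty
    print(last_id)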

>  and we can save the max (ID) value subsequently in another table to fetch
> and keep loading from the source table (as ID>Max_ID stored in temp table)?

Another table, or a variable in a script. Personally, if I need to do
something repeatedly, I usually write a script in the scripting language
I feel most comfortable in (which has been Python for the last 7 or 8
years, Perl before that). That gives you variables, loops, conditionals
and - above all - repeatability.
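
In Python such a script could be roughly as simple as this (untested sketch
with psycopg2; "source_table", "target_table", the "id" column and the
connection string are placeholders, and the inner SELECT would of course be
whatever de-duplicating query you actually run):

    import psycopg2

    CHUNK = 10_000_000  # rows per batch

    # Placeholder connection string - adjust for your environment.
    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()

    # Start from whatever has already been loaded (the max() comes
    # straight from the index, see above).
    cur.execute("SELECT max(id) FROM target_table")
    last_id = cur.fetchone()[0]
    if last_id is None:
        last_id = -1          # assumes ids are non-negative

    while True:
        cur.execute(
            """
            WITH ins AS (
                INSERT INTO target_table
                SELECT *                  -- replace with your real query
                FROM source_table
                WHERE id > %s
                ORDER BY id
                LIMIT %s
                RETURNING id
            )
            SELECT count(*), max(id) FROM ins
            """,
            (last_id, CHUNK),
        )
        n, max_id = cur.fetchone()
        conn.commit()          # one transaction per chunk
        if n == 0:
            break              # nothing left to copy
        last_id = max_id

Since each chunk commits on its own, you can kill the script (or have it die)
half way through and simply restart it: it continues after the last committed
chunk.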

> OR
> Would it be better to do it in one shot only, but by setting a higher value of
> some parameters like "maintenance_work_mem" or "max_parallel_workers"?

Hard to say. Normally, processing in fewer, bigger chunks is faster. But
RAM is much faster than disk (even with SSDs), so it might be faster to
make work_mem as large as you can and then use a chunk size which just
fits inside work_mem. Of course, finding that sweet spot takes
experimentation, hence time, and it may make little sense to experiment
for 20 hours just to save 40 minutes.
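
If you do want to try the one-shot route, those settings can be raised for
just that session, e.g. (again only a rough, untested sketch; the values are
arbitrary starting points for experimentation, not recommendations):

    import psycopg2

    # Placeholder connection string - adjust for your environment.
    conn = psycopg2.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        # Session-level overrides; they go away when the connection closes.
        cur.execute("SET work_mem = '1GB'")
        cur.execute("SET maintenance_work_mem = '4GB'")
        cur.execute("SET max_parallel_workers_per_gather = 8")
        # The real de-duplicating INSERT ... SELECT goes here.
        cur.execute("INSERT INTO target_table SELECT * FROM source_table")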

hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp(at)hjp(dot)at  |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
