How to do faster DML

From: Lok P <loknath(dot)73(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: How to do faster DML
Date: 2024-02-03 12:20:02
Message-ID: CAKna9VbS8qeUj1NpzodAxgOEJGp+g0zE9BfEsoGvYgL7Jr+CnQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello All,
A non partitioned table having ~4.8 billion rows in it and having data size
as ~1.4TB , row size as ~313 bytes having ~127 columns in it. This has got
approx ~1billion+ duplicate rows inserted in it and we want to get the
duplicate data removed for this table and create a PK/unique constraint
back so as to not have the duplicate values in future. We are struggling to
do the same.

Teammates suggested doing this using CTAS method, i.e. create a new table
with the unique record set and then drop the main table. Something as below

create table TAB1_New
as
SELECT * from TAB1 A
where ID in
(select min(ID) from TAB1
group by ID having count(ID)>=1 );

But for the above to work faster , they mentioned to have an index created
on the column using which the duplicate check will be performed i.e ID
column. So, creating the index itself took ~2hrs+ and the index size now
shows as ~116GB.

*Create index idx1 on TAB1(ID)*
And now running the SELECT subquery part of the CTAS statement to see if
its giving the correct unique records count. It ran for 2.5 hrs and then we
killed it. Below is the plan for the same.

explain
*select min(ID) from TAB1 A group by ID having count(ID)>=1*

GroupAggregate (cost=0.71..6025790113.87 rows=29432861 width=46)
Group Key: ID
Filter: (count(ID) >= 1)
-> Index Only Scan using idx1 on TAB1 a (cost=0.71..5988060903.17
rows=4883397120 width=14)

I want to understand if by any way this can be done faster . Also I am
worried that creating PK constraint/index back after deleting the duplicate
is also going to run forever. Is there any way we can make these heavy
operations faster on postgre by facilitating more database resources
through some parameter setup, like parallel hint etc? We have pg_hint_plan
extension added, but not seeing the parallel hint enforced when adding it
to the query.

In Oracle we have Parallel hints, Direct path read/write for faster
read/write operations, parallel index scan etc. available, if anything
similar to that available in aurora postgre to facilitate more
horsepower and speed up the batch operations. And , how can we monitor
progress of any running query ?
Just to note- It's a db.r7g.8xlarge aurora postgres instance, 32VCPU,
256GB RAM. PG version 15.4.

Regards
Lok

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Lok P 2024-02-03 13:55:12 Re: How to do faster DML
Previous Message Chris Angelico 2024-02-03 06:06:35 Re: Logical replication and AFTER UPDATE triggers [PG 16]