Re: How to do faster DML

From: Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to do faster DML
Date: 2024-02-03 14:06:49
Message-ID: CANzqJaAYq+LNYuQijmrP5Zu+QNSO9eMXc2V=fEZme7HZDp5Y3w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Sat, Feb 3, 2024 at 8:55 AM Lok P <loknath(dot)73(at)gmail(dot)com> wrote:

> Apology. One correction, the query is like below. I. E filter will be on
> on ctid which I believe is equivalent of rowid in oracle and we will not
> need the index on Id column then.
>
> But, it still runs long, so thinking any other way to make the duplicate
> removal faster?
>
> Also wondering , the index creation which took ~2.5hrs+ , would that have
> been made faster any possible way by allowing more db resource through some
> session level db parameter setting?
>
> create table TAB1_New
> as
> SELECT * from TAB1 A
> where CTID in
> (select min(CTID) from TAB1
> group by ID having count(ID)>=1 );
>
>
> On Sat, Feb 3, 2024 at 5:50 PM Lok P <loknath(dot)73(at)gmail(dot)com> wrote:
>
>> Hello All,
>> A non partitioned table having ~4.8 billion rows in it and having data
>> size as ~1.4TB , row size as ~313 bytes having ~127 columns in it. This has
>> got approx ~1billion+ duplicate rows inserted in it and we want to get the
>> duplicate data removed for this table and create a PK/unique constraint
>> back so as to not have the duplicate values in future. We are struggling to
>> do the same.
>>
>> Teammates suggested doing this using CTAS method, i.e. create a new table
>> with the unique record set and then drop the main table. Something as below
>>
>> create table TAB1_New
>> as
>> SELECT * from TAB1 A
>> where ID in
>> (select min(ID) from TAB1
>> group by ID having count(ID)>=1 );
>>
>> But for the above to work faster , they mentioned to have an index
>> created on the column using which the duplicate check will be performed i.e
>> ID column. So, creating the index itself took ~2hrs+ and the index size now
>> shows as ~116GB.
>>
>>
>> *Create index idx1 on TAB1(ID)*
>> And now running the SELECT subquery part of the CTAS statement to see if
>> its giving the correct unique records count. It ran for 2.5 hrs and then we
>> killed it. Below is the plan for the same.
>>
>> explain
>> *select min(ID) from TAB1 A group by ID having count(ID)>=1*
>>
>> GroupAggregate (cost=0.71..6025790113.87 rows=29432861 width=46)
>> Group Key: ID
>> Filter: (count(ID) >= 1)
>> -> Index Only Scan using idx1 on TAB1 a (cost=0.71..5988060903.17
>> rows=4883397120 width=14)
>>
>> I want to understand if by any way this can be done faster . Also I am
>> worried that creating PK constraint/index back after deleting the duplicate
>> is also going to run forever. Is there any way we can make these heavy
>> operations faster on postgre by facilitating more database resources
>> through some parameter setup, like parallel hint etc? We have pg_hint_plan
>> extension added, but not seeing the parallel hint enforced when adding it
>> to the query.
>>
>> In Oracle we have Parallel hints, Direct path read/write for faster
>> read/write operations, parallel index scan etc. available, if anything
>> similar to that available in aurora postgre to facilitate more
>> horsepower and speed up the batch operations. And , how can we monitor
>> progress of any running query ?
>> Just to note- It's a db.r7g.8xlarge aurora postgres instance, 32VCPU,
>> 256GB RAM. PG version 15.4.
>>
>
Aurora is not Postgresql, so configurations might not work. Having said
that...
https://www.postgresql.org/docs/15t/how-parallel-query-works.html

And have you analyzed the table lately? Also, what's your work_mem
and maintenance_work_mem?

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Lok P 2024-02-03 16:09:44 Re: How to do faster DML
Previous Message Todd Lewis 2024-02-03 13:58:46 Re: How to do faster DML