Re: How to do faster DML

From: Lok P <loknath(dot)73(at)gmail(dot)com>
To: Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to do faster DML
Date: 2024-02-03 16:09:44
Message-ID: CAKna9VZB09CBVq2eV3ohSMTgXEcOSbUdYnfqzBtLm5qmuU9_dQ@mail.gmail.com

On Sat, Feb 3, 2024 at 7:37 PM Ron Johnson <ronljohnsonjr(at)gmail(dot)com> wrote:

> On Sat, Feb 3, 2024 at 8:55 AM Lok P <loknath(dot)73(at)gmail(dot)com> wrote:
>
>> Apologies, one correction: the query is as below, i.e. the filter will be
>> on ctid, which I believe is the equivalent of rowid in Oracle, so we will
>> not need the index on the ID column.
>>
>> But it still runs long, so is there any other way to make the duplicate
>> removal faster?
>>
>> Also wondering: could the index creation, which took ~2.5hrs+, have been
>> made any faster by allowing it more database resources through some
>> session-level parameter settings?
>>
>> create table TAB1_New
>> as
>> SELECT * from TAB1 A
>> where CTID in
>> (select min(CTID) from TAB1
>> group by ID having count(ID)>=1 );
>>
>>
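Incidentally, a minimal alternative sketch for the min(CTID) query quoted
above, assuming the same TAB1/ID layout (untested at this scale): DISTINCT ON
keeps the physically first row per ID and avoids the grouped subquery.

-- Hedged sketch, not benchmarked at 4.8B rows; ordering by ctid needs PG 14+.
create table TAB1_New as
SELECT DISTINCT ON (ID) *
from TAB1
order by ID, CTID;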
>> On Sat, Feb 3, 2024 at 5:50 PM Lok P <loknath(dot)73(at)gmail(dot)com> wrote:
>>
>>> Hello All,
>>> We have a non-partitioned table with ~4.8 billion rows, ~1.4TB of data,
>>> an average row size of ~313 bytes, and ~127 columns. Roughly 1 billion+
>>> duplicate rows have been inserted into it, and we want to remove the
>>> duplicate data and then add a PK/unique constraint back so that duplicate
>>> values cannot occur in future. We are struggling to do this.
>>>
>>> Teammates suggested doing this with the CTAS method, i.e. create a new
>>> table with the unique record set and then drop the main table. Something
>>> like below:
>>>
>>> create table TAB1_New
>>> as
>>> SELECT * from TAB1 A
>>> where ID in
>>> (select min(ID) from TAB1
>>> group by ID having count(ID)>=1 );
>>>
>>> But to make the above run faster, they suggested creating an index on the
>>> column used for the duplicate check, i.e. the ID column. Creating the
>>> index itself took ~2hrs+, and the index size now shows as ~116GB.
>>>
>>>
>>> Create index idx1 on TAB1(ID)
>>> We then ran the SELECT subquery part of the CTAS statement to check that
>>> it gives the correct unique record count. It ran for 2.5 hrs and then we
>>> killed it. Below is the plan for it.
>>>
>>> explain
>>> select min(ID) from TAB1 A group by ID having count(ID)>=1
>>>
>>> GroupAggregate (cost=0.71..6025790113.87 rows=29432861 width=46)
>>> Group Key: ID
>>> Filter: (count(ID) >= 1)
>>> -> Index Only Scan using idx1 on TAB1 a (cost=0.71..5988060903.17
>>> rows=4883397120 width=14)
>>>
>>> I want to understand whether this can be done any faster. I am also
>>> worried that recreating the PK constraint/index after deleting the
>>> duplicates is going to run forever. Is there any way to make these heavy
>>> operations faster in Postgres by giving them more database resources
>>> through some parameter setup, like a parallel hint etc.? We have the
>>> pg_hint_plan extension installed, but the parallel hint is not being
>>> enforced when we add it to the query.
>>>
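For reference, the pg_hint_plan parallel hint syntax I am aware of is roughly
as below (a sketch; whether Aurora honors it for this plan shape is
unverified):

-- The hint goes in a leading comment; "a" is the table alias, 16 the
-- requested worker count, and "hard" forces parallelism rather than
-- merely suggesting it.
/*+ Parallel(a 16 hard) */
explain
select min(ID) from TAB1 A group by ID having count(ID)>=1;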
>>> In Oracle we have parallel hints, direct-path read/write for faster I/O,
>>> parallel index scans, etc. Is anything similar available in Aurora
>>> Postgres to provide more horsepower and speed up these batch operations?
>>> And how can we monitor the progress of a running query?
>>> Just to note: it's a db.r7g.8xlarge Aurora Postgres instance, 32 vCPU,
>>> 256GB RAM, PG version 15.4.
>>>
>>
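For the progress-monitoring question above, the stock catalog views below
look relevant (a sketch; assumes the vanilla PostgreSQL 15 views, which I
believe Aurora exposes, but I have not verified that):

-- Progress of a running CREATE INDEX (PostgreSQL 12+):
SELECT pid, phase, blocks_done, blocks_total, tuples_done, tuples_total
FROM pg_stat_progress_create_index;

-- What each active backend is doing, and for how long:
SELECT pid, now() - query_start AS runtime, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state = 'active';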
> Aurora is not PostgreSQL, so configurations might not work. Having said
> that...
> https://www.postgresql.org/docs/15/how-parallel-query-works.html
>
> And have you analyzed the table lately? Also, what's your work_mem
> and maintenance_work_mem?
>

Thank you.

Below are the values of the default parameters on this instance:

show max_worker_processes;              -- 128
show max_parallel_workers_per_gather;   -- 4
show max_parallel_workers;              -- 32
show max_parallel_maintenance_workers;  -- 2
show maintenance_work_mem;              -- 4155MB
show work_mem;                          -- 8MB
show shared_buffers;                    -- 22029684 (presumably 8kB pages, ~168GB)
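(The same values can also be pulled in one shot from pg_settings; a
convenience sketch:)

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_worker_processes', 'max_parallel_workers',
               'max_parallel_workers_per_gather',
               'max_parallel_maintenance_workers',
               'maintenance_work_mem', 'work_mem', 'shared_buffers');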

When I ran the CTAS queries and the index creation, I did not have a very
clear idea of how these parameters relate to each other or how each one helps
which operation, but I set a few of them as below in the same session before
triggering the work.

set max_parallel_workers_per_gather=16;
SET max_parallel_maintenance_workers TO 16;
SET maintenance_work_mem TO '16 GB';

The instance has ~256 GB of memory in total, so how should I adjust/bump
these values when running heavy SELECT queries doing large sequential scans,
large index creation, or SELECT queries with heavy sorting/ORDER BY or heavy
JOINs?
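For what it's worth, a sketch of the session-level knobs I understand to
matter for each case (values purely illustrative, not recommendations for
this 256GB instance):

-- Big sorts / hash joins: work_mem is per sort/hash node and per parallel
-- worker, so large values are only safe in a single controlled session.
SET work_mem = '1GB';

-- Large index builds: sort memory plus parallel btree build workers.
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 8;

-- Parallel sequential scans for big SELECT/CTAS queries:
SET max_parallel_workers_per_gather = 8;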

I have not analyzed the table manually, but seeing the last_autovacuum and
last_autoanalyze columns getting populated in pg_stat_user_tables, I assumed
it is happening automatically.
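(A quick check of those columns, for reference:)

SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'tab1';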

By the way, if we run "ANALYZE tab1" on this 1.5TB table, will that run
long, and will any of the above parameters help expedite the ANALYZE if I
run it manually?
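My understanding (hedged) is that ANALYZE only samples the table, roughly
300 x default_statistics_target rows, so it should be far cheaper than a
full scan even at 1.5TB:

-- Sketch: manual ANALYZE; the sample size scales with the statistics target.
SET default_statistics_target = 100;  -- the default; raise it for finer stats
ANALYZE VERBOSE tab1;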

Regards
Lok
