Re: Insert into on conflict, data size upto 3 billion records

From: Ron <ronljohnsonjr(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Insert into on conflict, data size upto 3 billion records
Date: 2021-02-13 20:04:17
Message-ID: 2a876340-e227-c96a-1d79-f8a74aa45e4a@gmail.com
Lists: pgsql-general

On 2/12/21 12:46 PM, Karthik Kumar Kondamudi wrote:
> Hi,
>
> I'm looking for suggestions on how I can improve the performance of the
> merge statement below. We have a batch process that loads data into the
> _batch tables using Postgres, and the task is to update the main target
> tables if a record already exists and insert it otherwise; sometimes these
> batch tables can grow to 5 billion records. Here is the current scenario:
>
> |target_table_main| has 700,070,247 records and is hash partitioned into
> 50 chunks; it has an index on |logical_ts|. The batch table has
> 2,715,020,546 records, close to 3 billion, so I'm dealing with a huge set
> of data and am looking to do this in the most efficient way.
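
For concreteness, I'm assuming the statement is roughly of this shape (the
conflict key and the column names other than logical_ts are guesses on my
part, so adjust to your real schema):

    -- Assumed shape of the upsert; conflict key and columns are placeholders.
    INSERT INTO target_table_main (id, logical_ts, payload)
    SELECT id, logical_ts, payload
    FROM   target_table_batch
    ON CONFLICT (id)
    DO UPDATE SET logical_ts = EXCLUDED.logical_ts,
                  payload    = EXCLUDED.payload;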

Many times, I have drastically sped up batch processing by (1) partitioning
on the same field as the index, and (2) pre-sorting the input data by that field.

That way, you get excellent "locality of data" (meaning lots of writes to
the same hot bits of cache, which later get asynchronously flushed to
disk).  Unfortunately for your situation, the purpose of hash partitioning
is to /reduce/ locality of data.  (Sometimes that's useful, but *not* when
processing batches.)
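
As a sketch of what I mean (assuming, purely for illustration, range
partitioning on logical_ts and an (id, logical_ts) key, since Postgres
requires the partition key to be part of any unique constraint):

    -- Sketch only: assumed schema, adjust to the real one.
    CREATE TABLE target_table_main (
        id          bigint      NOT NULL,
        logical_ts  timestamptz NOT NULL,
        payload     text,
        PRIMARY KEY (id, logical_ts)   -- partition key must be in the key
    ) PARTITION BY RANGE (logical_ts);

    CREATE TABLE target_table_main_2021_02
        PARTITION OF target_table_main
        FOR VALUES FROM ('2021-02-01') TO ('2021-03-01');

    -- Pre-sorted input writes to one partition / index range at a time.
    INSERT INTO target_table_main (id, logical_ts, payload)
    SELECT id, logical_ts, payload
    FROM   target_table_batch
    ORDER  BY logical_ts
    ON CONFLICT (id, logical_ts)
    DO UPDATE SET payload = EXCLUDED.payload;

The specifics depend on your actual conflict key and constraints; the point
is just that sorted input plus partitioning on the same field keeps the
writes clustered.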

--
Angular momentum makes the world go 'round.
