Re: Insert into on conflict, data size up to 3 billion records

From: Karthik K <kar6308(at)gmail(dot)com>
To: Ron <ronljohnsonjr(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Insert into on conflict, data size up to 3 billion records
Date: 2021-02-15 18:41:40
Message-ID: CAExg72eOp0cOmaptkO9B09yErd7hEKjxe=PVUcGqX5RoJ0jWKg@mail.gmail.com
Lists: pgsql-general

Exactly. For now, since the table is already partitioned, I opened 50
separate connections and updated the target table by querying the source
partition tables directly. Are there any other techniques I can use to
speed this up?

Also, when we use an ON CONFLICT statement for both insert and update,
does Postgres batch internally (committing every 10,000 records or so),
or does it apply all the records at once? In that case, does it create a
new row version for each record and make them all visible in one swap?
I'm wondering how atomicity is guaranteed. And if I do have to batch on
my side, beyond selecting from the individual partitions, does doing it
in batches of 10,000 records help?
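
To make my question concrete, here is the kind of sliced upsert I have
in mind. The table and column names (target_table_main, source_batch_p01,
id, logical_ts, payload) are placeholders rather than the real schema,
and it assumes a unique index on id as the ON CONFLICT target:

  BEGIN;

  -- one ~10,000-row slice of one source partition; the next run
  -- takes the next id range
  INSERT INTO target_table_main (id, logical_ts, payload)
  SELECT id, logical_ts, payload
  FROM   source_batch_p01
  WHERE  id BETWEEN 1 AND 10000
  ON CONFLICT (id) DO UPDATE
     SET logical_ts = EXCLUDED.logical_ts,
         payload    = EXCLUDED.payload;

  COMMIT;  -- each slice is its own transaction

versus the single INSERT ... SELECT ... ON CONFLICT per source partition
that each connection runs today.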

On Sat, Feb 13, 2021 at 12:04 PM Ron <ronljohnsonjr(at)gmail(dot)com> wrote:

> On 2/12/21 12:46 PM, Karthik Kumar Kondamudi wrote:
>
> Hi,
>
> I'm looking for suggestions on how I can improve the performance of the
> merge statement below. We have a batch process that loads data into the
> _batch tables using Postgres, and the task is to update the main target
> tables if the record already exists and insert it otherwise; sometimes
> these batch tables grow to 5 billion records. Here is the current scenario:
>
> target_table_main has 700,070,247 records and is hash partitioned into
> 50 chunks, with an index on logical_ts. The batch table has
> 2,715,020,546 records, close to 3 billion, so I'm dealing with a huge
> amount of data and am looking to do this in the most efficient way.
>
>
> Many times, I have drastically sped up batch processing by #1 partitioning
> on the same field as an index, and #2 pre-sorting the input data by that
> field.
>
> That way, you get excellent "locality of data" (meaning lots of writes to
> the same hot bits of cache, which later get asynchronously flushed to
> disk). Unfortunately for your situation, the purpose of hash partitioning
> is to *reduce* locality of data. (Sometimes that's useful, but *not*
> when processing batches.)
>
> --
> Angular momentum makes the world go 'round.
>
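
To make sure I follow the pre-sorting suggestion: is the idea just to
feed each slice to the target pre-sorted on the indexed field, i.e. the
same placeholder statement as above with an ORDER BY added?

  -- same placeholder schema as above; only the ORDER BY is new,
  -- sorting the input on the indexed column
  INSERT INTO target_table_main (id, logical_ts, payload)
  SELECT id, logical_ts, payload
  FROM   source_batch_p01
  WHERE  id BETWEEN 1 AND 10000
  ORDER  BY logical_ts
  ON CONFLICT (id) DO UPDATE
     SET logical_ts = EXCLUDED.logical_ts,
         payload    = EXCLUDED.payload;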

--
Regards,

Karthik K Kondamudi
