Re: Insert into on conflict, data size upto 3 billion records

From: Tim Cross <theophilusx(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Insert into on conflict, data size upto 3 billion records
Date: 2021-02-15 19:12:49
Message-ID: 87blclazw9.fsf@gmail.com
Lists: pgsql-general


Karthik K <kar6308(at)gmail(dot)com> writes:

> Exactly. For now, what I did was, as the table is already partitioned, I
> created 50 different connections and tried updating the target table by
> directly querying from the source partition tables. Are there any other
> techniques that I can use to speed this up? Also, when we use an on conflict
> statement for both insert and update, does Postgres use batching internally
> (committing every 10000 records etc.) or will it update all records at
> once? In that case, does it create a version for each record and swap them
> all at once? I'm wondering how atomicity is guaranteed. Also, if I have to
> do batching other than selecting from individual partitions, does doing it
> in batches of 10000 records help?
>

I have had pretty good success with the following strategy. However, you
will need to profile/test each assumption as things vary greatly
depending on data and table structure. A bit of trial and error is
usually required.

1. Use the \COPY command to upload the batch data into a temporary table
or at least a table with logging turned off
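As a minimal sketch (the table and file names here are only placeholders,
not from the original post):

    -- unlogged staging table shaped like the target, so the load is cheap
    CREATE UNLOGGED TABLE staging_batch (LIKE target_table INCLUDING DEFAULTS);

    -- client-side copy of a CSV batch into the staging table
    \copy staging_batch FROM 'batch_0001.csv' WITH (FORMAT csv)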

2. Run a stored procedure which first does the updates for existing rows
and then does the inserts for non-existing rows into your final table
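Roughly along these lines (a sketch only, assuming the staging and target
tables share an "id" key and a "val" payload column; adjust to your schema):

    -- update rows that already exist in the target
    UPDATE target_table t
       SET val = s.val
      FROM staging_batch s
     WHERE t.id = s.id;

    -- insert rows that do not exist in the target yet
    INSERT INTO target_table (id, val)
    SELECT s.id, s.val
      FROM staging_batch s
     WHERE NOT EXISTS (SELECT 1 FROM target_table t WHERE t.id = s.id);

On your atomicity question: whether you do it this way or as a single
INSERT ... ON CONFLICT ... DO UPDATE, each statement runs atomically inside
its transaction; Postgres does not commit in internal batches of 10000, so
any batching has to be done by you on the client side.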

3. If your table is partitioned, pre-process your data into separate
batches divided by the partition key, so that instead of one big batch
you have multiple smaller ones. If this is not possible, still break
your upload up into multiple batches rather than one huge batch.
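A rough sketch of the split, assuming range partitioning on a "created_on"
column (names invented for illustration):

    -- carve the staging data into one batch per target partition
    CREATE UNLOGGED TABLE staging_2021_01 AS
    SELECT * FROM staging_batch
     WHERE created_on >= DATE '2021-01-01' AND created_on < DATE '2021-02-01';

    CREATE UNLOGGED TABLE staging_2021_02 AS
    SELECT * FROM staging_batch
     WHERE created_on >= DATE '2021-02-01' AND created_on < DATE '2021-03-01';

Each per-partition batch can then be run through the update/insert step
against its matching partition, possibly over parallel connections.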

4. Optimise the update/insert statements to suit your data and table
structure, dropping any unnecessary indexes and re-building them once
the whole upload has finished. Note that this will need profiling:
depending on the index structure, dropping and re-creating can be slower
overall than leaving the indexes in place.
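For example (index and column names are placeholders; measure whether this
actually wins for your data):

    -- before the bulk upload starts
    DROP INDEX IF EXISTS target_table_val_idx;

    -- after the whole upload has finished
    CREATE INDEX target_table_val_idx ON target_table (val);
    -- or CREATE INDEX CONCURRENTLY ... if the table must stay available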

5. Determine the best times to run analyze to update the table stats. You
will probably want to do it after each update and insert run, but
sometimes it may be faster overall to just do it once per 'job'
(update + insert).
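Nothing fancy is needed, e.g.

    -- after each update/insert run, or once per job; profile both
    ANALYZE target_table;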

6. Don't forget to check the logs and watch for WAL writes being too
frequent etc. Often settings are tuned for 'normal' operation (outside
bulk uploads) and perform very poorly during bulk uploads. You need to
make sure you strike the right balance.
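For instance, if the logs show checkpoints happening too frequently during
the load, temporarily raising these (values purely illustrative) can help;
remember to set them back afterwards:

    ALTER SYSTEM SET max_wal_size = '16GB';         -- more WAL between checkpoints
    ALTER SYSTEM SET checkpoint_timeout = '30min';  -- checkpoint less often
    SELECT pg_reload_conf();                        -- apply without a restart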
