Re: Performance penalty during logical postgres replication

From: Lars Vonk <lars(dot)vonk(at)gmail(dot)com>
To: Victor Yegorov <vyegorov(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Performance penalty during logical postgres replication
Date: 2020-12-10 13:08:10
Message-ID: CAMX1ThjnU0OXutAOSNwsfit-juwcjf7bWq0j56BbJh_aLhekUA@mail.gmail.com
Lists: pgsql-general

It has been 4 hours and it is safe to say that the measures we took have
had a huge positive effect: more than 30 times faster and no noticeable
effect on the running Primary at all.
A 20GB table is now replicated in under 10 minutes.

- We removed all indices except the PK and unique ones from the large
tables.
- We changed the query on the queue table to add a 'SKIP LOCKED' clause
(see the sketch below).
- We take a per-table approach for the larger tables.
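
For reference, a minimal sketch of what the dequeue query looks like after
the change (the table and column names here are made up for illustration;
our real schema differs):

  -- Take the next job; rows locked by another worker are skipped instead
  -- of waited on, so workers no longer pile up behind each other.
  SELECT id, payload
    FROM job_queue
   ORDER BY id
   LIMIT 1
     FOR UPDATE SKIP LOCKED;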

I think removing the indices had the most significant impact, but I am not
sure how to prove this since we made multiple changes at the same time.
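
In case it is useful to others, this is roughly what our per-table approach
looks like (the publication, subscription, table and index names below are
placeholders, and the connection string is simplified):

  -- On the Primary: publish one large table at a time instead of FOR ALL TABLES.
  CREATE PUBLICATION big_tables_pub FOR TABLE big_table_1;

  -- On the receiving side: drop the non-PK, non-unique indexes first ...
  DROP INDEX IF EXISTS big_table_1_created_at_idx;

  -- ... subscribe and wait for the initial copy of big_table_1 to finish ...
  CREATE SUBSCRIPTION big_tables_sub
    CONNECTION 'host=primary dbname=ourdb'
    PUBLICATION big_tables_pub;

  -- ... then recreate the indexes and add the next table.
  CREATE INDEX CONCURRENTLY big_table_1_created_at_idx
    ON big_table_1 (created_at);
  ALTER PUBLICATION big_tables_pub ADD TABLE big_table_2;   -- on the Primary
  ALTER SUBSCRIPTION big_tables_sub REFRESH PUBLICATION;    -- on the receiving side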

Thanks again for the tips!

-- Lars

On Thu, Dec 10, 2020 at 9:12 AM Lars Vonk <lars(dot)vonk(at)gmail(dot)com> wrote:

> Hi,
>
>> - on the receiving side, avoid creating indexes on the tables: create
>> just the necessary PK or UK, wait for the initial load to complete and
>> then add all the remaining ones
>>
>
> Thanks, this is a good tip. We are going to apply it.
>
> We also noticed that the code fetching the next item from the "queue" was
> doing the query with 'SELECT FOR UPDATE', but without 'SKIP LOCKED'. This
> probably also caused extra wait time when the server was busier than usual
> during the replication. So we are going to add this and try again. We are
> also minimizing load on the queue during the initial replication.
>
> On to the next try.
>
> Lars
>
> On Wed, Dec 9, 2020 at 6:45 PM Victor Yegorov <vyegorov(at)gmail(dot)com> wrote:
>
>> On Wed, Dec 9, 2020 at 10:21 Lars Vonk <lars(dot)vonk(at)gmail(dot)com> wrote:
>>
>>> We are doing logical Postgres replication from Postgres 11 to 12. Our
>>> database is around 700GB (8 CPUs, 32 GB RAM).
>>> During the replication process, at some point, we see a huge performance
>>> penalty on a particular table. This table acts as a queue, with lots of
>>> inserts and deletes happening throughout the day. Most of the time this
>>> table is empty, but during this performance penalty the number of rows
>>> grows to 10,000 and processing is not fast enough to empty the table. The
>>> main reason for this (as far as we can see) is that the performance of
>>> the query for selecting the next row to process drops from < 10 ms to
>>> 400 ms. This eventually causes too much CPU load on the Primary and we
>>> have to cancel the replication process.
>>>
>>> We already tried the initial load three times, and it consistently fails
>>> with the same "error". The last try was a per-table approach excluding
>>> this "queue" table.
>>> After cancelling the replication the query is fast again and the load on
>>> the Primary goes back to normal. We see that this happens when replicating
>>> large tables (millions of rows). During this performance penalty the
>>> EXPLAIN of the query selecting the next row from this table tells us it is
>>> doing a sequential scan (there is an index but it is not used).
>>>
>>> - What could cause this performance penalty?
>>> - Is this something other people experienced as well during the initial
>>> load of a logical replication with large tables?
>>> - We are now thinking of temporarily increasing the number of CPUs and
>>> RAM for the migration. Would this help in this case?
>>>
>>
>> I've seen similar symptoms in cases with (a) home-made queues in tables
>> and (b) long transactions.
>> Unfortunately, a queue requires frequent vacuuming to keep the queue and
>> its indexes at a more or less constant size.
>> And long transactions prevent vacuum from cleaning up the queue. The
>> initial synchronization phase of logical replication is in fact such a
>> transaction.
>>
>> I would recommend doing the following:
>> - avoid adding ALL tables to the publication
>> - instead, split the tables into batches in such a way that the initial
>> processing of each batch takes limited time (say, 15-30 minutes at most)
>> - of course, this leaves out the biggest tables; add those one by one to
>> the publication, preferably in a time slot with minimal load on the queue
>> - make sure to catch up on the queue processing and vacuum it between
>> batches
>> - on the receiving side, avoid creating indexes on the tables: create
>> just the necessary PK or UK, wait for the initial load to complete and
>> then add all the remaining ones
>>
>> As for the queue, PGQ from skytools uses a different approach to
>> maintain queue tables:
>> - once in a while (every 2 hours by default) processing is switched to a
>> new table; tab_1, tab_2 and tab_3 are used in a round-robin fashion
>> - after the switch, any remaining entries can be moved from the previous
>> table to the live one (this shouldn't be necessary if the switch is done
>> properly, although it might be tricky in the presence of long transactions)
>> - the previous table is TRUNCATEd
>>
>> In your case, you can do `VACUUM FULL` between replicating each batch of
>> tables.
>>
>> --
>> Victor Yegorov
>>
>
