From: | Victor Yegorov <vyegorov(at)gmail(dot)com> |
---|---|
To: | Lars Vonk <lars(dot)vonk(at)gmail(dot)com> |
Cc: | pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Re: Performance penalty during logical postgres replication |
Date: | 2020-12-09 17:45:02 |
Message-ID: | CAGnEbogTv9LHwZqRQszYHsxTSPjWsxu5SmMoBvff02MYLzLXEg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
ср, 9 дек. 2020 г. в 10:21, Lars Vonk <lars(dot)vonk(at)gmail(dot)com>:
> We are doing a logical postgres replication from Postgres 11 to 12. Our
> database is around 700GB (8 cpu's, 32 GB).
> During the replication process, at some point, we see a huge performance
> penalty on a particular table. This table acts as a queue with lots of
> inserts and deletes happening throughout the day. For most of the time this
> table is empty, but during this performance penalty the number of rows in
> this table grows to 10.000 rows, and processing is not fast enough to empty
> this table. Main reason for this (as far as we see) is that the performance
> of the query for selecting the next row to process drops from < 10MS to
> 400MS. This eventually causes too much cpu load on the Primary and we have
> to cancel the replication process.
>
We already tried the initial load three times, and it consistently fails
> with the same "error". Last try was a per table approach and excluding this
> "queue" table.
> After cancelling the replication the query is fast again and the load on
> the Primary goes back to normal. We see that this happens when replicating
> large tables (> millions of rows). During this performance penalty the
> explain of the query selecting the next row from this table tells us it is
> doing a sequential scan (there is an index but it is not used).
>
> - What could cause this performance penalty?
> - Is this something other people experienced as well during the initial
> load of a logical replication with large tables?
> - We are now thinking of temporarily increasing the number of CPU's and
> RAM for the migration. Would this help in this case?
>
I've seen similar symptoms in cases with (a) home-made queues in the tables
and (b) long transactions.
Unfortunately, queue requires frequent vacuuming to preserve more or less
constant size of the queue and it's indexes.
And long transactions prevent the vacuum from cleaning up the queue.
Initial synchronization phase of the logical replication is in fact such a
transaction.
I would recommend doing the following:
- avoid adding ALL tables to the publication
- instead, split all tables in a batches in such a way, that initial batch
processing takes limited time (say, 15-30 minutes at most)
- of course, this leaves the biggest tables alone — add those one by one to
the publication, preferably at the time slot with minimal load on the queue.
- make sure to catch up on the queue processing and vacuum it between
batches
- on the receiving side, avoid creating indexes on the tables: create just
a necessary PK or UK, wait for the initial load to complete and then add
all the rest ones
As for the queue, PGQ from skytools is using different approach to maintain
queue tables:
- once in a while (2 hours by default) processing is switched to a new
table, tab_1, tab_2, tab_3 are used in a round
- after the switch, any remaining entries can be moved from previous to the
live table (shouldn't be necessary if switch is done properly, although
might be tricky in a presence of a long transactions)
- previous table is TRUNCATEd
In your case, you can do `VACUUM FULL` between replicating each batch of
tables.
--
Victor Yegorov
From | Date | Subject | |
---|---|---|---|
Next Message | Rich Shepard | 2020-12-09 17:53:41 | Re: User input to queries |
Previous Message | Laurenz Albe | 2020-12-09 17:35:44 | Re: User input to queries |