Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Keisuke Kuroda <keisuke(dot)kuroda(dot)3862(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
Date: 2020-09-26 10:58:44
Message-ID: CAA4eK1Lb3sY8TEfQrtZ8ceeHy3=Z-H=dsYcbjWnYonD=e8EvHA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 23, 2020 at 1:09 PM Keisuke Kuroda
<keisuke(dot)kuroda(dot)3862(at)gmail(dot)com> wrote:
>
> Hi hackers,
>
> I found a problem in logical replication.
> It seems to have the same cause as the following previously reported problems:
>
> Creating many tables gets logical replication stuck
> https://www.postgresql.org/message-id/flat/20f3de7675f83176253f607b5e199b228406c21c.camel%40cybertec.at
>
> Logical decoding CPU-bound w/ large number of tables
> https://www.postgresql.org/message-id/flat/CAHoiPjzea6N0zuCi%3D%2Bf9v_j94nfsy6y8SU7-%3Dbp4%3D7qw6_i%3DRg%40mail.gmail.com
>
> # problem
>
> * logical replication enabled
> * the walsender process has a RelfilenodeMap cache (2000 relations in this case)
> * TRUNCATE, DROP, or CREATE many tables in the same transaction
>
> At this point, the walsender process continuously uses 100% of one CPU core.
>
...
...
>
> ./src/backend/replication/logical/reorderbuffer.c (around line 1746):
>
> case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
>     Assert(change->data.command_id != InvalidCommandId);
>
>     if (command_id < change->data.command_id)
>     {
>         command_id = change->data.command_id;
>
>         if (!snapshot_now->copied)
>         {
>             /* we don't use the global one anymore */
>             snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
>                                                  txn, command_id);
>         }
>
>         snapshot_now->curcid = command_id;
>
>         TeardownHistoricSnapshot(false);
>         SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
>
>         /*
>          * Every time the CommandId is incremented, we could
>          * see new catalog contents, so execute all
>          * invalidations.
>          */
>         ReorderBufferExecuteInvalidations(rb, txn);
>     }
>
>     break;
>
> Do you have any solutions?
>

Yeah, I have an idea on how to solve this problem. The problem arises
primarily because, until now, we received invalidations only at commit
time and therefore had to execute all of them after each command id
change. However, after commit c55040ccd0 (When wal_level=logical,
write invalidations at command end into WAL so that decoding can use
this information.), we actually know exactly when each invalidation
needs to be executed. The idea is that instead of collecting
invalidations only in ReorderBufferTxn, we also collect them in the
form of a ReorderBufferChange, similar to what we do for other changes
(for example, REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID). We still
need to collect them in ReorderBufferTxn as well, because if the
transaction is aborted or an error occurs while executing the changes,
we need to perform all the invalidations.
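
To make this a bit more concrete, here is a rough and untested sketch of
the shape I have in mind. The change kind REORDER_BUFFER_CHANGE_INVALIDATION,
the data.inval member, and the helper name below are illustrative names
only, not existing code:

/*
 * Decoding an XLOG_XACT_INVALIDATIONS record: queue the messages as a
 * regular change so that they are replayed exactly where they were
 * logged in the transaction.  (Names are illustrative only.)
 */
static void
ReorderBufferQueueInvalidations(ReorderBuffer *rb, TransactionId xid,
                                XLogRecPtr lsn, Size nmsgs,
                                SharedInvalidationMessage *msgs)
{
    ReorderBufferChange *change = ReorderBufferGetChange(rb);

    change->action = REORDER_BUFFER_CHANGE_INVALIDATION;
    change->data.inval.ninvalidations = nmsgs;
    change->data.inval.invalidations = (SharedInvalidationMessage *)
        palloc(nmsgs * sizeof(SharedInvalidationMessage));
    memcpy(change->data.inval.invalidations, msgs,
           nmsgs * sizeof(SharedInvalidationMessage));

    /*
     * ... then queue it via ReorderBufferQueueChange() like any other
     * change, and also keep accumulating the messages in ReorderBufferTxn
     * so that the abort/error path can still execute the complete set.
     */
}

/*
 * In ReorderBufferProcessTXN(), a new switch case would then replay only
 * the messages belonging to this point in the transaction:
 */
case REORDER_BUFFER_CHANGE_INVALIDATION:
    for (int i = 0; i < change->data.inval.ninvalidations; i++)
        LocalExecuteInvalidationMessage(        /* from utils/inval.h */
            &change->data.inval.invalidations[i]);
    break;

With something like this, the REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID
case should no longer need to execute the transaction's entire accumulated
set of invalidations, so the cost becomes proportional to the number of
invalidation messages actually logged rather than to
(command id changes * accumulated invalidations).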

--
With Regards,
Amit Kapila.
