Re: long-standing data loss bug in initial sync of logical replication

From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nitin Motiani <nitinmotiani(at)google(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: long-standing data loss bug in initial sync of logical replication
Date: 2024-09-09 05:11:36
Message-ID: CANhcyEVOz9sGFNfPEAuBuWXuFg70W3kco-n-CuZnarkdqC=FMA@mail.gmail.com
Lists: pgsql-hackers

On Mon, 2 Sept 2024 at 10:12, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Aug 30, 2024 at 3:06 PM Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:
> >
> > Next I am planning to test solely on the logical decoding side and
> > will share the results.
> >
>
> Thanks, the next set of proposed tests makes sense to me. It will also
> be useful to generate some worst-case scenarios where the number of
> invalidations is more to see the distribution cost in such cases. For
> example, Truncate/Drop a table with 100 or 1000 partitions.
>
> --
> With Regards,
> Amit Kapila.

Hi,

I did some performance testing solely on the logical decoding side and
found some performance degradation for the following test case:
1. Created a publication on a single table, say 'tab_conc1'.
2. Created a second publication on a single table, say 'tp'.
4. Two sessions run in parallel, say S1 and S2.
5. Begin a transaction in S1.
6. In a loop (which runs 'count' times):
   S1: INSERT a row into table 'tab_conc1'
   S2: BEGIN; ALTER PUBLICATION DROP/ADD tp; COMMIT
7. COMMIT the transaction in S1.
8. Run 'pg_logical_slot_get_binary_changes' to get the decoded changes.
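
For readers who want to reproduce the interleaving, the steps above can be
sketched roughly as the following statement sequence. This is not the
attached test2.pl. The publication names (pub1, pub2), the slot name
(test_slot), the single-column shape of 'tab_conc1', and the reading of
"DROP/ADD" as a drop followed by an add inside one transaction are all
assumptions made for illustration. The slot is assumed to have been created
beforehand with the pgoutput plugin.

```python
# Sketch only: builds the ordered list of (session, SQL) pairs for the test
# case. pub1, pub2, test_slot and the column layout are hypothetical names,
# not taken from the original script.

def build_workload(count):
    """Return the statements in the order the two sessions would issue them."""
    steps = [("S1", "BEGIN;")]  # step 5: open the long transaction in S1
    for i in range(count):      # step 6: loop 'count' times
        steps.append(("S1", f"INSERT INTO tab_conc1 VALUES ({i});"))
        # Assumed reading of "DROP/ADD": both in the same S2 transaction.
        steps.append((
            "S2",
            "BEGIN; "
            "ALTER PUBLICATION pub2 DROP TABLE tp; "
            "ALTER PUBLICATION pub2 ADD TABLE tp; "
            "COMMIT;",
        ))
    steps.append(("S1", "COMMIT;"))  # step 7
    # Step 8: decode everything through the (assumed) pgoutput slot.
    steps.append((
        "S1",
        "SELECT count(*) FROM pg_logical_slot_get_binary_changes("
        "'test_slot', NULL, NULL, "
        "'proto_version', '1', 'publication_names', 'pub1,pub2');",
    ))
    return steps


if __name__ == "__main__":
    for session, sql in build_workload(2):
        print(f"{session}: {sql}")
```

Each S2 transaction generates invalidations while the S1 transaction is
still open, which is the situation the timings are meant to measure.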

Observation:
With the fix, a new entry appears in the decoded output. While debugging,
I found that this entry appears only when an 'INSERT' in session S1
follows an 'ALTER PUBLICATION' done in parallel in another session (in
other words, it is caused by the invalidation). I also observed that this
new entry carries the replica identity, attributes, etc., as the function
'logicalrep_write_rel' is called.

Performance:
We see a performance degradation because new entries are sent during
logical decoding. Results are an average of 5 runs.

 count  | Head (sec) | Fix (sec) | Degradation (%)
--------+------------+-----------+-----------------
  10000 |      1.298 |     1.574 | 21.26348228
  50000 |     22.892 |    24.997 | 9.195352088
 100000 |     88.602 |    93.759 | 5.820410374
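
The degradation column is consistent with the relative increase of the Fix
time over the Head time, i.e. (Fix - Head) / Head * 100. A small sketch to
recompute it from the timings in the table (the function name is just
illustrative):

```python
# Recompute the degradation percentages from the timings reported above.

def degradation_pct(head_sec, fix_sec):
    """Relative slowdown of the patched build versus HEAD, in percent."""
    return (fix_sec - head_sec) / head_sec * 100.0


# count -> (Head seconds, Fix seconds), copied from the table.
results = {
    10000: (1.298, 1.574),
    50000: (22.892, 24.997),
    100000: (88.602, 93.759),
}

if __name__ == "__main__":
    for count, (head, fix) in results.items():
        print(f"{count:>6} | {degradation_pct(head, fix):.2f}%")
```

Note that the absolute difference grows with 'count' (about 0.28 s, 2.1 s
and 5.2 s) even though the percentage shrinks.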

I have also attached the test script here.

Thanks and Regards,
Shlok Kyal

Attachment Content-Type Size
test2.pl application/octet-stream 2.1 KB
