RE: long-standing data loss bug in initial sync of logical replication

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Benoit Lobréau <benoit(dot)lobreau(at)dalibo(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Nitin Motiani <nitinmotiani(at)google(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: RE: long-standing data loss bug in initial sync of logical replication
Date: 2025-03-03 07:41:08
Message-ID: OS0PR01MB571616A2C303FCED3CF3D67C94C92@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Friday, February 28, 2025 4:28 PM Benoit Lobréau <benoit(dot)lobreau(at)dalibo(dot)com> wrote:
>
> It took me a while but I ran the test on my laptop with 20 runs per test. I asked
> for a dedicated server and will re-run the tests if/when I have it.
>
> count of partitions | Head (sec) | Fix (sec) | Degradation (%)
> --------------------+------------+-----------+-----------------
> 1000                | 0,0265     | 0,028     | 5,66037735849054
> 5000                | 0,091      | 0,0945    | 3,84615384615385
> 10000               | 0,1795     | 0,1815    | 1,11420612813371
>
> Concurrent Txn | Head (sec) | Patch (sec) | Degradation in %
> ---------------+------------+-------------+-----------------
> 50             | 0,1797647  | 0,1920949   | 6,85907744957
> 100            | 0,3693029  | 0,3823425   | 3,53086856344
> 500            | 1,62265755 | 1,91427485  | 17,97158617972
> 1000           | 3,01388635 | 3,57678295  | 18,67676928162
> 2000           | 7,0171877  | 6,4713304   | 8,43500897435
>
> I'll try to run test2.pl later (right now it fails).
>
> hope this helps.

Thank you for testing and sharing the data!

A nitpick on the data for the Concurrent Transaction (2000) case: the table
shows HEAD taking longer than the patch, which seems unusual. However, I
confirmed that the details in the attachment are as expected, so this seems to
be a typo in the table. (I also assume you intended to use a decimal point
instead of a comma in values such as 8,43500...)
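
For reference, the percentages in the quoted table can be rechecked with a few
lines of Python. This is only a sketch: the function name is mine, and the
formula (patch - head) / head * 100 is an assumption, chosen because it
reproduces the reported values for the other rows. The timings are copied from
the table with decimal commas converted to points.

```python
def degradation_pct(head_sec: float, patch_sec: float) -> float:
    """Slowdown of the patched build relative to HEAD, in percent.

    Assumed formula: (patch - head) / head * 100.
    A negative result means the patched build was faster.
    """
    return (patch_sec - head_sec) / head_sec * 100.0


# (head, patch) timings in seconds, keyed by number of concurrent transactions,
# exactly as posted in the quoted table.
timings = {
    50: (0.1797647, 0.1920949),
    100: (0.3693029, 0.3823425),
    500: (1.62265755, 1.91427485),
    1000: (3.01388635, 3.57678295),
    2000: (7.0171877, 6.4713304),
}

for txns, (head, patch) in timings.items():
    print(f"{txns:>5} txns: {degradation_pct(head, patch):+.3f} %")

# The 2000 row again, with the two timings swapped.
head_2000, patch_2000 = timings[2000]
print(f" 2000 txns (swapped): {degradation_pct(patch_2000, head_2000):+.3f} %")
```

With the 2000 row taken as posted, the result is negative (the patch looks
faster). With the two timings swapped it comes out at about 8.435, which
matches the reported figure and is consistent with a typo in that row.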

The data suggests some regression, slightly more than Shlok’s findings, but it
is still within an acceptable range for me. Since the test script builds a real
subscription for testing, the results might be affected by network and
replication factors, as Amit pointed out. We will share a new test script soon
that uses the SQL API xxx_get_changes() instead. It would be great if you could
verify the performance using the updated script as well.

Best Regards,
Hou zj
