Re: long-standing data loss bug in initial sync of logical replication

From: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nitin Motiani <nitinmotiani(at)google(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: long-standing data loss bug in initial sync of logical replication
Date: 2024-08-30 09:35:48
Message-ID: CANhcyEUhYvw1ZaXB+c4jJSnwP7hPQ_XCSif-xmjm0oca6RbETw@mail.gmail.com
Lists: pgsql-hackers

> BTW, we should do some performance testing by having a mix of DML and
> DDLs to see the performance impact of this patch.
>
> [1] - https://www.postgresql.org/message-id/CAD21AoAenVqiMjpN-PvGHL1N9DWnHSq673bfgr6phmBUzx=kLQ@mail.gmail.com
>

I did some performance testing and found a small performance impact in
the following case (a SQL sketch of the scenario is given after the steps):

1. Created a publisher-subscriber setup replicating a single table, say 'tab_conc1'.
2. Created a second publisher-subscriber setup replicating a single table, say 'tp'.
3. Created 'tcount' additional tables. These tables are not part of any publication.
4. Ran two sessions in parallel, say S1 and S2.
5. Begin a transaction in S1.
6. Now, in a loop (run 100 times):
S1: Insert a row into table 'tab_conc1'.
S1: Insert a row into each of the 'tcount' tables.
S2: BEGIN; ALTER PUBLICATION on the second publication; COMMIT;
With the patch, each such ALTER PUBLICATION calls
'rel_sync_cache_publication_cb' during invalidation processing, which
invalidates the relation sync cache entries for all tables, i.e. for
'tab_conc1' as well as all the 'tcount' tables.
7. COMMIT the transaction in S1.
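
For reference, here is a minimal SQL sketch of the scenario. The table
and publication names are illustrative, and SET TABLE is used here only
as a representative publication DDL; the attached test.pl drives the
actual runs.

-- Publisher side setup (names are illustrative):
CREATE TABLE tab_conc1 (a int);
CREATE TABLE tp (a int);
CREATE PUBLICATION pub1 FOR TABLE tab_conc1;
CREATE PUBLICATION pub2 FOR TABLE tp;
-- plus 'tcount' tables t1 .. tN that belong to no publication,
-- with a matching subscription for each publication on the subscriber.

-- Session S1:
BEGIN;

-- Repeated 100 times, with S2 interleaved after each iteration:
INSERT INTO tab_conc1 VALUES (1);     -- S1
INSERT INTO t1 VALUES (1);            -- S1, likewise for t2 .. tN

-- Session S2, once per iteration:
BEGIN;
ALTER PUBLICATION pub2 SET TABLE tp;  -- triggers the invalidation
COMMIT;

-- Session S1, after the loop:
COMMIT;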

The performance in this case is:

No. of tables | With patch (in ms) | With head (in ms)
------------------------------------------------------
tcount = 100  |           101376.4 |          101357.8
tcount = 1000 |           994085.4 |          993471.4

With the patch, performance is slower by 0.018% for 100 tables and by
0.06% for 1000 tables. These results are the average of 5 runs.
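
The slowdown is computed from the table above as:

tcount = 100:  (101376.4 - 101357.8) / 101357.8 * 100 ≈ 0.018%
tcount = 1000: (994085.4 - 993471.4) / 993471.4 * 100 ≈ 0.06%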

Other than this, I tested the following cases and found no performance
impact:
1. tcount = 10.
2. tcount = 0, with the loop run 1000 times.

I have also attached the test script and the configuration of the
machine on which the performance testing was done.
Next, I plan to test performance on the logical decoding side alone and
will share the results.

Thanks and Regards,
Shlok Kyal

Attachment Content-Type Size
os_info.txt text/plain 549 bytes
cpu_info.txt text/plain 1.2 KB
memory_info.txt text/plain 1.3 KB
test.pl application/octet-stream 3.1 KB
