From: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: "sytoptimisprime(at)163(dot)com" <sytoptimisprime(at)163(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication.
Date: 2024-01-08 05:13:52
Message-ID: CAFiTN-vsdWgthGJFOG74E94LAi5E5DmP0Ag616V62hftHq6Ldw@mail.gmail.com
Lists: pgsql-bugs
On Fri, Jan 5, 2024 at 9:25 AM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> Dear Song,
>
> >
> > Hi hackers, I found that when inserting plenty of data into a table
> > and, at the same time, adding the table to a publication (through
> > ALTER PUBLICATION), it is likely that the incremental data will not
> > be synchronized to the subscriber. Here is my test method:
>
> Good catch.
>
> > 1. On the publisher and the subscriber, create a table for the test:
> > CREATE TABLE tab_1 (a int);
> >
> > 2. Set up logical replication.
> > On the publisher:
> > SELECT pg_create_logical_replication_slot('slot1', 'pgoutput', false, false);
> > CREATE PUBLICATION tap_pub;
> > On the subscriber:
> > CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr'
> > PUBLICATION tap_pub
> > WITH (enabled = true, create_slot = false, slot_name = 'slot1');
> >
> > 3. Perform inserts:
> > for (my $i = 1; $i <= 1000; $i++) {
> >     $node_publisher->safe_psql('postgres',
> >         "INSERT INTO tab_1 SELECT generate_series(1, 1000)");
> > }
> > Each transaction inserts 1000 rows, and there are 1000 transactions
> > in total.
> >
> > 4. While step 3 is running, add table tab_1 to the publication:
> > ALTER PUBLICATION tap_pub ADD TABLE tab_1;
> > ALTER SUBSCRIPTION tap_sub REFRESH PUBLICATION;
>
> I could reproduce the failure. PSA the script.
>
> In the script, ALTER PUBLICATION is executed while the initial data sync is in progress.
> (The workload is almost the same as what the reporter posted, but the number of rows is reduced.)
>
> In total, 40,000 tuples are inserted on the publisher. However, even after some time, only 25,000 tuples are replicated:
>
> ```
> publisher=# SELECT count(*) FROM tab_1 ;
> count
> -------
> 40000
> (1 row)
>
> subscriber=# SELECT count(*) FROM tab_1 ;
> count
> -------
> 25000
> (1 row)
> ```
>
> Is it the same failure you saw?
With your attached script I was able to see this gap. I haven't dug
deeper, but from an initial investigation I could see that even after
ALTER PUBLICATION, pgoutput_change continues to see
'relentry->pubactions.pubinsert' as false, even after re-fetching the
relation entry following the invalidation. That suggests the
invalidation framework is working fine, but we are using an older
snapshot to fetch the entry. I did not debug further why it is not
getting the updated snapshot that can see the change in the
publication, because I assume Yutao Song has already analyzed that, as
per his first email, so I will wait for his patch.
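
For context, here is a simplified sketch of the lookup path, condensed
from get_rel_sync_entry() in pgoutput.c (the locals and surrounding
code are illustrative, not the exact source). The point is that
GetRelationPublications() scans pg_publication_rel under the historic
snapshot of the transaction being decoded, so if that snapshot predates
the ALTER PUBLICATION ... ADD TABLE, the membership row is invisible
and pubinsert stays false even though the invalidation correctly forced
a rebuild of the entry:

```c
/*
 * Simplified sketch, condensed from get_rel_sync_entry() in pgoutput.c
 * (caching and error handling omitted; locals are illustrative).
 */
if (!entry->replicate_valid)
{
    /* Scans pg_publication_rel under the *historic* snapshot. */
    List       *pubids = GetRelationPublications(relid);
    ListCell   *lc;

    foreach(lc, data->publications)
    {
        Publication *pub = lfirst(lc);

        /*
         * If the snapshot predates ALTER PUBLICATION ... ADD TABLE,
         * the new pg_publication_rel row is invisible here, so the
         * relation looks unpublished and pubinsert remains false.
         */
        if (list_member_oid(pubids, pub->oid))
        {
            entry->pubactions.pubinsert |= pub->pubactions.pubinsert;
            entry->pubactions.pubupdate |= pub->pubactions.pubupdate;
            entry->pubactions.pubdelete |= pub->pubactions.pubdelete;
        }
    }

    entry->replicate_valid = true;
}
```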
> > The root cause of the problem is as follows:
> > pgoutput relies on the invalidation mechanism to validate
> > publications. When the walsender decodes an ALTER PUBLICATION
> > transaction, catalog caches are invalidated at once. Furthermore,
> > since pg_publication_rel is modified, snapshot changes are added to
> > all transactions currently being decoded. For those other
> > transactions, the catalog caches have been invalidated, but it is
> > likely that the snapshot changes have not yet been decoded. In the
> > pgoutput implementation, these transactions query the system table
> > pg_publication_rel to determine whether to publish the changes they
> > contain. In this case, the catalog tuples are not found because the
> > snapshot has not been updated. As a result, the transactions'
> > changes are considered unpublished, and the subsequent data cannot
> > be synchronized.
> >
> > I think it is necessary to also add invalidations to the other
> > transactions when a snapshot change is added to them, so I have
> > submitted a patch for this bug.
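
To make that proposed direction concrete, here is a rough sketch of the
shape such a fix could take, modeled on
SnapBuildDistributeNewCatalogSnapshot() in snapbuild.c. This is my own
illustration, not the submitted patch; the function name and the extra
ninvalidations/invalidations parameters are hypothetical:

```c
/*
 * Hypothetical sketch, not the submitted patch: when a catalog-modifying
 * transaction (e.g. ALTER PUBLICATION) commits, queue its invalidation
 * messages into every other in-progress transaction alongside the new
 * catalog snapshot, so that decoding those transactions rebuilds the
 * relcache/syscache state at the matching point in the change stream.
 * The ninvalidations/invalidations parameters are assumed to be passed
 * in by the caller that decoded the commit record.
 */
static void
SnapBuildDistributeSnapshotAndInval(SnapBuild *builder, XLogRecPtr lsn,
                                    TransactionId xid, Size ninvalidations,
                                    SharedInvalidationMessage *invalidations)
{
    dlist_iter  txn_i;

    dlist_foreach(txn_i, &builder->reorder->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn;

        txn = dlist_container(ReorderBufferTXN, node, txn_i.cur);

        /* The DDL transaction itself already carries its invalidations. */
        if (txn->xid == xid)
            continue;

        /* Existing behavior: queue the new catalog snapshot. */
        SnapBuildSnapIncRefcount(builder->snapshot);
        ReorderBufferAddSnapshot(builder->reorder, txn->xid, lsn,
                                 builder->snapshot);

        /* Proposed addition: also queue the invalidation messages. */
        ReorderBufferAddInvalidations(builder->reorder, txn->xid, lsn,
                                      ninvalidations, invalidations);
    }
}
```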
>
> I cannot see your attachment, but I found that the proposed patch in [1] can solve
> the issue. After applying 0001 + 0002 + 0003 (opening relations with ShareRowExclusiveLock
> in OpenTableList), the data gap was gone. Thoughts?
I am not sure why 'opening relations with ShareRowExclusiveLock' would
help in this case. Have you investigated that?
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com