RE: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication.

From: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
To: "sytoptimisprime(at)163(dot)com" <sytoptimisprime(at)163(dot)com>
Cc: "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: BUG #18267: Logical replication bug: data is not synchronized after Alter Publication.
Date: 2024-01-05 03:54:52
Message-ID: TY3PR01MB9889E9DB1AC80C2DEC3B37D6F5662@TY3PR01MB9889.jpnprd01.prod.outlook.com
Lists: pgsql-bugs

Dear Song,

>
> Hi hackers, I found when insert plenty of data into a table, and add the
> table to publication (through Alter Publication) meanwhile, it's likely that
> the incremental data cannot be synchronized to the subscriber. Here is my
> test method:

Good catch.

> 1. On publisher and subscriber, create table for test:
> CREATE TABLE tab_1 (a int);
>
> 2. Setup logical replication:
> on publisher:
> SELECT pg_create_logical_replication_slot('slot1', 'pgoutput', false,
> false);
> CREATE PUBLICATION tap_pub;
> on subscriber:
> CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr'
> PUBLICATION
> tap_pub WITH (enabled = true, create_slot = false, slot_name='slot1')
>
> 3. Perform Insert:
> for (my $i = 1; $i <= 1000; $i++) {
> $node_publisher->safe_psql('postgres', "INSERT INTO tab_1 SELECT
> generate_series(1, 1000)");
> }
> Each transaction contains 1000 insertion, and 1000 transactions are in
> total.
>
> 4. When performing step 3, add table tab_1 to publication.
> ALTER PUBLICATION tap_pub ADD TABLE tab_1
> ALTER SUBSCRIPTION tap_sub REFRESH PUBLICATION

I could reproduce the failure. Please see the attached script.

In the script, ALTER PUBLICATION is executed while the initial data sync is in progress.
(The workload is almost the same as the one you posted, but the number of rows is reduced.)

In total, 40000 tuples are inserted on the publisher. However, after some time, only 25000 tuples have been replicated.

```
publisher=# SELECT count(*) FROM tab_1;
 count
-------
 40000
(1 row)

subscriber=# SELECT count(*) FROM tab_1;
 count
-------
 25000
(1 row)
```

Is this the same failure you saw?

> The root cause of the problem is as follows:
> pgoutput relies on the invalidation mechanism to validate publications. When
> walsender decoding an Alter Publication transaction, catalog caches are
> invalidated at once. Furthermore, since pg_publication_rel is modified,
> snapshot changes are added to all transactions currently being decoded. For
> other transactions, catalog caches have been invalidated. However, it is
> likely that the snapshot changes have not yet been decoded. In pgoutput
> implementation, these transactions query the system table pg_publication_rel
> to determine whether to publish changes made in transactions. In this case,
> catalog tuples are not found because snapshot has not been updated. As a
> result, changes in transactions are considered not to be published, and
> subsequent data cannot be synchronized.
>
> I think it's necessary to add invalidations to other transactions after
> adding a snapshot change to them.
> Therefore, I submitted a patch for this bug.
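
To check my understanding of the analysis above, here is a toy model of the
ordering problem. It is only a sketch and not PostgreSQL code: `ToyDecoder`,
`replay`, and the versioned-catalog representation are hypothetical names I made
up for illustration, and the flag models the suggestion quoted above
(invalidate again after the snapshot change), not the patch set in [1].

```python
from dataclasses import dataclass, field


@dataclass
class ToyDecoder:
    """Hypothetical stand-in for a decoder with a per-table publish cache."""
    # catalog_versions[v] = set of published tables visible to snapshot v
    catalog_versions: list = field(default_factory=lambda: [set()])
    # table name -> cached answer to "is this table published?"
    cache: dict = field(default_factory=dict)

    def alter_publication_add(self, table: str) -> int:
        """Add a table to the publication; caches are invalidated at once."""
        self.catalog_versions.append(self.catalog_versions[-1] | {table})
        self.cache.clear()
        return len(self.catalog_versions) - 1  # snapshot that sees the change

    def is_published(self, table: str, snapshot: int) -> bool:
        """Answer from the cache, rebuilding it with the caller's snapshot."""
        if table not in self.cache:
            self.cache[table] = table in self.catalog_versions[snapshot]
        return self.cache[table]


def replay(reinvalidate_after_snapshot_change: bool) -> int:
    """Return how many of the later changes would be sent to the subscriber."""
    decoder = ToyDecoder()
    old_snapshot = 0
    new_snapshot = decoder.alter_publication_add("tab_1")

    sent = 0
    # An in-progress transaction has not received the snapshot change yet,
    # so it rebuilds the cache entry from the stale catalog view.
    sent += decoder.is_published("tab_1", old_snapshot)

    # The snapshot change arrives; optionally invalidate the cache again.
    if reinvalidate_after_snapshot_change:
        decoder.cache.clear()

    # Later changes use the new snapshot, but may still hit the stale entry.
    for _ in range(3):
        sent += decoder.is_published("tab_1", new_snapshot)
    return sent
```

In this model, `replay(False)` sends none of the later changes because the
stale "not published" entry is never rebuilt, while `replay(True)` sends all
three of them.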

I cannot see your attachment, but I found that the patch set proposed in [1] solves
the issue. After applying 0001 + 0002 + 0003 (open relations with ShareRowExclusiveLock
in OpenTableList), the data gap disappeared. Thoughts?

[1]: https://www.postgresql.org/message-id/de52b282-1166-1180-45a2-8d8917ca74c6%40enterprisedb.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment: test_0104.sh (application/octet-stream, 1.7 KB)
