long-standing data loss bug in initial sync of logical replication

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: long-standing data loss bug in initial sync of logical replication
Date: 2023-11-17 14:36:25
Message-ID: de52b282-1166-1180-45a2-8d8917ca74c6@enterprisedb.com
Lists: pgsql-hackers

Hi,

It seems there's a long-standing data loss issue related to the initial
sync of tables in the built-in logical replication (publications etc.).
I can reproduce it fairly reliably, but I haven't figured out all the
details yet and I'm a bit out of ideas, so I'm sharing what I know with
the hope someone takes a look and either spots the issue or has some
other insight ...

On pgsql-bugs, Depesz reported [1] cases where tables are added to a
publication but end up missing rows on the subscriber. I didn't know
what might be the issue, but given his experience I decided to do some
blind attempts to reproduce it.

I'm not going to repeat all the details from the pgsql-bugs thread, but
I ended up writing a script that does a randomized stress test of
tablesync under concurrent load. Attached are two scripts:
crash-test.sh does the main work, while run.sh drives the test - it
executes crash-test.sh in a loop and generates random parameters for it.

The run.sh generates the number of tables, the refresh interval (after
how many tables we refresh the subscription) and how long to sleep
between steps (to allow pgbench to do more work).

The crash-test.sh then does this:

1) initializes two clusters (expects $PATH to have pg_ctl etc.)

2) configures them for logical replication (wal_level, ...)

3) creates publication and subscription on the nodes

4) creates a bunch of tables

5) starts a pgbench that inserts data into the tables

6) adds the tables to the publication one by one, occasionally
refreshing the subscription (see the SQL sketch after this list)

7) waits for tablesync of all the tables to complete (so that the
tables get into the 'r' state, thus replicating normally)

8) stops the pgbench

9) waits for the subscriber to fully catch up

10) compares the tables on the publisher/subscriber nodes
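
The core of steps (6) and (7) boils down to roughly this SQL (a sketch
with made-up names pub1/sub1/t1, not the exact commands the script runs):

-- publisher: add the next table to the publication
ALTER PUBLICATION pub1 ADD TABLE t1;

-- subscriber: pick up newly published tables (every N tables)
ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION;

-- subscriber: wait until no table remains in a state other than 'r'
SELECT count(*) FROM pg_subscription_rel WHERE srsubstate != 'r';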

To run this, just make sure PATH includes the PostgreSQL binaries, and
do e.g.

./run.sh 10

which does 10 runs of crash-test.sh with random parameters. Each run can
take a couple minutes, depending on the parameters, hardware etc.

Obviously, we expect the tables to match on the two nodes, but the
script regularly detects cases where the subscriber is missing some of
the rows. The script dumps those tables, and the rows contain timestamps
and LSNs to allow "rough correlation" (imperfect thanks to concurrency).
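
The comparison itself is nothing fancy - essentially running the same
query on both nodes and diffing the output, something like this (with
hypothetical column names "ts" and "lsn"; the script may name them
differently):

-- run on both nodes, diff the results
SELECT ts, lsn FROM t1 ORDER BY lsn, ts;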

Depesz reported "gaps" in the data, i.e. missing a chunk of data, but
then following rows seemingly replicated. I did see such cases too, but
most of the time I see a missing chunk of rows at the end (but maybe if
the test continued a bit longer, it'd replicate some rows).

The report talks about replication between pg12->pg14, but I don't think
the cross-version part is necessary - I'm able to reproduce the issue on
individual versions since 12 (e.g. 12->12). I haven't tried 11, but I'd
be surprised if it wasn't affected too.

The rows include `pg_current_wal_lsn()` to roughly track the LSN at
which each row was inserted, and the "gap" of missing rows for each
table seems to match pg_subscription_rel.srsublsn, i.e. the LSN up to
which tablesync copied the data, after which the table should be
replicated as usual.
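
That correlation is easy to check by hand, e.g. like this (again using
the hypothetical "lsn" column):

-- subscriber: per-table sync state and the LSN up to which tablesync
-- copied the data
SELECT srrelid::regclass, srsubstate, srsublsn FROM pg_subscription_rel;

-- subscriber: the last row that actually made it into the table
SELECT max(lsn) FROM t1;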

Another interesting observation is that the issue only happens for "bulk
insert" transactions, i.e.

BEGIN;
... INSERT into all tables ...
COMMIT;

but not when each insert is a separate transaction. A bit strange.
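
The pgbench workload is just a custom script doing those inserts,
roughly like this (a sketch with the hypothetical columns from above;
the attached script may differ in details):

-- bulk-insert variant (triggers the issue)
BEGIN;
INSERT INTO t1 (ts, lsn) VALUES (now(), pg_current_wal_lsn());
INSERT INTO t2 (ts, lsn) VALUES (now(), pg_current_wal_lsn());
-- ... one INSERT per table ...
COMMIT;

-- separate-transactions variant (no issue observed) - without the
-- explicit BEGIN/COMMIT, pgbench runs each INSERT as its own transaction
INSERT INTO t1 (ts, lsn) VALUES (now(), pg_current_wal_lsn());
INSERT INTO t2 (ts, lsn) VALUES (now(), pg_current_wal_lsn());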

After quite a bit of debugging, I came to the conclusion this happens
because we fail to invalidate caches on the publisher, so it does not
realize it should start sending rows for that table.

In particular, we initially build the RelationSyncEntry when the table
is not yet included in the publication, so we end up with
pubinsert=false, thus not replicating the inserts. That makes sense, but
we then seem to fail to invalidate the entry after the table gets added
to the publication.

The other problem is that even if we do happen to invalidate the entry,
we call GetRelationPublications() to rebuild it. But even when that call
happens long after the table got added to the publication (both in time
and LSN terms), it still returns NIL as if the table was in no
publications, and we end up with pubinsert=false, skipping the inserts
again.
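
To be clear, at that point the catalog itself (as seen from a regular
backend on the publisher) does show the table as published, e.g.:

-- publisher: both of these show t1 as a member of the publication
SELECT * FROM pg_publication_tables WHERE tablename = 't1';
SELECT prpubid, prrelid::regclass FROM pg_publication_rel
 WHERE prrelid = 't1'::regclass;

which is why this smells like stale cache contents rather than missing
catalog rows.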

Attached are three patches against master. 0001 adds some debug logging
that I found useful when investigating the issue. 0002 illustrates the
issue by forcefully invalidating the entry for each change, and by
implementing a non-syscache variant of GetRelationPublications().
This makes the code unbearably slow, but with both changes in place I
can no longer reproduce the issue. Undoing either of the two changes
makes it reproducible again. (I'll talk about 0003 later.)

I suppose timing matters, so it's possible it gets "fixed" simply
because of that, but I find that unlikely given the number of runs I did
without observing any failure.

Overall, this looks, walks and quacks like a cache invalidation issue,
likely a missing invalidation somewhere in the ALTER PUBLICATION code.
If we fail to invalidate the pg_publication_rel syscache somewhere, that
would obviously explain why GetRelationPublications() returns stale
data, but it would also explain why the RelationSyncEntry is not
invalidated, as that happens in a syscache callback.

But I tried to do various crazy things in the ALTER PUBLICATION code,
and none of that worked, so I'm a bit confused/lost.

However, while randomly poking at different things, I realized that if I
change the lock obtained on the relation in OpenTableList() from
ShareUpdateExclusiveLock to ShareRowExclusiveLock, the issue goes away.
I don't know why it works, and I don't even recall what exactly led me
to the idea of changing it.

This is what 0003 does - it reverts 0002 and changes the lock level.

AFAIK the logical decoding code doesn't actually acquire locks on the
decoded tables, so why would this change matter? The only place that
does lock the relation is the tablesync, which gets RowExclusiveLock on
it. And it's interesting that RowExclusiveLock does not conflict with
ShareUpdateExclusiveLock, but does with ShareRowExclusiveLock. But why
would this even matter, when the tablesync can only touch the table
after it gets added to the publication?
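
For reference, the difference in conflicts is easy to see with explicit
LOCK TABLE commands in two sessions (hypothetical table t1 again):

-- session 1: what OpenTableList() acquires today
BEGIN;
LOCK TABLE t1 IN SHARE UPDATE EXCLUSIVE MODE;

-- session 2: does not block against session 1
BEGIN;
LOCK TABLE t1 IN ROW EXCLUSIVE MODE;

-- had session 1 used SHARE ROW EXCLUSIVE MODE instead (as 0003 makes
-- OpenTableList() do), session 2 would block until session 1 finishes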

regards

[1] https://www.postgresql.org/message-id/ZTu8GTDajCkZVjMs@depesz.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
0001-debug-logging.patch text/x-patch 8.4 KB
0002-experimental-fix.patch text/x-patch 2.8 KB
0003-alternative-experimental-fix-lock.patch text/x-patch 3.4 KB
crash-test.sh application/x-shellscript 8.8 KB
run.sh application/x-shellscript 1.2 KB
