Issue with logical replication

From: Kacey Holston <kacey(dot)holston(at)pgexperts(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Issue with logical replication
Date: 2021-08-08 20:28:28
Message-ID: 8CE7E1D1-56E6-4836-8F17-235B6FFA4EBD@pgexperts.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

I ran into an issue of missing rows during an upgrade from PostgreSQL 11 to PostgreSQL 13 using in core logical replication.

## Procedure Used:

After copying over all the users and schema I ran this on the publisher:

"CREATE PUBLICATION upgrade11to13 FOR ALL TABLES;"

Then ran this on the subscriber:

"CREATE SUBSCRIPTION server13 CONNECTION ‘dbname=<DBNAME redacted> hostaddr=<ADDRESS redacted > user=<USER redacted > password=<PASSWORD redacted > port=5432' PUBLICATION upgrade11to13;"

The initial copy and replication appeared to run as expected. I saw all the tables in the ready state in pg_subscription_rel. The tables pg_stat_repllication and pg_stat_subscription looked as expected as well.

We then reversed the direction of replication by running this on the PostgreSQL 13 server:

"DROP SUBSCRIPTION server13;”

And this on the PostgreSQL 11 server:

"DROP PUBLICATION upgrade11to13;”

Back on the PostgreSQL 13 server:

"CREATE PUBLICATION reverse_to_11 FOR ALL TABLES;"

And on the PostgreSQL 11 server:

"CREATE SUBSCRIPTION oldserver11 CONNECTION 'dbname=<DBNAME redacted> hostaddr=<ADDRESS redacted > user=<USER redacted > password=<PASSWORD redacted > port=5432' PUBLICATION reverse_to_11 with (copy_data=false);"

At this point unexpected query results were returning and we discovered it was from missing rows in at least one table. (We unfortunately never checked additional tables).

We were able to determine that all the rows missing were created on the publisher starting seven minutes after replication began and lasted for three hours.

This table had a date created time stamp so we checked what time period the rows were missing from. If you run a query requesting all the records created during those three hours there is a 50k record discrepancy. If you run a query requesting all the records created after those three hours all rows are in-tact.

We did not retain the logs but at the time did not see any related errors.

## Second try:

We attempted to run this again.

We created a similar, new and fresh instance. Followed the same steps as above but this time created a replication slot manually.

We received identical results. The records were missing from a handful of minutes after replication starts and last for approximately 3 hours.

We were able to note that the data loss appeared to coincide approximately with the start of the table’s copy but the data loss ended before the table finished the copy. The table took about 12 hours to copy while the data loss was only three hours.

## Logs:

On the Subscriber, during the entire initial copy there were only four, spread out errors of the type:

"could not receive data from WAL stream: SSL SYSCALL error: EOF detected”

After these calls the next line in the server indicates a new copy and for different tables.

There were no related errors on the Publisher side.

## Server information

Publisher:

Ubuntu 16.04.7 LTS
GNU/Linux 4.4.0-1114-aws x86_64

Subscriber:

Ubuntu 20.04.2
GNU/Linux 5.4.0-1045-aws x86_64

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2021-08-09 08:58:12 Re: Issue with logical replication
Previous Message Noah Misch 2021-08-08 16:37:52 Re: CREATE INDEX CONCURRENTLY does not index prepared xact's data