From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Logical replication is missing block of rows when sending initial sync?
Date: 2023-10-27 13:33:13
Message-ID: ZTu8GTDajCkZVjMs@depesz.com
Lists: pgsql-bugs
Hi,
we have a situation where we use native logical replication to copy data
from pg12 (on ubuntu bionic) to pg14 (on ubuntu focal).
Generally it mostly works.
The process is like this:
1. create all tables on the destination, with *just* primary keys, no other
indexes or constraints
2. set up logical replication, adding tables in batches of 40, adding a new
batch to the publication only when all tables are in 'r' state (rough sketch
of this step below)
3. once all tables are replicating, add indexes, fkeys, triggers and
the like on the recipient.
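To be clear, step 2 is roughly something like this (publication/subscription
names below are made up for illustration; the 'r' state is srsubstate in
pg_subscription_rel on the subscriber):
#v+
-- on the pg12 publisher: add the next batch of tables (names are hypothetical)
ALTER PUBLICATION upgrade_pub ADD TABLE c33s16831.subs;

-- on the pg14 subscriber: pick up the newly published tables
ALTER SUBSCRIPTION upgrade_sub REFRESH PUBLICATION;

-- on the subscriber: wait until nothing is left outside the 'r' (ready) state
SELECT srrelid::regclass, srsubstate
FROM pg_subscription_rel
WHERE srsubstate <> 'r';
#v-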
Every so often we got errors when adding fkeys. Usually I didn't have
much time to investigate, but today I had a bit. Unfortunately the replica
is gone (I had to reset the replication). But when I had it there
I noticed that a block of rows from one of the tables was missing.
I am not entirely sure about the size of the block, but I checked 20 rows
with consecutive ids, and they were all missing from the focal side:
#v+
=# select ctid, xmin, id, created_at, updated_at from c33s16831.subs where id between 19733130 and 19733150 order by 1;
ctid │ xmin │ id │ created_at │ updated_at
─────────────┼────────────┼──────────┼────────────────────────────┼────────────────────────────
(486337,57) │ 2668461379 │ 19733130 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,58) │ 2668461379 │ 19733131 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,59) │ 2668461379 │ 19733132 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,60) │ 2668461379 │ 19733133 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,61) │ 2668461379 │ 19733134 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,62) │ 2668461379 │ 19733135 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,63) │ 2668461379 │ 19733136 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,64) │ 2668461379 │ 19733137 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,65) │ 2668461379 │ 19733138 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,66) │ 2668461379 │ 19733139 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486337,67) │ 2668461379 │ 19733140 │ 2023-10-25 19:11:41.926204 │ 2023-10-25 19:11:41.926204
(486340,38) │ 2668691763 │ 19733147 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:07.360778
(486340,58) │ 2668695225 │ 19733146 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:15.227267
(486340,61) │ 2668711141 │ 19733144 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:43.499749
(486340,62) │ 2668717802 │ 19733149 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:20:49.900798
(486340,63) │ 2668723876 │ 19733141 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:00.749445
(486340,64) │ 2668753420 │ 19733142 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:22:02.140332
(486340,66) │ 2668761678 │ 19733150 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:22:19.140231
(486341,9) │ 2668726256 │ 19733145 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:06.845051
(486341,17) │ 2668748364 │ 19733148 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:50.246899
(486341,18) │ 2668750613 │ 19733143 │ 2023-10-25 19:19:06.299059 │ 2023-10-25 19:21:55.863125
(21 rows)
#v-
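(For anyone wanting to run the same kind of check, a sketch on the
focal/pg14 side, using the table and id range from the query above, would be:)
#v+
-- run on the subscriber; returns the ids from the range that are absent there
SELECT g.id
FROM generate_series(19733130, 19733150) AS g(id)
LEFT JOIN c33s16831.subs s ON s.id = g.id
WHERE s.id IS NULL;
#v-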
Last vacuum (from autovacuum) was ~ 3 weeks ago, so it isn't a factor.
I don't know exactly when this batch was added to the publication, but I found that the COPY that did the initial sync of the data was logged at:
#v+
2023-10-25 18:48:43.982 UTC,"upguser","dbname",12995,"10.1.207.7:44956",65396265.32c3,6,"COPY",2023-10-25 18:45:57 UTC,74/160580191,0,LOG,00000,"duration: 166355.331 ms statement: COPY c33s16831.subs TO STDOUT",,,,,,,,,"pg_443910492_sync_443582288_7292150788357777097"
#v-
Unfortunately I don't have a way to reproduce the problem. I just noticed
that on some dbs that we're upgrading the problem is more common, and on
some it generally doesn't happen.
So, I don't have high hopes re: finding/fixing the issue, but I figured
I'd let you know that such a situation exists; perhaps it will strike
someone as related to something that could be the problem...
Best regards,
depesz