Re: Logical replication is missing block of rows when sending initial sync?

From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, kuroda(dot)hayato(at)fujitsu(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Logical replication is missing block of rows when sending initial sync?
Date: 2023-11-10 11:44:17
Message-ID: ZU4XkdOPCthiyN7L@depesz.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Thu, Nov 09, 2023 at 08:14:46PM +0100, Tomas Vondra wrote:
>
>
> On 11/8/23 15:02, hubert depesz lubaczewski wrote:
> >> OK. found another case (maybe two).
> >
> > OK, the other case is real. Different database. Same scenario.
> >
> > Errors (3 because we have 3 missing fkeys):
> >
> > #v+
> > ALTER TABLE ONLY c44s16539.assignment_overrides ADD CONSTRAINT fk_rails_58f8ee369b FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
> > ERROR: insert or update on table "assignment_overrides" violates foreign key constraint "fk_rails_58f8ee369b"
> > DETAIL: Key (assignment_id)=(248083) is not present in table "assignments".
> >
> > ALTER TABLE ONLY c44s16539.post_policies ADD CONSTRAINT fk_rails_cf2d119863 FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
> > ERROR: insert or update on table "post_policies" violates foreign key constraint "fk_rails_cf2d119863"
> > DETAIL: Key (assignment_id)=(248083) is not present in table "assignments".
> >
> > ALTER TABLE ONLY c44s16539.submissions ADD CONSTRAINT fk_rails_61cac0823d FOREIGN KEY (assignment_id) REFERENCES c44s16539.assignments(id);
> > ERROR: insert or update on table "submissions" violates foreign key constraint "fk_rails_61cac0823d"
> > DETAIL: Key (assignment_id)=(248083) is not present in table "assignments".
> > #v-
> >
> > In all cases the problem is missing c44s16539.assignments with is = 248083.
> >
> > This time it looks that only one row is missing:
> >
> > #v+
> > Pg14# select max(id) from c44s16539.assignments where id < 248083;
> > max
> > ────────
> > 248082
> > (1 row)
> >
> > Pg14# select min(id) from c44s16539.assignments where id > 248083;
> > min
> > ────────
> > 248084
> > (1 row)
> > #v-
> >
> > Schema is complicated. 76 columns. On pg12 side one column is dropped.
> > 22 indexes, 9 fkeys from this table (assignments) to others, ~ 35 fkeys
> > pointing to this table (but only 3 have rows that reference the
> > missing record.
> >
> > Some informationa bout the record:
> >
> > #v+
> > Pg12# select ctid, xmin, cmin, xmax, cmax, id, created_at, updated_at from c44s16539.assignments where id = 248083 \gx
> > ─[ RECORD 1 ]──────────────────────────
> > ctid │ (24192,1)
> > xmin │ 472291499
> > cmin │ 12
> > xmax │ 472291499
> > cmax │ 12
> > id │ 248083
> > created_at │ 2023-11-07 15:50:05.506077
> > updated_at │ 2023-11-07 15:51:28.738893
> > #v-
> >
> > logs for when it was added to replication:
> >
> > #v+
> > 2023-11-07 15:47:27.898 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,39,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.476 ms",,,,,,,,,"psql"
> > 2023-11-07 15:47:27.899 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,40,"idle",2023-11-07 15:47:27 UTC,6/1209097693,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c247s1094.messages_2022_50;",,,,,,,,,"psql"
> > 2023-11-07 15:47:27.900 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,41,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.493 ms",,,,,,,,,"psql"
> > 2023-11-07 15:47:27.900 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,42,"idle",2023-11-07 15:47:27 UTC,6/1209097694,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c44s16539.assignments;",,,,,,,,,"psql"
> > 2023-11-07 15:47:27.902 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,43,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.936 ms",,,,,,,,,"psql"
> > 2023-11-07 15:47:27.902 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,44,"idle",2023-11-07 15:47:27 UTC,6/1209097695,0,LOG,00000,"statement: ALTER PUBLICATION focal14 ADD TABLE ONLY c39s5131.messages_2023_20;",,,,,,,,,"psql"
> > 2023-11-07 15:47:27.903 UTC,"postgres","dbname",2333,"[local]",654a5c0f.91d,45,"ALTER PUBLICATION",2023-11-07 15:47:27 UTC,6/0,0,LOG,00000,"duration: 1.292 ms",,,,,,,,,"psql"
> > #v-
> >
> > It is interesting that the time when the row was inserted (created_at) is
> > always "around" the time that the table was added to publication...
> >
> > Just like with the other case, I can leave it in current state for
> > a day, so if you'd need more information from it, let me know.
> >
>
> I think it'd be interesting to look at the WAL using pg_waldump, and see
> how it relates to the LSN used for the tablesync slot.

well, I can try to dump, and show data, but it will most likely be
a lot. Are you more interested in looking into wal around the second
where data was inserted, or when table was added to publication?

> I'll try reproducing this locally over the weekend. Should I use the
> test_1030.sh script that you shared a week ago, or do I need to do
> something more?

That question is probably not to me, but to Hayato Kuroda

> One thing that puzzles me is that IIUC the script doesn't generate any
> changes while adding the tables to the subscription? I probably should
> have paid more attention to the details, instead of assuming there are
> concurrent changes - which seemed like there might be some race, or
> something. But this seems like there are no changes and the tablesync
> just fails to copy some subset of it. That's *really* strange.

This script was made by Hayato Kuroda, as far as I can tell. Our
situation is not like this.

In our case there is constant traffic on the publisher from normally
working app.

Best regards,

depesz

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message hubert depesz lubaczewski 2023-11-10 13:08:10 Re: Logical replication is missing block of rows when sending initial sync?
Previous Message PG Bug reporting form 2023-11-10 11:07:20 BUG #18189: Value partitioned tables: Upd ERROR: more than one row returned by a subquery used as an expression