From: | Peter Smith <smithpb2250(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Henry Hinze <henry(dot)hinze(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
Subject: | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop |
Date: | 2020-11-17 02:14:23 |
Message-ID: | CAHut+Psprtsa4o89wtNnKLxxwXeDKAX9nNsdghT1Pv63siz+AA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
On Tue, Nov 17, 2020 at 1:07 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> Yeah, this seems to be possible and this is the reason I mentioned
> above to dig more into this case. Did you try it via some test case? I
> think we can generate a test via debugger where after the tablesync
> worker reaches CATCHUP state stop it via debugger, then we can perform
> some large transaction on the same table which apply worker will skip
> and tablesync worker will try to apply changes and should fail.
Hello Amit.
FYI - This email is just to confirm that your above idea for debugging
the tablesync worker does work.
I have made a small temporary patch to aid testing this: PSA.
The patch just adds some more LOGs - it helps see what is going on.
The patch also gives a 30 second opportunity to attach to the
tablesync worker process.
----
So the necessary debugging steps are like this:
0. Start debugger
- Later we will attach to the tablesync worker process so it is
convenient to set all the breakpoint in advance
(gdb) b LogicalRepSyncTableStart
(gdb) b LogicalRepApplyLoop
(gdb) b apply_dispatch
(gdb) b process_syncing_tables_for_sync
1. Pub/Sub nodes: Create/Start Publisher and Subscriber nodes with
same test table.
2. Pub node: Add some initial data to the table (to give tablesync
init something to copy)
3. Pub node: CREATE PUBLISHER
4. Sub node: CREATE SUBSCRIBER
- at this point the slot gets created and the tablesync worker and
apply worker are launched
- that patch causes the tablesync worker to sleep 30 seconds to give
opportunity to attach to it in debugger
5. Debugger/tablesync: Attach to the tablesync worker
- continue
- breaks at LogicalRepSyncTableStart
- step over all the initial COPY code to the end of the function
(where apply worker has told it to CATCHUP)
6. Pub: While the tablesync worker is paused in debugger do some other
operations on the original table
- e.g. psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 'foo');"
7. Debugger/tablesync: Allow the tablesync worker to continue
- tablesync worker will return from LogicalRepSyncTableStart and then
into LogicalRepApplyLoop
- if tablesync worker detects it is "behind" then the
LogicalRepApplyLoop will call apply_dispatch to deal with the message
resulting from the step 6 operation.
8. Debugger/tablesync: Some apply handlers (e.g. apply_handle_commit)
if not the LogicalRepApplyLoop itself, will end up calling the
process_syncing_tables.
- which calls process_syncing_tables_for_sync
- which will cause the tablesync worker to set SYNCDONE status just
before killing itself.
9. Sub node: The apply worker wait loop will now proceed again
tick/tick/tick every 1 second to process any incoming replication
messages.
---
Using the above technique, you can cause a tablesync worker to enter
some of the normal "apply" handling which might otherwise be very hard
to reproduce outside of the debugger.
---
Here is an example of some of the logging for the above debugging scenario:
2020-11-17 10:46:11.325 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:12.328 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:12.328 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:13.337 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:13.337 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:14.340 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:14.340 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:15.155 AEDT [9841] LOG: !!>> tablesync worker: About
to call LogicalRepSyncTableStart to do initial syncing
2020-11-17 10:46:15.343 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:15.343 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:15.343 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:15.343 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:16.349 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:16.349 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:17.351 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:17.351 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:18.353 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:18.353 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:18.650 AEDT [10734] LOG: logical decoding found
consistent point at 0/1B6D5E0
2020-11-17 10:46:18.650 AEDT [10734] DETAIL: There are no running transactions.
2020-11-17 10:46:18.650 AEDT [10734] STATEMENT:
CREATE_REPLICATION_SLOT "tap_sub_24599_sync_16385" TEMPORARY LOGICAL
pgoutput USE_SNAPSHOT
2020-11-17 10:46:18.658 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:18.658 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:46:18.744 AEDT [9841] LOG: !!>> tablesync worker: wait
for CATCHUP state notification
2020-11-17 10:46:18.744 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:46:18.744 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 'foo');"
INSERT 0 1
[postgres(at)CentOS7-x64 ~]$ 2020-11-17 10:47:39.269 AEDT [9841] LOG:
!!>> tablesync worker: received CATCHUP state notification
2020-11-17 10:47:44.117 AEDT [9841] LOG: !!>> tablesync worker:
Returned from LogicalRepSyncTableStart
2020-11-17 10:48:05.678 AEDT [10734] LOG: starting logical decoding
for slot "tap_sub_24599_sync_16385"
2020-11-17 10:48:05.678 AEDT [10734] DETAIL: Streaming transactions
committing after 0/1B6D618, reading WAL from 0/1B6D5E0.
2020-11-17 10:48:05.678 AEDT [10734] STATEMENT: START_REPLICATION
SLOT "tap_sub_24599_sync_16385" LOGICAL 0/1B6D618 (proto_version '2',
publication_names '"tap_pub"')
2020-11-17 10:48:05.678 AEDT [10734] LOG: logical decoding found
consistent point at 0/1B6D5E0
2020-11-17 10:48:05.678 AEDT [10734] DETAIL: There are no running transactions.
2020-11-17 10:48:05.678 AEDT [10734] STATEMENT: START_REPLICATION
SLOT "tap_sub_24599_sync_16385" LOGICAL 0/1B6D618 (proto_version '2',
publication_names '"tap_pub"')
2020-11-17 10:48:09.258 AEDT [9841] LOG: !!>> tablesync worker:
LogicalRepApplyLoop
2020-11-17 10:49:36.537 AEDT [9841] LOG: !!>> tablesync worker:
called process_syncing_tables
2020-11-17 10:49:36.538 AEDT [9841] LOG: logical replication table
synchronization worker for subscription "tap_sub", table "test_tab"
has finished
2020-11-17 10:49:36.588 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:36.589 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:36.590 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:37.591 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:37.591 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:38.592 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:38.592 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:39.594 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:39.594 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
2020-11-17 10:49:40.595 AEDT [9838] LOG: !!>> apply worker: LogicalRepApplyLoop
2020-11-17 10:49:40.595 AEDT [9838] LOG: !!>> apply worker: called
process_syncing_tables
---
I have so far only been trying above with the non-streaming
subscription, and TBH stepping through this startup state "dance" of
the tablesync/apply workers is already causing more questions than
answers for me. Anyway, I will raise any questions as separate emails
to this one.
BTW, do you think these tablesync discussions should be moved to
hackers list? I think they are no longer related to the reported BUG
of this current thread.
Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Degugging-tablesync-worker.patch | application/octet-stream | 3.1 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Kapila | 2020-11-17 08:58:53 | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop |
Previous Message | PG Bug reporting form | 2020-11-16 18:15:20 | BUG #16722: PG hanging on COPY when table has close to 2^32 toasts in the table. |