Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Henry Hinze <henry(dot)hinze(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date: 2020-11-25 05:49:51
Message-ID: CAHut+Ps2W7KP46zFQ9ybEckJoRKpT6zh6aPj31L2RV9qBcoLRg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Mon, Nov 23, 2020 at 8:43 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

> Peter, can you also please once test the attached and see if this
> fixes the problem for you as well?

I have reviewed the v3 patch code, and repeated the tablesync testing
(for a stream-enabled SUBSCRIPTION) using the debugging technique
previously described [1].
https://www.postgresql.org/message-id/CAHut%2BPt4PyKQCwqzQ%3DEFF%3DbpKKJD7XKt_S23F6L20ayQNxg77A%40mail.gmail.com

Now the test was successful. The stream start/stop/commit/abort are
all being handled by the tablesync worker without errors.

===

TESTING

I performed the tablesync stream test by multiple different ways of
inserting the en masse data:

1. Cause streaming (using a PREPARE TRANSACTION)
------------------------------------------------
psql -d test_pub -c "BEGIN;INSERT INTO test_tab SELECT i, md5(i::text)
FROM generate_series(3, 5000) s(i);PREPARE TRANSACTION
'test_prepared_tab';"
psql -d test_pub -c "COMMIT PREPARED 'test_prepared_tab';"

tablesync does 'S' (stream start)
tablesync does 'R'
tablesync does 'I' (insert) x n
tablesync does 'E' (stream end)
tablesync does 'S'
tablesync does 'I' x n
tablesync does 'E'
tablesync does { 'S', 'I' x n, 'E' } repeats 13 more times
tablesync does 'c' (stream commit)
- In handle_stream_commit the tablesync worker processes the spool
file, calling apply_dispatch for all the messages
- 'R'
- { 'I' + should_apply_changes_for_rel = true } x 5000
- then calls process_syncing_tables_for_sync, which sets the state to
SUBREL_STATE_SYNCDONE
- then tablesync worker process exits

Now the "apply" worker catches up.
- it runs all the same messages again but does nothing because
should_apply_changes_for_rel = false

2. Cause streaming (not using a PREPARE TRANSACTION)
----------------------------------------------------
psql -d test_pub -c "INSERT INTO test_tab SELECT i, md5(i::text) FROM
generate_series(3, 5000) s(i);"

gives very similar results to above

3. Causing a stream abort to happen
-----------------------------------
psql -d test_pub -c "BEGIN; INSERT INTO test_tab SELECT i,
md5(i::text) FROM generate_series(3, 5000) s(i); INSERT INTO test_tab
VALUES(4999, 'already exists'); END;"

tablesync does 'S' (stream start)
tablesync does 'E' (stream end)
tablesync does 'A' (stream abort)
tablesync does 'B'
tablesync does 'C'
- calls process_syncing_tables
- tablesync worker process exits
apply worker continues...

===

REVIEW COMMENTS

Despite the tests working OK I thought there should be a couple small
changes made for this patch, as follows.

(1) process_sync_tables should be last

IMO the process_syncing_tables should be always the *last* thing any
apply handler does, because it seems a bad idea for the worker to
change state (e.g. to say it is SUBREL_STATE_SYNCDONE) if in fact
there is still a little bit more processing remaining. This is why I
made the same modification in the other WIP patch v26-0006 [2]
[2] https://www.postgresql.org/message-id/CAHut%2BPuQcLbjrBXM54%2By%2B%3DYsmEDFd3CCtp31K-_jS4Xka2vwbQ%40mail.gmail.com

So, I think the process_syncing_tables() call should be put *after*
the stream_cleanup_files() in function apply_handle_stream_commit().
But to do this means the process_syncing_tables has to be removed from
your new apply_handle_commit_internal function.

(2) misleading comment

/*
* Start a transaction on stream start, this transaction will be committed
* on the stream stop. We need the transaction for handling the buffile,
* used for serializing the streaming data and subxact info.
*/

That above comment (in the apply_handle_stream_start function) may now
be inaccurate/misleading for the case of tablesync worker, because I
think for tablesync you are explicitly avoiding the tx commit within
the apply_handle_stream_stop().

---

Kind Regards,
Peter Smith.
Fujitsu Australia

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message David Turoň 2020-11-25 07:46:32 Re: BUG #16743: psql doesn't show whole expression in stored column
Previous Message Manoj Kumar 2020-11-25 04:39:37 Re: BUG #16739: Temporary files not deleting from data folder on disk