Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Henry Hinze <henry(dot)hinze(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop
Date: 2020-10-10 19:06:37
Message-ID: 20201010190637.GA5774@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 2020-Sep-30, Tom Lane wrote:

> After some digging around, I realize that that commit actually
> resulted in a protocol break. libpqwalreceiver is expecting to
> get an additional CommandComplete message after COPY OUT finishes,
> per libpqrcv_endstreaming(), and it's no longer getting one.
>
> (I have not read the protocol document to see if this is per spec;
> but spec or no, that's what libpqwalreceiver is expecting.)
>
> The question that this raises is how the heck did that get past
> our test suites? It seems like the error should have been obvious
> to even the most minimal testing.

So I think I can answer this now. Petr, as intellectual author of this
code I would appreciate your thoughts on this -- you probably understand
this much better. Same goes for Peter as committer.

I think the logical replication feature is missing an detailed
architectural diagram.

What is happening is that the tablesync code, which is in charge of
initial syncing of tables, has two main code paths: the first one is an
optimization (?) in LogicalRepSyncTableStart() that if the sync takes
little time, the process will exit immediately (via
finish_sync_worker()) without returning control to the main logical
replication apply worker code. In the TAP tests, all tables are pretty
small, so that optimization always fires.

So in that case we never get control back to walrcv_endstreaming (the
complainant in Henry's test case), which is why replication never fails
in the tests. You can verify this because the final "return slotname"
line in LogicalRepSyncTableStart has zero coverage.

The other code path returns the slot name to ApplyWorkerMain, which
does the normal walrcv_startstreaming / LogicalRepApplyLoop /
endstreaming dance. And whenever that code path is taken, all the
src/test/subscription tests fail with the missing command tag, and
they work correctly when the command tag is restored, which is what
we wanted.

My proposal at this stage is to remove the optimization in
LogicalRepSyncTableStart, per 0001, if only because the code becomes
simpler (0002 reindents code). With this, we get a coverage increase
for tablesync.c not only in the final "return slotname" but also for
process_syncing_tables_for_sync() which is currently uncovered:
https://coverage.postgresql.org/src/backend/replication/logical/tablesync.c.gcov.html#268

However, and this is one reason why I'd welcome Petr/Peter thoughts on
this, I don't really understand what happens in LogicalRepApplyLoop
afterwards with a tablesync worker; are we actually doing anything
useful there, considering that the actual data copy seems to have
occurred in the CopyFrom() call in copy_table? In other words, by the
time we return control to ApplyWorkerMain with a slot name, isn't the
work all done, and the only thing we need is to synchronize protocol and
close the connection?

Or is it?

Another thing I noticed is that if I crank up the number of rows in the
new t/100_bugs.pl test case(*), the two sync workers run in sequence -- not
in parallel. What's going on with that? Shouldn't there be two
simultaneous workers?

(*) In the submitted patch, the test uses 10 million rows. I intend to
reduce that to, say, 5000 rows in the committed version. Strictly
speaking, there's no need for this test at all since with the
LogicalRepSyncTableStart changes I propose, even t/001_rep_changes.pl
fails. But since my new test case fails even without that change, I
prefer to keep it anyway.

Attachment Content-Type Size
0001-Restore-double-command-completion-tags.patch text/x-diff 7.0 KB
0002-move-declarations-and-reindent.patch text/x-diff 6.7 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2020-10-11 00:24:52 Re: BUG #15858: could not stat file - over 4GB
Previous Message Juan José Santamaría Flecha 2020-10-10 19:00:27 Re: BUG #15858: could not stat file - over 4GB