From: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
---|---|
To: | Petr Jelinek <petr(at)2ndquadrant(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Henry Hinze <henry(dot)hinze(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
Subject: | Re: BUG #16643: PG13 - Logical replication - initial startup never finishes and gets stuck in startup loop |
Date: | 2020-10-14 01:12:07 |
Message-ID: | 20201014011207.GA18985@alvherre.pgsql |
Lists: | pgsql-bugs |
On 2020-Oct-12, Petr Jelinek wrote:
> It's not only about size of the tables, it's mainly that there is no write
> activity so the main apply is not moving past the LSN at which table sync
> has started at. With bigger table you get at least something written
> (running xacts, autovacuum, or whatever) that moves lsn forward eventually.
I see -- yeah, okay.
> > However, and this is one reason why I'd welcome Petr/Peter thoughts on
> > this, I don't really understand what happens in LogicalRepApplyLoop
> > afterwards with a tablesync worker; are we actually doing anything
> > useful there, considering that the actual data copy seems to have
> > occurred in the CopyFrom() call in copy_table? In other words, by the
> > time we return control to ApplyWorkerMain with a slot name, isn't the
> > work all done, and the only thing we need is to synchronize protocol and
> > close the connection?
>
> There are 2 possible states at that point, either tablesync is ahead (when
> main apply lags or nothing is happening on publication side) or it's behind
> the main apply. When tablesync is ahead we are indeed done and just need to
> update the state of the table (which is what the code you removed did, but
> LogicalRepApplyLoop should do it as well, just a bit later). When it's
> behind we need to do catchup for that table only which still happens in the
> tablesync worker. See the explanation at the beginning of tablesync.c, it
> probably needs some small adjustments after the changes in your first patch.
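The two states described in the quoted text can be sketched as a single LSN comparison. This is only an illustration, not the code in tablesync.c: `lsn_t`, `sync_action`, and `choose_sync_action` are hypothetical names invented here, and treating equal positions as "ahead" is an assumption.

```c
#include <stdint.h>

/* Hypothetical stand-in for a WAL position (PostgreSQL uses a 64-bit LSN). */
typedef uint64_t lsn_t;

/* The two outcomes described in the quoted text (names are illustrative). */
typedef enum
{
    SYNC_MARK_DONE,     /* tablesync is ahead: only the table state needs updating */
    SYNC_CATCHUP        /* tablesync is behind: replay changes for this table only */
} sync_action;

/*
 * Decide what a tablesync worker still has to do after the initial copy.
 * Assumption: a tablesync position equal to the main apply position is
 * treated the same as being ahead, since there is nothing left to replay.
 */
static sync_action
choose_sync_action(lsn_t tablesync_lsn, lsn_t apply_lsn)
{
    if (tablesync_lsn >= apply_lsn)
        return SYNC_MARK_DONE;
    return SYNC_CATCHUP;
}
```

With no write activity on the publisher, the main apply position stays put, so the first branch is the one a small, idle table would be expected to take.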
... Ooh, things start to make some sense now. So how about the
attached? There are some not really related cleanups. (Changes to
protocol.sgml are still pending.)
If I understand correctly, the early exit in tablesync.c is not saving *a
lot* of time (we don't actually skip replaying any WAL), even if it's
saving execution of a bunch of code. So I stand by my position that
removing the code is better because it's clearer about what is actually
happening.
Attachment | Content-Type | Size |
---|---|---|
0001-Restore-logical-replication-dupe-command-tags.patch | text/x-diff | 3.1 KB |
0002-Review-logical-replication-tablesync-code.patch | text/x-diff | 15.8 KB |