From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>
Subject: Re: Column Filtering in Logical Replication
Date: 2022-03-20 03:11:40
Message-ID: CAA4eK1JzzoE61CY1qi9Vcdi742JFwG4YA3XpoMHwfKNhbFic6g@mail.gmail.com
Lists: pgsql-hackers
On Fri, Mar 18, 2022 at 10:42 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> On 3/18/22 15:43, Tomas Vondra wrote:
> >>
> >
> > Hmmm. So the theory is that in most runs we manage to sync the tables
> > faster than starting the workers, so we don't hit the limit. But on some
> > machines the sync worker takes a bit longer and we hit the limit. Seems
> > possible, yes. Unfortunately, we don't seem to log anything when we hit
> > the limit, so it's hard to say for sure :-( I suggest we add a WARNING
> > message to logicalrep_worker_launch or something. Not just because of
> > this test; it seems useful in general.
> >
> > However, how come we don't retry the sync? Surely we don't just give up
> > forever; that'd be pretty annoying behavior. Presumably we just end up
> > sleeping for a long time before restarting the sync worker, somewhere.
> >
>
> I tried lowering max_sync_workers_per_subscription to 1 and making
> the workers run for a couple of seconds (doing some CPU-intensive
> work), but everything still works just fine.
>
Did the apply worker restart during that time? If not, you can try
changing some subscription parameter, which leads to a restart. This
has to happen before copy_table has finished. In the logs, you should
see the message: "logical replication apply worker for subscription
"<subscription_name>" will restart because of a parameter change".
IIUC, the code that prevents the apply worker from restarting once
max_sync_workers_per_subscription is reached is as below:
logicalrep_worker_launch()
{
    ...
    if (nsyncworkers >= max_sync_workers_per_subscription)
    {
        LWLockRelease(LogicalRepWorkerLock);
        return;
    }
    ...
}
This check happens before we allocate a worker slot, even for the
apply worker. Normally we launch the apply worker first, so the apply
worker can only hit this limit during a restart; in that case, it will
never restart.
> Looking a bit closer at the logs (from pogona and others), I doubt this
> is about hitting the max_sync_workers_per_subscription limit. Notice we
> start two sync workers, but neither of them ever completes. So we never
> update the sync status or start syncing the remaining tables.
>
I think they never complete because they are stuck in a sort of
infinite loop. If you look at process_syncing_tables_for_sync(), it
will never mark the status as SUBREL_STATE_SYNCDONE unless the apply
worker has set it to SUBREL_STATE_CATCHUP. In
LogicalRepSyncTableStart(), we do wait for a state change to catchup
via wait_for_worker_state_change(), but we bail out of that function
if the apply worker has died. After that, the tablesync worker won't
be able to complete because, in our case, the apply worker won't be
able to restart.
> So the question is why those two sync workers never complete - I guess
> there's some sort of lock wait (deadlock?) or infinite loop.
>
It would be a bit tricky to reproduce this even if the above theory is
correct, but I'll try it today or tomorrow.
--
With Regards,
Amit Kapila.