From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
Subject: | Re: Failure in subscription test 004_sync.pl |
Date: | 2021-06-14 05:11:14 |
Message-ID: | CAD21AoBWqUPeXbZviHWZhdZVMP3vop_k43ipzFwVa=emE7M3Og@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sat, Jun 12, 2021 at 9:57 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sat, Jun 12, 2021 at 1:13 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> >
> > wrasse has just failed with what looks like a timing error with a
> > replication slot drop:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30
> >
> > Here is the error:
> > error running SQL: 'psql:<stdin>:1: ERROR: could not drop replication
> > slot "tap_sub" on publisher: ERROR: replication slot "tap_sub" is
> > active for PID 1641'
> >
> > It seems to me that this just lacks a poll_query_until() doing some
> > slot monitoring?
> >
>
> I think it is showing a race condition issue in the code. In
> DropSubscription, we first stop the worker that is receiving the WAL,
> and then in a separate connection with the publisher, it tries to drop
> the slot which leads to this error. The reason is that walsender is
> still active as we just wait for wal receiver (or apply worker) to
> stop. Normally, as soon as the apply worker is stopped the walsender
> detects it and exits but in this case, it took some time to exit, and
> in the meantime, we tried to drop the slot which is still in use by
> walsender.
There might be possible.
That's weird since DROP SUBSCRIPTION executes DROP_REPLICATION_SLOT
command with WAIT option. I found a bug that is possibly an oversight
of commit 1632ea4368. The commit changed the code around the error as
follows:
if (active_pid != MyProcPid)
{
- if (behavior == SAB_Error)
+ if (!nowait)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_IN_USE),
errmsg("replication slot \"%s\" is active for PID %d",
NameStr(s->data.name), active_pid)));
- else if (behavior == SAB_Inquire)
- return active_pid;
/* Wait here until we get signaled, and then restart */
ConditionVariableSleep(&s->active_cv,
The condition should be the opposite; we should raise the error when
'nowait' is true. I think this is the cause of the test failure. Even
if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we
don't wait but raise the error.
Attached a small patch fixes it.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachment | Content-Type | Size |
---|---|---|
fix_slot_drop.patch | application/octet-stream | 406 bytes |
From | Date | Subject | |
---|---|---|---|
Next Message | Drouvot, Bertrand | 2021-06-14 05:41:12 | Re: Minimal logical decoding on standbys |
Previous Message | Shinya11.Kato | 2021-06-14 04:53:58 | RE: [PATCH] expand the units that pg_size_pretty supports on output |