Re: Review for GetWALAvailability()

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: masao(dot)fujii(at)oss(dot)nttdata(dot)com
Cc: alvherre(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Review for GetWALAvailability()
Date: 2020-06-17 08:30:58
Message-ID: 20200617.173058.579037591051032616.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Wed, 17 Jun 2020 17:01:11 +0900, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote in
>
>
> On 2020/06/17 12:10, Kyotaro Horiguchi wrote:
> > At Tue, 16 Jun 2020 22:40:56 -0400, Alvaro Herrera
> > <alvherre(at)2ndquadrant(dot)com> wrote in
> >> On 2020-Jun-17, Fujii Masao wrote:
> >>> On 2020/06/17 3:50, Alvaro Herrera wrote:
> >>
> >>> So InvalidateObsoleteReplicationSlots() can terminate normal backends.
> >>> But do we want to do this? If we want, we should add the note about this
> >>> case into the docs? Otherwise the users would be surprised at termination
> >>> of backends by max_slot_wal_keep_size. I guess that it's basically rarely
> >>> happen, though.
> >>
> >> Well, if we could distinguish a walsender from a non-walsender process,
> >> then maybe it would make sense to leave backends alive. But do we want
> >> that? I admit I don't know what would be the reason to have a
> >> non-walsender process with an active slot, so I don't have a good
> >> opinion on what to do in this case.
> > The non-walsender backend is actually doing replication work. It
> > rather should be killed?
>
> I have no better opinion about this. So I agree to leave the logic as it is
> at least for now, i.e., we terminate the process owning the slot whatever
> the type of process is.

Agreed.

> >>>>> + /*
> >>>>> + * Signal to terminate the process using the replication slot.
> >>>>> + *
> >>>>> + * Try to signal every 100ms until it succeeds.
> >>>>> + */
> >>>>> + if (!killed && kill(active_pid, SIGTERM) == 0)
> >>>>> + killed = true;
> >>>>> + ConditionVariableTimedSleep(&slot->active_cv, 100,
> >>>>> + WAIT_EVENT_REPLICATION_SLOT_DROP);
> >>>>> + } while (ReplicationSlotIsActive(slot, NULL));
> >>>>
> >>>> Note that here you're signalling only once and then sleeping many times
> >>>> in increments of 100ms -- you're not signalling every 100ms as the
> >>>> comment claims -- unless the signal fails, but you don't really expect
> >>>> that. On the contrary, I'd claim that the logic is reversed: if the
> >>>> signal fails, *then* you should stop signalling.
> >>>
> >>> You mean; in this code path, signaling fails only when the target process
> >>> disappears just before signaling. So if it fails, slot->active_pid is
> >>> expected to become 0 even without signaling more. Right?
> >>
> >> I guess kill() can also fail if the PID now belongs to a process owned
> >> by a different user.
>
> Yes. This case means that the PostgreSQL process using the slot disappeared
> and the same PID was assigned to non-PostgreSQL process. So if kill() fails
> for this reason, we don't need to kill() again.
>
> >> I think we've disregarded very quick reuse of
> >> PIDs, so we needn't concern ourselves with it.
> > The first call to ConditionVariableTimedSleep doesn't actually
> > sleep, so the loop works as expected. But we may make an extra call
> > to kill(2). Calling ConditionVariablePrepareToSleep before the
> > loop would make it better.
>
> Sorry I failed to understand your point...

My point is that ConditionVariableTimedSleep does *not* actually sleep
on the CV the first time it is called in this usage, because the CV has
not been prepared beforehand. The new version avoids the useless
kill(2) call anyway, but it may still make an extra call to
ReplicationSlotAcquireInternal. I think we should call
ConditionVariablePrepareToSleep before the surrounding for statement
block.
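
To illustrate the extra call, here is a toy model in plain C. It is only
a sketch and not PostgreSQL code: ToyCV, toy_prepare_to_sleep,
toy_timed_sleep, toy_try_acquire and run_loop are hypothetical names
made up for this example, and the only behaviour modelled is the one
described above, i.e. a timed sleep on a CV that has not been prepared
just prepares it and returns without sleeping.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a condition variable; NOT PostgreSQL's ConditionVariable. */
typedef struct ToyCV
{
	int			unused;
} ToyCV;

static ToyCV *sleep_target = NULL;	/* CV we are currently prepared to sleep on */
static int	real_sleeps = 0;		/* number of sleeps that really waited */
static int	acquire_calls = 0;		/* number of acquire attempts */

static void
toy_prepare_to_sleep(ToyCV *cv)
{
	sleep_target = cv;
}

/*
 * Assumed behaviour: if the CV was not prepared, prepare it and return
 * immediately without sleeping; otherwise count one real sleep.
 */
static void
toy_timed_sleep(ToyCV *cv)
{
	if (sleep_target != cv)
	{
		toy_prepare_to_sleep(cv);
		return;
	}
	real_sleeps++;
}

/*
 * Hypothetical stand-in for the acquire attempt: the slot becomes free
 * once release_after real sleeps have elapsed.
 */
static bool
toy_try_acquire(int release_after)
{
	acquire_calls++;
	return real_sleeps >= release_after;
}

/* Run the retry loop and return how many acquire attempts were needed. */
static int
run_loop(ToyCV *cv, bool prepare_first, int release_after)
{
	sleep_target = NULL;
	real_sleeps = 0;
	acquire_calls = 0;

	if (prepare_first)
		toy_prepare_to_sleep(cv);

	for (;;)
	{
		if (toy_try_acquire(release_after))
			break;
		toy_timed_sleep(cv);
	}

	sleep_target = NULL;		/* models cancelling the sleep */
	return acquire_calls;
}
```

In this model, if the slot becomes free after one real sleep,
run_loop(&cv, false, 1) needs three acquire attempts while
run_loop(&cv, true, 1) needs two. The difference is the one wasted
iteration caused by the first, non-sleeping call.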

> Anyway, the attached is the updated version of the patch. This fixes
> all the issues in InvalidateObsoleteReplicationSlots() that I reported
> upthread.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
