From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Fujii Masao <fujii(at)postgresql(dot)org> |
Subject: | Re: Race condition in InvalidateObsoleteReplicationSlots() |
Date: | 2021-04-08 02:09:13 |
Message-ID: | 20210408020913.zzprrlvqyvlt5cyy@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2021-04-07 17:10:37 -0700, Andres Freund wrote:
> I think this can be solved in two different ways:
>
> 1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
> InvalidateObsoleteReplicationSlots(). That way nobody could re-create a new
> slot in the to-be-obsoleted-slot's place.
>
> 2) Atomically check whether the slot needs to be invalidated and try to
> acquire if needed. Don't release ReplicationSlotControlLock between those
> two steps. Signal the owner to release the slot iff we couldn't acquire the
> slot. In the latter case wait and then recheck if the slot still needs to
> be dropped.
>
> To me 2) seems better, because we then can also be sure that the slot still
> needs to be obsoleted, rather than potentially doing so unnecessarily.
>
>
> It looks to me like several of the problems here stem from trying to reuse
> code from ReplicationSlotAcquireInternal() (which before this was just named
> ReplicationSlotAcquire()). I don't think that makes sense, because cases like
> this want to check if a condition is true, and acquire it only if so.
>
> IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
> except that a different condition is checked, and the if (active_pid) case
> needs to prepare a condition variable, signal the owner and then wait on the
> condition variable, to restart after.
I'm also confused by the use of ConditionVariableTimedSleep(timeout =
10). Why do we need a timed sleep here in the first place? And why with
such a short sleep?
I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS(); -
but is aware it's running in checkpointer. I don't think CFI does much
there? If we are worried about needing to check for interrupts, more
work is needed.
Sketch for a fix attached. I did leave the odd
ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's
there...
After this I don't see a reason to have SAB_Inquire - as far as I can
tell it's practically impossible to use without race conditions? Except
for raising an error - which is "builtin"...
Greetings,
Andres Freund
Attachment | Content-Type | Size |
---|---|---|
0001-WIP-Sketch-for-a-fix-for-InvalidateObsoleteReplicati.patch | text/x-diff | 7.6 KB |
0002-WIP-Remove-SlotAcquireBehavior-again.patch | text/x-diff | 7.4 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Bharath Rupireddy | 2021-04-08 02:09:48 | Re: Why is specifying oids = false multiple times in create table is silently ignored? |
Previous Message | Tom Lane | 2021-04-08 02:00:08 | Re: psql \df choose functions by their arguments |