From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | masao(dot)fujii(at)oss(dot)nttdata(dot)com |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Review for GetWALAvailability() |
Date: | 2020-06-16 05:00:26 |
Message-ID: | 20200616.140026.1624101556608079529.horikyota.ntt@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
At Tue, 16 Jun 2020 01:46:21 +0900, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> wrote in
> > In short, it is known behavior but it was judged as useless to prevent
> > that.
> > That can happen when checkpointer removes up to the segment that is
> > being read by walsender. I think that that doesn't happen (or
> > happenswithin a narrow time window?) for physical replication but
> > happenes for logical replication.
> > While development, I once added walsender a code to exit for that
> > reason, but finally it is moved to InvalidateObsoleteReplicationSlots
> > as a bit defferent function.
>
> BTW, I read the code of InvalidateObsoleteReplicationSlots() and
> probably
> found some issues in it.
>
> 1. Each cycle of the "for" loop in
> InvalidateObsoleteReplicationSlots()
> emits the log message "terminating walsender ...". This means that
> if it takes more than 10ms for walsender to exit after it's signaled,
> the second and subsequent cycles would happen and output the same
> log message several times. IMO that log message should be output
> only once.
Sounds reasonable.
> 2. InvalidateObsoleteReplicationSlots() uses the loop to scan
> replication
> slots array and uses the "for" loop in each scan. Also it calls
> ReplicationSlotAcquire() for each "for" loop cycle, and
> ReplicationSlotAcquire() uses another loop to scan replication slots
> array. I don't think this is good design.
>
> ISTM that we can get rid of ReplicationSlotAcquire()'s loop because
> InvalidateObsoleteReplicationSlots() already know the index of the
> slot
> that we want to find. The attached patch does that. Thought?
The inner loop is expected to run at most several times per
checkpoint, which won't be a serious problem. However, it is better if
we can get rid of that in a reasonable way.
The attached patch changes the behavior for SAB_Block. Before the
patch, it rescans from the first slot for the same name, but with the
patch it just rechecks the same slot. The only caller of the function
with SAB_Block is ReplicationSlotDrop and I don't come up with a case
where another slot with the same name is created at different place
before the condition variable fires. But I'm not sure the change is
completely safe. Maybe some assertion is needed?
> 3. There is a corner case where the termination of walsender cleans up
> the temporary replication slot while
> InvalidateObsoleteReplicationSlots()
> is sleeping on ConditionVariableTimedSleep(). In this case,
> ReplicationSlotAcquire() is called in the subsequent cycle of the
> "for"
> loop, cannot find the slot and then emits ERROR message. This leads
> to the failure of checkpoint by the checkpointer.
Agreed.
> To avoid this case, if SAB_Inquire is specified,
> ReplicationSlotAcquire()
> should return the special value instead of emitting ERROR even when
> it cannot find the slot. Also InvalidateObsoleteReplicationSlots()
> should
> handle that special returned value.
I thought the same thing hearing that.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Geoghegan | 2020-06-16 05:31:13 | Re: language cleanups in code and docs |
Previous Message | Ashutosh Bapat | 2020-06-16 05:00:20 | Re: factorial of negative numbers |