From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com> |
Cc: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Review for GetWALAvailability() |
Date: | 2020-06-16 18:50:47 |
Message-ID: | 20200616185047.GA19008@alvherre.pgsql |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 2020-Jun-17, Fujii Masao wrote:
> While reading InvalidateObsoleteReplicationSlots() code, I found another issue.
>
> ereport(LOG,
> (errmsg("terminating walsender %d because replication slot \"%s\" is too far behind",
> wspid, NameStr(slotname))));
> (void) kill(wspid, SIGTERM);
>
> wspid indicates the PID of process using the slot. That process can be
> a backend, for example, executing pg_replication_slot_advance().
> So "walsender" in the above log message is not always correct.
Good point.
> int wspid = ReplicationSlotAcquire(NameStr(slotname),
> SAB_Inquire);
>
> Why do we need to call ReplicationSlotAcquire() here and mark the slot as
> used by the checkpointer? Isn't it enough to check directly the slot's
> active_pid, instead?
>
> Maybe ReplicationSlotAcquire() is necessary because
> ReplicationSlotRelease() is called later? If so, why do we need to call
> ReplicationSlotRelease()? ISTM that we don't need to do that if the slot's
> active_pid is zero. No?
I think the point here was that in order to modify the slot you have to
acquire it -- it's not valid to modify a slot you don't own.
> + /*
> + * Signal to terminate the process using the replication slot.
> + *
> + * Try to signal every 100ms until it succeeds.
> + */
> + if (!killed && kill(active_pid, SIGTERM) == 0)
> + killed = true;
> + ConditionVariableTimedSleep(&slot->active_cv, 100,
> + WAIT_EVENT_REPLICATION_SLOT_DROP);
> + } while (ReplicationSlotIsActive(slot, NULL));
Note that here you're signalling only once and then sleeping many times
in increments of 100ms -- you're not signalling every 100ms as the
comment claims -- unless the signal fails, but you don't really expect
that. On the contrary, I'd claim that the logic is reversed: if the
signal fails, *then* you should stop signalling.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2020-06-16 18:59:19 | Re: global barrier & atomics in signal handlers (Re: Atomic operations within spinlocks) |
Previous Message | Alvaro Herrera | 2020-06-16 18:31:43 | Re: Review for GetWALAvailability() |