From: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
---|---|
To: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
Cc: | jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com |
Subject: | Re: [HACKERS] Restricting maximum keep segments by repslots |
Date: | 2020-04-06 22:15:55 |
Message-ID: | 20200406221555.GA16211@alvherre.pgsql |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 2020-Apr-06, Alvaro Herrera wrote:
> Lastly, I noticed that we're now changing the slot's restart_lsn to
> Invalid without being the slot's owner, which goes counter to what is
> said in slot.h:
>
> * - Individual fields are protected by mutex where only the backend owning
> * the slot is authorized to update the fields from its own slot. The
> * backend owning the slot does not need to take this lock when reading its
> * own fields, while concurrent backends not owning this slot should take the
> * lock when reading this slot's data.
>
> What this means is that if the slot owner walsender updates the
> restart_lsn to a newer value just as we (checkpointer) are trying to set
> it to Invalid, the owner's value might persist and our value would be
> lost.
>
> AFAICT if we were really stressed about getting this exactly correct,
> then we would have to kill the walsender, wait for it to die, then
> ReplicationSlotAcquire and *then* update
> MyReplicationSlot->data.restart_lsn.
So I had cold feet about the whole business of trying to write a
non-owned replication slot, so I tried to implemented the "exactly
correct" idea above. That's v25 here.
I think there's a race condition in this: if we kill a walsender and it
restarts immediately before we (checkpoint) can acquire the slot, we
will wait for it to terminate on its own. Fixing this requires changing
the ReplicationSlotAcquire API so that it knows not to wait but not
raise error either (so we can use an infinite loop: "acquire, if busy
send signal")
I also include a separate diff for a change that might or might not be
necessary, where xmins reserved by slots with restart_lsn=invalid are
ignored. I'm not yet sure that we should include this, but we should
keep an eye on it.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
v25-0001-Add-WAL-relief-vent-for-replication-slots.patch | text/x-diff | 23.7 KB |
ignore-xmin-invalidated.patch | text/x-diff | 947 bytes |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2020-04-06 22:16:37 | Re: ERROR: invalid input syntax for type circle |
Previous Message | Tomas Vondra | 2020-04-06 22:13:54 | Re: [PATCH] Incremental sort (was: PoC: Partial sort) |