From: | Jehan-Guillaume de Rorthais <jgdr(at)dalibo(dot)com> |
---|---|
To: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
Cc: | andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com |
Subject: | Re: [HACKERS] Restricting maximum keep segments by repslots |
Date: | 2020-01-22 16:47:23 |
Message-ID: | 20200122174723.4a26ca12@firost |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
First, it seems you did not reply to Alvaro's concerns in your new set of
patch. See:
https://www.postgresql.org/message-id/20190917195800.GA16694%40alvherre.pgsql
On Tue, 24 Dec 2019 21:26:14 +0900 (JST)
Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote:
[...]
> > Indeed, "loosing" is a better match for this state.
> >
> > However, what's the point of this state from the admin point of view? In
> > various situation, the admin will have no time to react immediately and fix
> > whatever could help.
> >
> > How useful is this specific state?
>
> If we assume "losing" segments as "lost", a segment once "lost" can
> return to "keeping" or "streaming" state. That is intuitively
> impossible. On the other hand if we assume it as "keeping", it should
> not be removed by the next checkpoint but actually it can be
> removed. The state "losing" means such a unstable state different from
> both "lost" and "keeping".
OK, indeed.
But I'm still unconfortable with this "unstable" state. It would be better if
we could grab a stable state: either "keeping" or "lost".
> > + <entry>Availability of WAL records claimed by this
> > + slot. <literal>streaming</literal>, <literal>keeping</literal>,
> >
> > Slots are keeping WALs, not WAL records. Shouldn't it be "Availability of
> > WAL files claimed by this slot"?
>
> I choosed "record" since a slot points a record. I'm not sure but I'm
> fine with "file". Fixed catalogs.sgml and config.sgml that way.
Thanks.
> > + <literal>streaming</literal> means that the claimed records are
> > + available within max_wal_size. <literal>keeping</literal> means
> >
> > I wonder if streaming is the appropriate name here. The WALs required might
> > be available for streaming, but the slot not active, thus not "streaming".
> > What about merging with the "active" field, in the same fashion as
> > pg_stat_activity.state? We would have an enum "pg_replication_slots.state"
> > with the following states:
> > * inactive: non active slot
> > * active: activated, required WAL within max_wal_size
> > * keeping: activated, max_wal_size is exceeded but still held by replication
> > slots or wal_keep_segments.
> > * lost: some WAL are definitely lost
> >
> > Thoughts?
>
> In the first place, I realized that I was missed a point about the
> relationship between max_wal_size and max_slot_wal_keep_size
> here. Since the v15 of this patch, GetLsnAvailablity returns
> "streaming" when the restart_lsn is within max_wal_size. That behavior
> makes sense when max_slot_wal_keep_size > max_wal_size. However, in
> the contrary case, restart_lsn could be lost even it is
> withinmax_wal_size. So we would see "streaming" (or "normal") even
> though restart_lsn is already lost. That is broken.
>
> In short, the "streaming/normal" state is useless if
> max_slot_wal_keep_size < max_wal_size.
Good catch!
> Finally I used the following wordings.
>
> (there's no "inactive" wal_state)
>
> * normal: required WAL within max_wal_size when max_slot_wal_keep_size
> is larger than max_wal_size.
> * keeping: required segments are held by replication slots or
> wal_keep_segments.
>
> * losing: required segments are about to be removed or may be already
> removed but streaming is not dead yet.
As I wrote, I'm still uncomfortable with this state. Maybe we should ask
other reviewers opinions on this.
[...]
> > WARNING: some replication slots have lost required WAL segments
> > DETAIL: Slot slot_limit_st lost 177 segment(s)
> >
> > I wonder if this is useful to show these messages for slots that were
> > already dead before this checkpoint?
>
> Makes sense. I changed KeepLogSeg so that it emits the message only on
> slot_names changes.
Thanks.
Bellow some code review.
In regard with FindOldestXLogFileSegNo(...):
> /*
> * Return the oldest WAL segment file.
> *
> * The returned value is XLogGetLastRemovedSegno() + 1 when the function
> * returns a valid value. Otherwise this function scans over WAL files and
> * finds the oldest segment at the first time, which could be very slow.
> */
> XLogSegNo
> FindOldestXLogFileSegNo(void)
The comment is not clear to me. I suppose "at the first time" might better be
expressed as "if none has been removed since last startup"?
Moreover, what about patching XLogGetLastRemovedSegno() itself instead of
adding a function?
In regard with GetLsnAvailability(...):
> /*
> * Detect availability of the record at given targetLSN.
> *
> * targetLSN is restart_lsn of a slot.
Wrong argument name. It is called "restart_lsn" in the function
declaration.
> * restBytes is the pointer to uint64 variable, to store the remaining bytes
> * until the slot goes into "losing" state.
I'm not convinced with this argument name. What about "remainingBytes"? Note
that you use remaining_bytes elsewhere in your patch.
> * -1 is stored to restBytes if the values is useless.
What about returning a true negative value when the slot is really lost?
All in all, I feel like this function is on the fence between being generic
because of its name and being slot-only oriented because of the first parameter
name, use of "max_slot_wal_keep_size_mb", returned status and "slotPtr".
I wonder if it should be more generic and stay here or move to xlogfuncs.c with
a more specific name?
> * slot limitation is not activated, WAL files are kept unlimitedlllly
"unlimitedly"? "infinitely"? "unconditionally"?
> /* it is useless for the states below */
> *restBytes = -1;
This might be set to the real bytes kept, even if status is "losing".
> * The segment is alrady lost or being lost. If the oldest segment is just
"already"
> if (oldestSeg == restartSeg + 1 && walsender_pid != 0)
> return "losing";
I wonder if this should be "oldestSeg > restartSeg"?
Many segments can be removed by the next or running checkpoint. And a running
walsender can send more than one segment in the meantime I suppose?
In regard with GetOldestKeepSegment(...):
> static XLogSegNo
> GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN,
> XLogRecPtr targetLSN, int64 *restBytes)
I wonder if minSlotLSN is really useful as a parameter or if it should be
fetched from GetOldestKeepSegment() itself? Currently,
XLogGetReplicationSlotMinimumLSN() is always called right before
GetOldestKeepSegment() just to fill this argument.
> walstate =
> GetLsnAvailability(restart_lsn, active_pid, &remaining_bytes);
I agree with Alvaro: we might want to return an enum and define the related
state string here. Or, if we accept negative remaining_bytes, GetLsnAvailability
might even only return remaining_bytes and we deduce the state directly from
here.
Regards,
From | Date | Subject | |
---|---|---|---|
Next Message | Maciek Sakrejda | 2020-01-22 16:54:40 | Re: Duplicate Workers entries in some EXPLAIN plans |
Previous Message | Alvaro Herrera | 2020-01-22 16:37:36 | Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions |