From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> |
Subject: | Re: Replication slot stats misgivings |
Date: | 2021-03-25 06:05:51 |
Message-ID: | CAD21AoAZ8aPHmCY+rcKRcB6680qUsgNU5f+X=w31xjM6WgDHVw@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Mar 24, 2021 at 7:06 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Mar 23, 2021 at 10:54 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > On 2021-03-23 23:37:14 +0900, Masahiko Sawada wrote:
> >
> > > > > Maybe we can compare the slot name in the
> > > > > received message to the name in the element of replSlotStats. If they
> > > > > don’t match, we swap entries in replSlotStats to synchronize the index
> > > > > of the replication slot in ReplicationSlotCtl->replication_slots and
> > > > > replSlotStats. If we cannot find the entry in replSlotStats that has
> > > > > the name in the received message, it probably means either it's a new
> > > > > slot or the previous create message is dropped, we can create the new
> > > > > stats for the slot. Is that what you mean, Andres?
> >
> > That doesn't seem great. Slot names are imo a poor identifier for
> > something happening asynchronously. The stats collector regularly
> > doesn't process incoming messages for periods of time because it is busy
> > writing out the stats file. That's also when messages to it are most
> > likely to be dropped (likely because the incoming buffer is full).
> >
>
> Leaving aside restart case, without some sort of such sanity checking,
> if both drop (of old slot) and create (of new slot) messages are lost
> then we will start accumulating stats in old slots. However, if only
> one of them is lost then there won't be any such problem.
>
> > Perhaps we could have RestoreSlotFromDisk() send something to the stats
> > collector ensuring the mapping makes sense?
> >
>
> Say if we send just the index location of each slot then probably we
> can setup replSlotStats. Now say before the restart if one of the drop
> messages was missed (by stats collector) and that happens to be at
> some middle location, then we would end up restoring some already
> dropped slot, leaving some of the still required ones. However, if
> there is some sanity identifier like name along with the index, then I
> think that would have worked for such a case.
Couldn't even such messages be lost? Given that any message can be
lost over UDP, I think we cannot rely on a single
message. Instead, I think we need to loosely synchronize the indexes,
assuming that the indexes in replSlotStats and
ReplicationSlotCtl->replication_slots may be out of sync.
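To illustrate what I mean by loose synchronization, here is a rough standalone sketch, not a patch. It assumes every stats message carries both the slot's index and its name, and that the collector repairs its own array whenever the two disagree. SlotStatsEntry, slot_stats, lookup_slot_stats, MAX_SLOTS and NAME_LEN are all hypothetical names made up for this example; they are not existing PostgreSQL symbols.

```c
#include <assert.h>
#include <string.h>

#define MAX_SLOTS 8     /* hypothetical stand-in for max_replication_slots */
#define NAME_LEN  64    /* hypothetical stand-in for NAMEDATALEN */

/* Hypothetical collector-side entry; an empty name means "unused". */
typedef struct SlotStatsEntry
{
    char name[NAME_LEN];
    long spill_txns;    /* example counter */
} SlotStatsEntry;

SlotStatsEntry slot_stats[MAX_SLOTS];

/*
 * Return the stats entry for the slot reported as (idx, name).
 * The index is only a hint; the name is used as a sanity check.
 */
SlotStatsEntry *
lookup_slot_stats(int idx, const char *name)
{
    assert(name != NULL && name[0] != '\0');

    if (idx < 0 || idx >= MAX_SLOTS)
        return NULL;

    /* Common case: index and name agree. */
    if (strcmp(slot_stats[idx].name, name) == 0)
        return &slot_stats[idx];

    /* The name lives at another index: swap so the indexes line up again. */
    for (int i = 0; i < MAX_SLOTS; i++)
    {
        if (i != idx && strcmp(slot_stats[i].name, name) == 0)
        {
            SlotStatsEntry tmp = slot_stats[idx];

            slot_stats[idx] = slot_stats[i];
            slot_stats[i] = tmp;
            return &slot_stats[idx];
        }
    }

    /*
     * Name not found: a new slot, or its create message was lost.
     * Assumption: whatever sits at idx is stale and can be overwritten;
     * a real implementation would have to decide what to do with it.
     */
    memset(&slot_stats[idx], 0, sizeof(SlotStatsEntry));
    strncpy(slot_stats[idx].name, name, NAME_LEN - 1);
    return &slot_stats[idx];
}
```

With something like this, no single message has to arrive: a lost drop or create message is corrected the next time any message for that slot is processed. It still depends on the name, though, so a slot dropped and re-created under the same name while both messages are lost would keep accumulating into the old entry.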
>
> I think it would have been easier if we would have some OID type of
> identifier for each slot. But, without that may be index location of
> ReplicationSlotCtl->replication_slots and slotname combination can
> reduce the chances of slot stats go wrong quite less even if not zero.
> If not name, do we have anything else in a slot that can be used for
> some sort of sanity checking?
I don't see any useful information in a slot for sanity checking.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/