Quick Links

Re: [HACKERS] Restricting maximum keep segments by repslots

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, jgdr(at)dalibo(dot)com, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com
Subject:	Re: [HACKERS] Restricting maximum keep segments by repslots
Date:	2020-05-17 01:00:05
Message-ID:	20200517010005.jzaf2245w4rrgs2o@alap3.anarazel.de
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2020-04-29 18:58:16 -0400, Alvaro Herrera wrote:
> On 2020-Apr-28, Alvaro Herrera wrote:
>
> > On 2020-Apr-28, Kyotaro Horiguchi wrote:
> >
> > > > Anyway I think this patch should fix it also -- instead of adding a new
> > > > flag, we just rely on the existing flags (since do_checkpoint must have
> > > > been set correctly from the flags earlier in that block.)
> > >
> > > Since the added (!do_checkpoint) check is reached with
> > > do_checkpoint=false at server start and at archive_timeout intervals,
> > > the patch makes checkpointer run a busy-loop at that timings, and that
> > > loop lasts until a checkpoint is actually executed.
> > >
> > > What we need to do here is not forgetting the fact that the latch has
> > > been set even if the latch itself gets reset before reaching to
> > > WaitLatch.
> >
> > After a few more false starts :-) I think one easy thing we can do
> > without the additional boolean flag is to call SetLatch there in the
> > main loop if we see that ckpt_flags is nonzero.
>
> I went back to "continue" instead of SetLatch, because it seems less
> wasteful, but I changed the previously "do_checkpoint" condition to
> rechecking ckpt_flags. We would not get in the busy loop in that case,
> because the condition is true when the next loop would take action and
> false otherwise. So I think this should fix the problem without causing
> any other issues. But if you do see problems with this, please let us
> know.

I don't think this is quite sufficient:
I, independent of this patch, added a few additional paths in which
checkpointer's latch is reset, and I found a few shutdowns in regression
tests to be extremely slow / timing out. The reason for that is that
the only check for interrupts is at the top of the loop. So if
checkpointer gets SIGUSR2 we don't see ShutdownRequestPending until we
decide to do a checkpoint for other reasons.

I also suspect that it could have harmful consequences to not do a
AbsorbSyncRequests() if something "ate" the set latch.

I don't think it's reasonable to expect this much code between a
ResetLatch and WaitLatch to never reset a latch. So I think we need to
make the coding more robust in face of that. Without having to duplicate
the top and the bottom of the loop.

One way to do that would be to WaitLatch() call to much earlier, and
only do a WaitLatch() if do_checkpoint is false. Roughly like in the
attached.

Greetings,

Andres Freund

Attachment	Content-Type	Size
checkpointer-latch-bug.diff	text/x-diff	3.3 KB

In response to

Re: [HACKERS] Restricting maximum keep segments by repslots at 2020-04-29 22:58:16 from Alvaro Herrera

Responses

Re: [HACKERS] Restricting maximum keep segments by repslots at 2020-05-17 02:51:50 from Alvaro Herrera

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2020-05-17 01:08:57	Re: pg_stat_wal_receiver and flushedUpto/writtenUpto
Previous Message	Michael Paquier	2020-05-17 00:32:03	Re: pg_bsd_indent and -Wimplicit-fallthrough