From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Cc: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, jgdr(at)dalibo(dot)com, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com |
Subject: | Re: [HACKERS] Restricting maximum keep segments by repslots |
Date: | 2020-05-17 01:00:05 |
Message-ID: | 20200517010005.jzaf2245w4rrgs2o@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2020-04-29 18:58:16 -0400, Alvaro Herrera wrote:
> On 2020-Apr-28, Alvaro Herrera wrote:
>
> > On 2020-Apr-28, Kyotaro Horiguchi wrote:
> >
> > > > Anyway I think this patch should fix it also -- instead of adding a new
> > > > flag, we just rely on the existing flags (since do_checkpoint must have
> > > > been set correctly from the flags earlier in that block.)
> > >
> > > Since the added (!do_checkpoint) check is reached with
> > > do_checkpoint=false at server start and at archive_timeout intervals,
> > > the patch makes checkpointer run a busy-loop at that timings, and that
> > > loop lasts until a checkpoint is actually executed.
> > >
> > > What we need to do here is not forgetting the fact that the latch has
> > > been set even if the latch itself gets reset before reaching to
> > > WaitLatch.
> >
> > After a few more false starts :-) I think one easy thing we can do
> > without the additional boolean flag is to call SetLatch there in the
> > main loop if we see that ckpt_flags is nonzero.
>
> I went back to "continue" instead of SetLatch, because it seems less
> wasteful, but I changed the previously "do_checkpoint" condition to
> rechecking ckpt_flags. We would not get in the busy loop in that case,
> because the condition is true when the next loop would take action and
> false otherwise. So I think this should fix the problem without causing
> any other issues. But if you do see problems with this, please let us
> know.
I don't think this is quite sufficient:
I, independent of this patch, added a few additional paths in which
checkpointer's latch is reset, and I found a few shutdowns in regression
tests to be extremely slow / timing out. The reason for that is that
the only check for interrupts is at the top of the loop. So if
checkpointer gets SIGUSR2 we don't see ShutdownRequestPending until we
decide to do a checkpoint for other reasons.
I also suspect that it could have harmful consequences to not do a
AbsorbSyncRequests() if something "ate" the set latch.
I don't think it's reasonable to expect this much code between a
ResetLatch and WaitLatch to never reset a latch. So I think we need to
make the coding more robust in face of that. Without having to duplicate
the top and the bottom of the loop.
One way to do that would be to WaitLatch() call to much earlier, and
only do a WaitLatch() if do_checkpoint is false. Roughly like in the
attached.
Greetings,
Andres Freund
Attachment | Content-Type | Size |
---|---|---|
checkpointer-latch-bug.diff | text/x-diff | 3.3 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2020-05-17 01:08:57 | Re: pg_stat_wal_receiver and flushedUpto/writtenUpto |
Previous Message | Michael Paquier | 2020-05-17 00:32:03 | Re: pg_bsd_indent and -Wimplicit-fallthrough |