Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, rootcause000(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
Date: 2023-10-11 16:59:39
Message-ID: CA+TgmoZFjh3Ui9n77quMS278si_sWJNnRkLKDzzXDS-0v=SYFA@mail.gmail.com
Lists: pgsql-bugs

On Fri, Oct 6, 2023 at 12:50 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> If you trace the checkpointer's system calls you will see that
> base/5/16384 (or whatever t's relfilenode is for you) is *not* fsync'd
> by checkpoint #2. The following checkpoint #3 might eventually do it,
> but if the kernel loses power after checkpoint #2 completes and there
> is no checkpoint #3, the kernel might forget the truncation, and yet
> replay starts too late to redo it. I think that bad sequence looks
> like this:
>
> P1: log truncate
> P2: choose redo LSN
> P1: DropRelationBuffers()
> P2: CheckPointBuffers()
> P2: ProcessSyncRequests()
> P1: ftruncate()
> P1: RegisterSyncRequest()
> P2: log checkpoint
> *** system loses power ***
>
> I realise it is a different problem than the one reported, but it's
> close. My initial thought is that perhaps we shouldn't allow a redo
> LSN to be chosen until the sync request is registered, which is also
> fairly close to the critical section boundaries being discussed for
> ftruncate() error case. But that's not a phase the checkpoint delay
> machinery currently knows how to delay. And there may well be better
> ways...

Hmm. So the problem in this case is that we not only need the WAL to
be inserted and the filesystem operation to be performed in the same
checkpoint cycle, but we also need the sync request to be processed in
that checkpoint cycle. In this example, the WAL insertion happens
before the redo LSN, so the truncate on disk must happen before the
checkpoint completes, which is guaranteed, and the sync request must
also be processed before the checkpoint completes, which is not
guaranteed. We only guarantee that the sync request goes into the
checkpointer's queue before the checkpoint completes, not that it gets
processed before the checkpoint completes.

Suppose that RelationTruncate set both DELAY_CHKPT_START and
DELAY_CHKPT_COMPLETE. I think that would prevent this problem. P2
could still choose the redo LSN after P1 logged the truncate, but it
wouldn't then be able to reach CheckPointBuffers() until after P1 had
reached RegisterSyncRequest(). Note that setting *only*
DELAY_CHKPT_START isn't good enough, because then we can get this
history:

P1: log truncate
P2: choose redo LSN
P2: CheckPointBuffers()
P1: DropRelationBuffers()
P2: ProcessSyncRequests()
P2: log checkpoint
*** system loses power ***
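
Concretely, the both-flags idea would look something like this in
RelationTruncate() -- an untested sketch just to show where the flags
would sit, not the committed code:

    /* hold off both checkpoint start and checkpoint completion */
    Assert((MyProc->delayChkptFlags &
            (DELAY_CHKPT_START | DELAY_CHKPT_COMPLETE)) == 0);
    MyProc->delayChkptFlags |= DELAY_CHKPT_START | DELAY_CHKPT_COMPLETE;

    /* WAL-log the truncation and flush it, as RelationTruncate
     * already does today */
    ...

    /* drop buffers and ftruncate(); this path ends by calling
     * RegisterSyncRequest() via register_dirty_segment() */
    smgrtruncate(RelationGetSmgr(rel), forks, nforks, blocks);

    MyProc->delayChkptFlags &= ~(DELAY_CHKPT_START | DELAY_CHKPT_COMPLETE);

With the START bit held across RegisterSyncRequest(), the checkpointer
can't reach CheckPointBuffers()/ProcessSyncRequests() in the problem
window, and with the COMPLETE bit held it can't log the checkpoint
record before the ftruncate() has happened.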

I think, in general, that DELAY_CHKPT_START is useful for cases where
CheckPointGuts() is going to flush something that might not be locked
at the time we WAL-log a modification to it. For instance,
MarkBufferDirtyHint() wants to log that the buffer will be dirty while
holding only a shared lock on the buffer. RecordTransactionCommit() and
RecordTransactionCommitPrepared() write WAL that necessitates CLOG
updates, but they haven't yet locked the relevant CLOG buffers.
DELAY_CHKPT_COMPLETE, on the other hand, is useful for cases where some
operation needs to get completed before the redo pointer gets moved.
What I think we missed in 412ad7a55639516f284cd0ef9757d6ae5c7abd43 is
that RelationTruncate actually falls into both categories -- the
ftruncate() needs to be protected by DELAY_CHKPT_COMPLETE so that it
happens before the redo pointer moves, but the RegisterSyncRequest()
needs to be protected by DELAY_CHKPT_START so that it gets "flushed"
by ProcessSyncRequests.
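
For anyone following along, my mental model of the checkpointer side is
roughly this (paraphrasing CreateCheckPoint() from memory, so the
details are approximate):

    /* after fixing the redo LSN: wait out backends that currently
     * have DELAY_CHKPT_START set */
    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
    while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids, DELAY_CHKPT_START))
        pg_usleep(10000L);

    /* flush everything, including ProcessSyncRequests() */
    CheckPointGuts(checkPoint.redo, flags);

    /* before logging the checkpoint record: wait out backends that
     * currently have DELAY_CHKPT_COMPLETE set */
    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
    while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids, DELAY_CHKPT_COMPLETE))
        pg_usleep(10000L);

    /* ... XLogInsert() the checkpoint record, update pg_control ... */

So the START bit gates CheckPointGuts() (and hence
ProcessSyncRequests()), while the COMPLETE bit only gates the
checkpoint record itself -- which is why RelationTruncate seems to
need both.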

I'm not quite sure that this idea closes all the holes, though. The
delay-checkpoint mechanism is quite hard to reason about. I spent a
while this morning trying to think up something better, but in the end
realized I'd just reinvented the wheel.

--
Robert Haas
EDB: http://www.enterprisedb.com
