Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, rootcause000(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
Date: 2023-10-06 04:46:23
Message-ID: CA+hUKG+-2rjGZC2kwqr2NMLBcEBp4uf59QT1advbWYF_uc+0Aw@mail.gmail.com
Lists: pgsql-bugs

On Fri, Oct 6, 2023 at 9:18 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Fri, Oct 6, 2023 at 8:55 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> > ... DELAY_CHKPT_COMPLETE ...
>
> About that... If the lights go out after the truncation and the
> delayed logging of the checkpoint, how do we know the truncation has
> actually reached the disk? mdtruncate() enqueues fsync() calls, but
> if we were already in phase 2 (see proc.h) of a checkpoint at that
> moment, they might be processed by the *next* checkpoint, not the one
> whose phase 3 we've carefully delayed there, no?

I didn't look into this for very long, so I might be missing something
here, but I think there could be at least one bad sequence. If you
insert a couple of sleeps to jinx the scheduling, and hack
RelationTruncate() to request a checkpoint at a carefully chosen wrong
moment (see the attached hack.diff; a rough sketch follows the session
below), then:

postgres=# create table t (i int);
CREATE TABLE
postgres=# insert into t select 1 from generate_series(1, 100);
INSERT 0 100
postgres=# checkpoint; -- checkpoint #1 just puts some data on disk
CHECKPOINT
postgres=# delete from t;
DELETE 100
postgres=# vacuum freeze t; -- truncates, starts unlucky checkpoint #2
VACUUM
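
(hack.diff is attached rather than inlined. Purely as an illustration of
the kind of change described above -- a hypothetical sketch, not the
actual patch -- it amounts to something like this inside
RelationTruncate(), between the WAL logging and the smgrtruncate()
call:)

    /*
     * Hypothetical sketch, NOT the attached hack.diff: wake the
     * checkpointer so that it chooses a redo LSN after our truncate
     * record, then sleep long enough for it to get through
     * CheckPointBuffers() and ProcessSyncRequests() before this
     * backend reaches ftruncate() and RegisterSyncRequest().  No
     * CHECKPOINT_WAIT, since we're still holding DELAY_CHKPT_COMPLETE.
     */
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE);
    pg_usleep(5 * 1000 * 1000L);    /* give the checkpointer a head start */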

If you trace the checkpointer's system calls you will see that
base/5/16384 (or whatever t's relfilenode is for you) is *not* fsync'd
by checkpoint #2. The following checkpoint #3 might eventually do it,
but if the system loses power after checkpoint #2 completes and there
is no checkpoint #3, the kernel might forget the truncation, and yet
replay starts too late to redo it. I think that bad sequence looks
like this:

P1: log truncate
P2: choose redo LSN
P1: DropRelationBuffers()
P2: CheckPointBuffers()
P2: ProcessSyncRequests()
P1: ftruncate()
P1: RegisterSyncRequest()
P2: log checkpoint
*** system loses power ***
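
To see why the request can only be honoured by the *next* checkpoint,
here's a throwaway toy model of the queue behaviour (plain C, my own
illustration rather than PostgreSQL code; register_sync_request() and
process_sync_requests() are just stand-ins for RegisterSyncRequest()
and ProcessSyncRequests()):

    #include <stdio.h>

    /* toy shared queue of pending fsync requests */
    static const char *pending[8];
    static int  npending = 0;

    static void
    register_sync_request(const char *path)
    {
        pending[npending++] = path;
    }

    static void
    process_sync_requests(void)
    {
        for (int i = 0; i < npending; i++)
            printf("checkpoint fsyncs %s\n", pending[i]);
        npending = 0;
    }

    int
    main(void)
    {
        /* P1: log truncate; P2: choose redo LSN; P1: DropRelationBuffers() */
        /* P2: CheckPointBuffers(), then: */
        process_sync_requests();    /* queue still empty, nothing fsync'd */

        /* P1: ftruncate(), then: */
        register_sync_request("base/5/16384");

        /* P2: log checkpoint -- it completes with the request still queued */
        printf("checkpoint #2 done; %d request(s) left for checkpoint #3\n",
               npending);
        return 0;
    }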

I realise it is a different problem than the one reported, but it's
close. My initial thought is that perhaps we shouldn't allow a redo
LSN to be chosen until the sync request is registered, which is also
fairly close to the critical section boundaries being discussed for
the ftruncate() error case. But that's not a phase the checkpoint delay
machinery currently knows how to delay. And there may well be better
ways...
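
Just to make that thought concrete, a sketch using a made-up flag name
(nothing like DELAY_CHKPT_REDO exists today, and CreateCheckPoint()
would have to learn to wait for it before computing its redo LSN):
RelationTruncate() would bracket the whole operation so that no
checkpoint can pick a redo LSN in the middle of it:

    /* hypothetical flag; not part of the existing machinery */
    MyProc->delayChkptFlags |= DELAY_CHKPT_REDO;

    /* existing code: WAL-log the truncation, then smgrtruncate(),
     * which does DropRelationBuffers(), ftruncate() and
     * RegisterSyncRequest() */

    MyProc->delayChkptFlags &= ~DELAY_CHKPT_REDO;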

Attachment Content-Type Size
hack.diff text/x-patch 1.7 KB
