Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, rootcause000(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject:	Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
Date:	2025-04-06 18:16:13
Message-ID:	20250406181613.61.nmisch@google.com
Lists:	pgsql-bugs

On Fri, Sep 06, 2024 at 11:05:26AM +1200, Thomas Munro wrote:
> Subject: [PATCH v3 1/3] RelationTruncate() must set DELAY_CHKPT_START.
>
> Previously, it set only DELAY_CHKPT_COMPLETE. That was important,
> because it meant that if the XLOG_SMGR_TRUNCATE record preceded a
> XLOG_CHECKPOINT_ONLINE record in the WAL, then the truncation would also
> happen on disk before the XLOG_CHECKPOINT_ONLINE record was
> written.

I read this commit (75818b3) with intense interest, mostly to see whether it
found a rule that commit 8e7e672 needs to follow and doesn't (it found none).
I also wanted to find or construct an example of interleaved events where
DELAY_CHKPT_COMPLETE is still necessary. Do any of you have such an example?
Here are my unsuccessful attempts:

==== Attempt: from upthread

On Wed, Oct 11, 2023 at 12:59:39PM -0400, Robert Haas wrote:
> Suppose that RelationTruncate set both DELAY_CHKPT_START and
> DELAY_CHKPT_COMPLETE. I think that would prevent this problem. P2
> could still choose the redo LSN after P1 logged the truncate, but it
> wouldn't then be able to reach CheckPointBuffers() until after P1 had
> reached RegisterSyncRequest. Note that setting *only*
> DELAY_CHKPT_START isn't good enough, because then we can get this
> history:
>
> P1: log truncate
> P2: choose redo LSN
> P2: CheckPointBuffers()
> P1: DropRelationBuffers()
> P2: ProcessSyncRequests()
> P2: log checkpoint
> *** system loses power ***

If "choose redo LSN" happens at the point shown there, DELAY_CHKPT_START won't
allow CheckPointBuffers() at the point shown there, so this interleaving
doesn't happen.
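
To make that rule concrete, here is a toy model in Python, not PostgreSQL code. It assumes the checkpointer notes, when it chooses the redo LSN, whether P1 holds DELAY_CHKPT_START, and if so may not run CheckPointBuffers() until P1 releases the flag. The function name, the event strings, and the explicit "DELAY_CHKPT_START enter" event added to the quoted history are all illustrative assumptions.

```python
def first_violation(events):
    """Return the index of the first event the DELAY_CHKPT_START rule
    forbids in this toy model, or None if the interleaving is allowed."""
    holding = False    # does P1 currently hold DELAY_CHKPT_START?
    must_wait = False  # did P2 see the flag held when it chose the redo LSN?
    for i, (_proc, what) in enumerate(events):
        if what == "DELAY_CHKPT_START enter":
            holding = True
        elif what == "DELAY_CHKPT_START exit":
            holding = False
            must_wait = False
        elif what == "choose redo LSN":
            must_wait = holding
        elif what == "CheckPointBuffers()" and must_wait:
            return i
    return None


# The quoted history, with the flag acquisition made explicit (assumption).
upthread = [
    ("P1", "DELAY_CHKPT_START enter"),
    ("P1", "log truncate"),
    ("P2", "choose redo LSN"),
    ("P2", "CheckPointBuffers()"),
    ("P1", "DropRelationBuffers()"),
    ("P2", "ProcessSyncRequests()"),
    ("P2", "log checkpoint"),
]

# Contrast: the flag is taken only after the redo LSN was chosen.
late_flag = [
    ("P2", "choose redo LSN"),
    ("P2", "CheckPointBuffers()"),
    ("P1", "DELAY_CHKPT_START enter"),
    ("P1", "log truncate"),
]

print(first_violation(upthread))   # 3: CheckPointBuffers() would have to wait
print(first_violation(late_flag))  # None: nothing for the checkpointer to wait on
```

In this model the quoted history is rejected at CheckPointBuffers(), while a flag taken after the redo LSN is chosen does not hold the checkpoint back.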

==== Attempt: XLOG_SMGR_TRUNCATE+DropRelationBuffers() just after checkpoint waits for DELAY_CHKPT_START

P2: choose redo LSN 1
P2: CheckPointBuffers() enter
P1: DELAY_CHKPT_START enter
P1: log truncate
P1: DropRelationBuffers()
P2: CheckPointBuffers() actual flushes
P2: ProcessSyncRequests()
P2: log checkpoint
P2: choose redo LSN 2 (next checkpoint)
*** system loses power, below were still in future ***
P1: ftruncate()
P1: RegisterSyncRequest()
P1: DELAY_CHKPT_START exit

DropRelationBuffers() made CheckPointBuffers() do less work, and the validity
of the checkpoint is now tied to truncate happening eventually. However, if
XLOG_CHECKPOINT_ONLINE reaches disk, that implies XLOG_SMGR_TRUNCATE reached
disk first. Any recovery starting at or before LSN 1 will redo
XLOG_SMGR_TRUNCATE. Data integrity is fine.

==== Attempt: replay finding "older contents than expected"

A key RelationTruncate() code comment mentions this:

* First, the truncation operation might drop buffers that the checkpoint
* otherwise would have flushed. If it does, then it's essential that the
* files actually get truncated on disk before the checkpoint record is
* written. Otherwise, if replay begins from that checkpoint, the
* to-be-truncated blocks might still exist on disk but have older
* contents than expected, which can cause replay to fail. It's OK for the
* blocks to not exist on disk at all, but not for them to have the wrong
* contents. For this reason, we need to set DELAY_CHKPT_COMPLETE while
* this code executes.

However, I can't see how to make that happen. RelationTruncate() has
AccessExclusiveLock on the relation, so no other WAL records for the truncated
block range are happening while we hold DELAY_CHKPT_START. In the previous
attempt, any WAL records for the truncated block range must be before
XLOG_SMGR_TRUNCATE (such records tolerate older content) or after
XLOG_CHECKPOINT_ONLINE (since RelationTruncate() held AccessExclusiveLock at
least that late).

==== Attempt: all truncate steps just after XLOG_CHECKPOINT_REDO

P1: DELAY_CHKPT_START enter
P2: choose redo LSN
P1: log truncate
P1: DropRelationBuffers()
P1: ftruncate()
P1: RegisterSyncRequest()
P1: DELAY_CHKPT_START exit
P2: CheckPointBuffers()
P2: ProcessSyncRequests()
P2: log checkpoint
*** system loses power ***

Recovery sees the XLOG_CHECKPOINT_ONLINE, which implies it also sees and
replays the XLOG_SMGR_TRUNCATE. Data integrity is fine.

==== Attempt: XLOG_SMGR_TRUNCATE before XLOG_CHECKPOINT_REDO

P1: DELAY_CHKPT_START enter
P1: log truncate
P2: choose redo LSN
P1: DropRelationBuffers()
P1: ftruncate()
P1: RegisterSyncRequest()
P1: DELAY_CHKPT_START exit
P2: CheckPointBuffers()
P2: ProcessSyncRequests()
P2: log checkpoint
*** system loses power ***

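XLOG_SMGR_TRUNCATE precedes the redo point, so recovery does not replay it.
It does not need to: ftruncate() and RegisterSyncRequest() both finished
before ProcessSyncRequests(), so the truncation was on disk before
XLOG_CHECKPOINT_ONLINE was written. Data integrity is fine.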
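
The recovery reasoning used in these attempts can be sketched as a small Python model, again not PostgreSQL code. It assumes that every WAL record logged before the crash is durable, that recovery starts at the redo LSN of the last checkpoint whose record was logged, and that a truncation not replayed must already have been done with ftruncate() before that checkpoint record. The function name, event strings, and result labels are illustrative assumptions.

```python
def truncate_fate(events):
    """Classify what a crash at the end of `events` means for the truncation:
    'replayed' (recovery redoes XLOG_SMGR_TRUNCATE), 'on disk' (not replayed,
    but already truncated before the checkpoint record), or 'lost'."""
    names = [what for _proc, what in events]
    trunc = names.index("log truncate")
    start, redo, ckpt = 0, None, None  # start 0 stands for an older checkpoint
    for i, what in enumerate(names):
        if what == "choose redo LSN":
            redo = i
        elif what == "log checkpoint" and redo is not None:
            start, ckpt = redo, i
    if trunc >= start:
        return "replayed"
    # The truncate record precedes the redo point, so only the disk state counts.
    return "on disk" if "ftruncate()" in names[:ckpt] else "lost"


P1_BODY = ["DropRelationBuffers()", "ftruncate()", "RegisterSyncRequest()",
           "DELAY_CHKPT_START exit"]
P2_TAIL = ["CheckPointBuffers()", "ProcessSyncRequests()", "log checkpoint"]

def history(*names):
    # Process labels are irrelevant to this model, so use a placeholder.
    return [("P?", n) for n in names]

after_redo = history("DELAY_CHKPT_START enter", "choose redo LSN",
                     "log truncate", *P1_BODY, *P2_TAIL)
before_redo = history("DELAY_CHKPT_START enter", "log truncate",
                      "choose redo LSN", *P1_BODY, *P2_TAIL)
# The upthread history that DELAY_CHKPT_START rules out.
unprotected = history("log truncate", "choose redo LSN", "CheckPointBuffers()",
                      "DropRelationBuffers()", "ProcessSyncRequests()",
                      "log checkpoint")

print(truncate_fate(after_redo))   # replayed
print(truncate_fate(before_redo))  # on disk
print(truncate_fate(unprotected))  # lost
```

Under these assumptions, only the history that DELAY_CHKPT_START already prevents loses the truncation, which is consistent with none of the attempts above needing DELAY_CHKPT_COMPLETE.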

In response to

Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows at 2024-09-05 23:05:26 from Thomas Munro
