Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, rootcause000(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
Date: 2023-10-04 22:44:14
Message-ID: ZR3qvrYULJWaUnBK@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Thu, Oct 05, 2023 at 10:12:27AM +1300, Thomas Munro wrote:
> On Thu, Oct 5, 2023 at 3:26 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I'm too lazy to check the commit log right now, but I think
>> we did implement a fix for that (ie, flush dirty pages even
>> if we anticipate them going away due to truncation). But as
>> Laurenz says, v10 is out of support and possibly didn't get
>> that fix. Even if it did, you'd need to be running one of
>> the last minor releases, because this wasn't very long ago.
>
> This thread seems to be saying otherwise:
>
> https://www.postgresql.org/message-id/flat/2348.1544474335%40sss.pgh.pa.us

Yeah, that's not been solved. I've put my mind on this problem for a
few hours last May, just before PGCon, and there was an argument that
marking all the pages as dirty is kind of a waste of resources as it
would create WAL for data that's going to be gone a little bit later
as an effect of the truncate, leading to an extra burst of I/O
especially for large truncatoins. FWIW, I think that I'd be
personally OK with using this method. At least that's safe, simple,
backpatchable and it does not require any new magic. I know that
there are voices that argued against this method, but here we are N
years later, so perhaps we should just do that on HEAD at least.

>> In the end though, the *real* problem here is running on a
>> platform that randomly disallows writes to disk. There's only
>> so much that Postgres can possibly do about unreliability of the
>> underlying platform. I would never run a production database on
>> Windows, because it's just too prone to that sort of BS.
>
> It's surprising that ftruncate() AKA chsize() is able to fail like
> this (I am not a Windows user but AFAIR that sharing stuff obstructs
> stuff like open, unlink, rename, so it surprises me to see it come up
> with ftruncate, since we must already have made it past the open
> stage). Hmm, the documentation is scant, but I know from my attempts
> to use large files that chsize() is probably some kind of wrapper
> around SetEndOfFile() or similar, and that is documented as failing if
> someone has the file mapped. I don't know why someone would have the
> file mapped, though.

(shrug)

> But as for what we should do about it, PANIC (as suggested by several
> people) seems better than corruption, if we're not going to write some
> kind of resilience? How else are we supposed to deal with "this
> shouldn't happen, and if it does we're hosed?"

A PANIC may be OK for this specific syscall and would be better, but
the problematic area is larger than that as we'd still finish with a
corruption as long as there's an ERROR or a FATAL between the moment
the buffers (potentially dirty, with live-still-dead-in-memory tuples
on disk) are discarded and the moment the truncation fails. Another
method discussed is the use of a critical section (I recall that there
were some pallocs in this area, actually, but got nothing on my notes
about that...).
--
Michael

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2023-10-04 23:02:09 Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
Previous Message Thomas Munro 2023-10-04 22:38:24 Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows