Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Date: 2021-06-28 02:45:56
Message-ID: CA+hUKGLbK6j-jxf=2odz2kuEEwcRxjJiko=4uMtXzktQ4KwzaA@mail.gmail.com
Lists: pgsql-bugs

> We haven't heard of broken control files from the field, so that doesn't
> seem to be a problem in practice, at least not yet. Still, I would sleep
> better if the control file had more redundancy. For example, have two
> copies of it on disk. At startup, read both copies, and if they're both
> valid, ignore the one with the older timestamp. When updating it, write over
> the older copy. That way, if you crash in the middle of updating it, the
> old copy is still intact.

Seems like a good idea. I somehow doubt that accessing pmem through
old-school read()/write() interfaces is the future of databases, but
ideally this should work correctly, and the dependency is indeed
unnecessary if we are prepared to jump through more hoops in just a
couple of places. There may also be other benefits. In hindsight,
it's a bit strange that we don't have explicit documentation of this
requirement. There is some related (and rather dated) discussion of
sectors in wal.sgml but nothing to say that we need 512-byte atomic
sectors for correct operation, unless I've managed to miss it (even
though it's well known among people who read the source code).

I experimented with a slightly different approach, attached, and a TAP
test to exercise it. Instead of alternating between two copies, I
tried writing out both copies every time with a synchronisation
barrier in between (the same double-write principle some other
database uses to deal with torn data pages). I think it's mostly
equivalent to your scheme, though the updates are of course slower. I
was thinking that there may be other benefits to having two copies of
the "current" version around, for resilience (though perhaps they
should be in separate files, not done here), and maybe it's better to
avoid having to invent a timestamp scheme. Or maybe the two ideas
should be combined: when both CRC checks pass, you could still be more
careful which one you choose than I have been here. Or maybe trying
to be resilient against handwavy unknown forms of corruption is a
waste of time. I'm not proposing anything here, I was just trying out
ideas, for discussion.
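
To make the double-write idea concrete, here is a minimal standalone
sketch. It is not taken from the attached patch. DemoControl,
SLOT_SIZE, demo_crc32() and the two demo_* functions are hypothetical
names invented for illustration, and the plain CRC-32 is only a
stand-in for whatever checksum the real control file carries. The
writer puts the same record into two slots with an fsync() between
them, so at most one slot can be torn by a crash. The reader takes the
first slot whose CRC verifies, which is the simple choice described
above rather than anything more careful.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical spacing between the two copies, chosen so that they
 * never share a sector. */
#define SLOT_SIZE 8192

/* Stand-in for the real control data; crc covers the fields before it. */
typedef struct DemoControl
{
    uint64_t checkpoint;
    uint32_t state;
    uint32_t crc;
} DemoControl;

static_assert(sizeof(DemoControl) <= SLOT_SIZE, "record must fit in a slot");

/* Bitwise CRC-32, used here only as a placeholder checksum. */
static uint32_t
demo_crc32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
    {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Write both copies, flushing after each one. Returns 0 on success. */
static int
demo_write_control(const char *path, DemoControl *c)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    c->crc = demo_crc32(c, offsetof(DemoControl, crc));
    for (int slot = 0; slot < 2; slot++)
    {
        /* The fsync() is the barrier: this copy must be durable
         * before the other copy is overwritten. */
        if (pwrite(fd, c, sizeof(*c), (off_t) slot * SLOT_SIZE)
            != (ssize_t) sizeof(*c) || fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
    }
    return close(fd);
}

/* Read the first copy whose CRC verifies. Returns the slot number
 * used (0 or 1), or -1 if neither copy is valid. */
static int
demo_read_control(const char *path, DemoControl *out)
{
    int fd = open(path, O_RDONLY);
    int found = -1;

    if (fd < 0)
        return -1;
    for (int slot = 0; slot < 2 && found < 0; slot++)
    {
        DemoControl c;

        if (pread(fd, &c, sizeof(c), (off_t) slot * SLOT_SIZE)
            == (ssize_t) sizeof(c)
            && c.crc == demo_crc32(&c, offsetof(DemoControl, crc)))
        {
            *out = c;
            found = slot;
        }
    }
    close(fd);
    return found;
}
```

With this write order, a crash while slot 0 is being written leaves
the old contents intact in slot 1, and a crash while slot 1 is being
written leaves the new contents already durable in slot 0, so the
first-valid rule never returns a torn record. It does nothing for the
case where both slots pass their CRC checks but differ.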

Attachment: 0001-Remove-control-file-dependency-on-512-byte-sectors.patch (text/x-patch, 14.6 KB)
