Re: race condition when writing pg_control

From: Noah Misch <noah(at)leadboat(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition when writing pg_control
Date: 2024-07-30 23:54:55
Message-ID: 20240730235455.08.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jul 15, 2024 at 03:44:48PM +1200, Thomas Munro wrote:
> On Fri, Jul 12, 2024 at 11:43 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Sat, May 18, 2024 at 05:29:12PM +1200, Thomas Munro wrote:
> > > On Fri, May 17, 2024 at 4:46 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> > > > The specific problem here is that LocalProcessControlFile() runs in
> > > > every launched child for EXEC_BACKEND builds. Windows uses
> > > > EXEC_BACKEND, and Windows' NTFS file system is one of the two file
> > > > systems known to this list to have the concurrent read/write data
> > > > mashing problem (the other being ext4).
> >
> > > First idea idea I've come up with to avoid all of that: pass a copy of
> > > the "proto-controlfile", to coin a term for the one read early in
> > > postmaster startup by LocalProcessControlFile(). As far as I know,
> > > the only reason we need it is to suck some settings out of it that
> > > don't change while a cluster is running (mostly can't change after
> > > initdb, and checksums can only be {en,dis}abled while down). Right?
> > > Children can just "import" that sucker instead of calling
> > > LocalProcessControlFile() to figure out the size of WAL segments yada
> > > yada, I think? Later they will attach to the real one in shared
> > > memory for all future purposes, once normal interlocking is allowed.
> >
> > I like that strategy, particularly because it recreates what !EXEC_BACKEND
> > backends inherit from the postmaster. It might prevent future bugs that would
> > have been specific to EXEC_BACKEND.
>
> Thanks for looking! Yeah, that is a good way to put it.

Oops, the way I put it turned out to be false. Postmaster has ControlFile
pointing to shared memory before forking backends, so !EXEC_BACKEND children
are born that way. In the postmaster, ControlFile->checkPointCopy->redo does
change after each checkpoint.

> The only other idea I can think of is that the Postmaster could take
> all of the things that LocalProcessControlFile() wants to extract from
> the file, and transfer them via that struct used for EXEC_BACKEND as
> individual variables, instead of this new proto-controlfile copy. I
> think it would be a bigger change with no obvious-to-me additional
> benefit, so I didn't try it.

Yeah, that would be more future-proof but a bigger change. One could argue
for a yet-larger refactor so even the !EXEC_BACKEND case doesn't read those
fields from ControlFile memory. Then we could get rid of ControlFile ever
being set to something other than NULL or a shmem pointer. ControlFileData's
mix of initdb-time fields, postmaster-start-time fields, and changes-anytime
fields is inconvenient here.

The unknown is the value of that future proofing. Much EXEC_BACKEND early
startup code is shared with postmaster startup, which can assume it's the only
process. I can't rule out a future bug where that shared code does a read
that's harmless in postmaster startup but harmful when an EXEC_BACKEND child
runs the same read. For a changes-anytime field, the code would already be
subtly buggy in EXEC_BACKEND today, since it would be reading without an
LWLock. For a postmaster-start-time field, things should be okay so long as
we capture the proto ControlFileData after the last change to such fields.
That invariant is not trivial to achieve, but it's not gravely hard either.

A possible middle option would be to use the proto control file, but
explicitly set its changes-anytime fields to bogus values.

What's your preference? I don't think any of these would be bad decisions.
It could be clearer after enumerating how many ControlFile fields get used
this early.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andy Fan 2024-07-30 23:55:39 Re: Comment in portal.h
Previous Message David Rowley 2024-07-30 22:51:25 Re: can we mark upper/lower/textlike functions leakproof?