Re: Hot standby queries see transient all-zeros pages

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hot standby queries see transient all-zeros pages
Date: 2024-12-14 02:41:06
Message-ID: mhjyuguex77h5env7vtgdwb5pnaqikyx5xgwjthcmzuazlowgn@rcjvqvjizk45
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2024-12-14 13:41:40 +1300, Thomas Munro wrote:
> On Sat, Dec 14, 2024 at 11:41 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > > The use of RBM_ZERO_AND_LOCK is incompatible with that. See a similar
> > > argument at https://postgr.es/m/flat/5101(dot)1328219790(at)sss(dot)pgh(dot)pa(dot)us that led
> > > me to the cause. Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
> > > failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp))
> > > in heapgettup_pagemode(). In the core file, lpp pointed into an all-zeros
> > > page. RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby
> > > existed, but it wasn't a bug until queries could run concurrently.
> >
> > Yep, this does sound like a problem.
>
> This was a bug of mine that was fixed later in another thread, which
> Noah might have mistaken for intended-but-wrong behaviour, because the
> name is weird IMHO.
>
> https://www.postgresql.org/message-id/flat/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com

The name is weird...

> > Hm. Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> > RestoreBlockImage() into a buffer that currently is pinned? Not sure if
> > there's actually all that much guarantee what transient state one can read
> > when reading a page concurrently to a memcpy(). I suspect it's practically
> > rare to see a problem, but one could imagine an memcpy implementation that
> > uses non-temporal writes, which afaict would leave you open to seeing quite
> > random states when reading concurrently, as the cache coherence protocol
> > doesn't protect anymore.
>
> I also found that strange, while working on commit e656657f. I
> observed and pondered the behaviour you mentioned while testing that:
> you're scanning a page holding only a pin, which on a primary would
> allow others to add more (invisible to you) tuples that you shouldn't
> attempt to access (based on visibility checks performed while you held
> the lock), and in recovery the exact same thing is possible, except
> that if it happens to include a FPI then it *will* overwrite the
> tuples that you are scanning... with the exact same ones-and-zeros.
> But I couldn't think of anything specifically wrong with it, ie that
> might break. Well, maybe it's not exactly the same ones-and-zeros in
> some hint-bit scenario, but if that is considered acceptable on a
> primary too (well maybe you're getting rid of it), but what's the
> difference?

FWIW, much more than hint bits can change outside of recovery - you're not
holding a lock on the tuple, so xmin/xmax and non-hint fields in infomask* can
change due to update/delete.

> Oh... that's a plain store and you're thinking that memcpy() might be
> allowed to use some weirder cache coherency modes for speed, ... sounds
> likely to break a lot of stuff, is that a real thing?!

It would have to synchronize at the *end* of the memcpy. But not necessarily
before. I think our pages are probably too small to make it likely that a
memcpy with such a path would use it. I'd assume the logic would be something
along the lines of "If the to-be-copied size is bigger than L3*2, use
non-temporal stores to read/write the data, as we won't have any cache hits".

> > > I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> > > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> > > Otherwise, do RBM_ZERO_AND_LOCK."
> >
> > I think that should work. At least in the current code it looks near trivial
> > to implement, although the branch differences are going to be annoying.
> > As usual the hardest part would probably be the naming. Maybe
> > RBM_ZERO_ON_MISS_LOCK? RBM_LOCK_ZERO_ON_MISS? RBM_DWIM?
>
> That's already how RBM_ZERO_AND_LOCK works, and you are the third
> victim of its confusing name counting me (that's how I broke it in the
> vectored stuff) and Noah IIUC what happened here.

I think I knew at some point that it didn't work that way, but then got swept
up with worry upon reading Noah's thread, while already three layers deep in
something else :)

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2024-12-14 03:00:03 Re: typo in a comment of restrictinfo.c
Previous Message Andres Freund 2024-12-14 02:34:21 Re: Hot standby queries see transient all-zeros pages