From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Hot standby queries see transient all-zeros pages |
Date: | 2024-12-14 02:41:06 |
Message-ID: | mhjyuguex77h5env7vtgdwb5pnaqikyx5xgwjthcmzuazlowgn@rcjvqvjizk45 |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2024-12-14 13:41:40 +1300, Thomas Munro wrote:
> On Sat, Dec 14, 2024 at 11:41 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> > > The use of RBM_ZERO_AND_LOCK is incompatible with that. See a similar
> > > argument at https://postgr.es/m/flat/5101(dot)1328219790(at)sss(dot)pgh(dot)pa(dot)us that led
> > > me to the cause. Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
> > > failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp))
> > > in heapgettup_pagemode(). In the core file, lpp pointed into an all-zeros
> > > page. RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby
> > > existed, but it wasn't a bug until queries could run concurrently.
> >
> > Yep, this does sound like a problem.
>
> This was a bug of mine that was fixed later in another thread, which
> Noah might have mistaken for intended-but-wrong behaviour, because the
> name is weird IMHO.
>
> https://www.postgresql.org/message-id/flat/7ed10231-ce47-03d5-d3f9-4aea0dc7d5a4%40gmail.com
The name is weird...
> > Hm. Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
> > RestoreBlockImage() into a buffer that currently is pinned? Not sure if
> > there's actually all that much guarantee what transient state one can read
> > when reading a page concurrently to a memcpy(). I suspect it's practically
> > rare to see a problem, but one could imagine an memcpy implementation that
> > uses non-temporal writes, which afaict would leave you open to seeing quite
> > random states when reading concurrently, as the cache coherence protocol
> > doesn't protect anymore.
>
> I also found that strange, while working on commit e656657f. I
> observed and pondered the behaviour you mentioned while testing that:
> you're scanning a page holding only a pin, which on a primary would
> allow others to add more (invisible to you) tuples that you shouldn't
> attempt to access (based on visibility checks performed while you held
> the lock), and in recovery the exact same thing is possible, except
> that if it happens to include a FPI then it *will* overwrite the
> tuples that you are scanning... with the exact same ones-and-zeros.
> But I couldn't think of anything specifically wrong with it, ie that
> might break. Well, maybe it's not exactly the same ones-and-zeros in
> some hint-bit scenario, but if that is considered acceptable on a
> primary too (well maybe you're getting rid of it), but what's the
> difference?
FWIW, much more than hint bits can change outside of recovery - you're not
holding a lock on the tuple, so xmin/xmax and non-hint fields in infomask* can
change due to update/delete.
> Oh... that's a plain store and you're thinking that memcpy() might be
> allowed to use some weirder cache coherency modes for speed, ... sounds
> likely to break a lot of stuff, is that a real thing?!
It would have to synchronize at the *end* of the memcpy. But not necessarily
before. I think our pages are probably too small to make it likely that a
memcpy with such a path would use it. I'd assume the logic would be something
along the lines of "If the to-be-copied size is bigger than L3*2, use
non-temporal stores to read/write the data, as we won't have any cache hits".
> > > I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> > > already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> > > Otherwise, do RBM_ZERO_AND_LOCK."
> >
> > I think that should work. At least in the current code it looks near trivial
> > to implement, although the branch differences are going to be annoying.
> > As usual the hardest part would probably be the naming. Maybe
> > RBM_ZERO_ON_MISS_LOCK? RBM_LOCK_ZERO_ON_MISS? RBM_DWIM?
>
> That's already how RBM_ZERO_AND_LOCK works, and you are the third
> victim of its confusing name counting me (that's how I broke it in the
> vectored stuff) and Noah IIUC what happened here.
I think I knew at some point that it didn't work that way, but then got swept
up with worry upon reading Noah's thread, while already three layers deep in
something else :)
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | John Naylor | 2024-12-14 03:00:03 | Re: typo in a comment of restrictinfo.c |
Previous Message | Andres Freund | 2024-12-14 02:34:21 | Re: Hot standby queries see transient all-zeros pages |