Re: Hot standby queries see transient all-zeros pages

From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hot standby queries see transient all-zeros pages
Date: 2024-12-13 22:41:15
Message-ID: yrqjqqiqiyhb4pwknw2vphrfhmtwmdn7xiqeux32q65lsox3h7@dzktq4py4lvw
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Just found this thread because I was looking for discussions for some
behaviour of XLogReadBufferExtended()...

Afaics we didn't do anything about this issue?

I just tried to repro this, without success so far. But it very well might be
sufficiently timing dependent that it just happens to not trigger on my
machine. I'm actually not sure what'd trigger a problematic pattern in
027_stream_regress.pl, there's not a whole lot of reading while replaying
records.

On 2024-05-12 10:16:58 -0700, Noah Misch wrote:
> XLogReadBufferForRedoExtended() precedes RestoreBlockImage() with
> RBM_ZERO_AND_LOCK. Per src/backend/storage/buffer/README:
>
> Once one has determined that a tuple is interesting (visible to the current
> transaction) one may drop the content lock, yet continue to access the
> tuple's data for as long as one holds the buffer pin.
>
> The use of RBM_ZERO_AND_LOCK is incompatible with that. See a similar
> argument at https://postgr.es/m/flat/5101(dot)1328219790(at)sss(dot)pgh(dot)pa(dot)us that led
> me to the cause. Adding a 10ms sleep just after RBM_ZERO_AND_LOCK, I got 2
> failures in 7 runs of 027_stream_regress.pl, at Assert(ItemIdIsNormal(lpp))
> in heapgettup_pagemode(). In the core file, lpp pointed into an all-zeros
> page. RestoreBkpBlocks() had been doing RBM_ZERO years before hot standby
> existed, but it wasn't a bug until queries could run concurrently.

Yep, this does sound like a problem.

I looked around to see if other uses of RBM_ZERO_AND_LOCK are problematic, but
so far I didn't find anything. Most paths that lead to RBM_ZERO_AND_LOCK being
used should be safe, due to those pages not having valid contents beforehand
when in an consistent state. E.g. the RBM_ZERO_AND_LOCK that originate from
XLogInitBufferForRedo should only ever target pages that have been empty (when
consistent) and thus shouldn't have anybody pointing into an unlocked buffer.

Hm. Leaving RBM_ZERO_AND_LOCK aside, is it actually always safe to do
RestoreBlockImage() into a buffer that currently is pinned? Not sure if
there's actually all that much guarantee what transient state one can read
when reading a page concurrently to a memcpy(). I suspect it's practically
rare to see a problem, but one could imagine an memcpy implementation that
uses non-temporal writes, which afaict would leave you open to seeing quite
random states when reading concurrently, as the cache coherence protocol
doesn't protect anymore.

> I suspect the fix is to add a ReadBufferMode specified as, "If the block is
> already in shared_buffers, do RBM_NORMAL and exclusive-lock the buffer.
> Otherwise, do RBM_ZERO_AND_LOCK."

I think that should work. At least in the current code it looks near trivial
to implement, although the branch differences are going to be annoying.

As usual the hardest part would probably be the naming. Maybe
RBM_ZERO_ON_MISS_LOCK? RBM_LOCK_ZERO_ON_MISS? RBM_DWIM?

> Like RBM_ZERO_AND_LOCK, it avoids wasting disk reads
> on data we discard. Are there other strategies to consider?

I guess we could just use RBM_ZERO_AND_CLEANUP_LOCK for this path. But that
seems like it has too many obvious downsides.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2024-12-13 22:53:00 Re: Eagerly scan all-visible pages to amortize aggressive vacuum
Previous Message Peter Geoghegan 2024-12-13 22:26:22 Re: bt_index_parent_check and concurrently build indexes