pgsql: Fix cache invalidation bug in recovery_prefetch.

From: Thomas Munro <tmunro(at)postgresql(dot)org>
To: pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: pgsql: Fix cache invalidation bug in recovery_prefetch.
Date: 2022-09-03 01:38:26
Message-ID: E1oUI77-001AEM-G3@gemulon.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers

Fix cache invalidation bug in recovery_prefetch.

XLogPageRead() can retry internally after a pread() system call has
succeeded, in the case of short reads, and page validation failures
while in standby mode (see commit 0668719801). Due to an oversight in
commit 3f1ce973, these cases could leave stale data in the internal
cache of xlogreader.c without marking it invalid. The main defense
against stale cached data on failure to read a page was in the error
handling path of the calling function ReadPageInternal(), but that
wasn't quite enough for errors handled internally by XLogPageRead()'s
retry loop if we then exited with XLREAD_WOULDBLOCK.

1. ReadPageInternal() now marks the cache invalid before calling the
page_read callback, by setting state->readLen to 0. It'll be set to
a non-zero value only after a successful read. It'll stay valid as
long as the caller requests data in the cached range.

2. XLogPageRead() no long performs internal retries while reading
ahead. While such retries should work, the general philosophy is
that we should give up prefetching if anything unusual happens so we
can handle it when recovery catches up, to reduce the complexity of
the system. Let's do that here too.

3. While here, a new function XLogReaderResetError() improves the
separation between xlogrecovery.c and xlogreader.c, where the former
previously clobbered the latter's internal error buffer directly.
The new function makes this more explicit, and also clears a related
flag, without which a standby would needlessly retry in the outer
function.

Thanks to Noah Misch for tracking down the conditions required for a
rare build farm failure in src/bin/pg_ctl/t/003_promote.pl, and
providing a reproducer.

Back-patch to 15.

Reported-by: Noah Misch <noah(at)leadboat(dot)com>
Discussion: https://postgr.es/m/20220807003627.GA4168930%40rfd.leadboat.com

Branch
------
REL_15_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/d0d934490020f9311662da8cdba4ccc2070e420d

Modified Files
--------------
src/backend/access/transam/xlogreader.c | 24 +++++++++++++++++++-----
src/backend/access/transam/xlogrecovery.c | 10 +++++++++-
src/include/access/xlogreader.h | 3 +++
3 files changed, 31 insertions(+), 6 deletions(-)

Browse pgsql-committers by date

  From Date Subject
Next Message Bruce Momjian 2022-09-03 01:48:05 pgsql: relnotes: improve collation check and ICU items
Previous Message Thomas Munro 2022-09-03 01:38:15 pgsql: Fix cache invalidation bug in recovery_prefetch.