pgsql: Fix bugs in vacuum of shared rels, by keeping their relcache ent

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: pgsql: Fix bugs in vacuum of shared rels, by keeping their relcache ent
Date: 2018-06-12 18:31:08
Message-ID: E1fSo4S-00051p-RO@gemulon.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers

Fix bugs in vacuum of shared rels, by keeping their relcache entries current.

When vacuum processes a relation it uses the corresponding relcache
entry's relfrozenxid / relminmxid as a cutoff for when to remove
tuples etc. Unfortunately for nailed relations (i.e. critical system
catalogs) bugs could frequently lead to the corresponding relcache
entry being stale.

This set of bugs could cause actual data corruption as vacuum would
potentially not remove the correct row versions, potentially reviving
them at a later point. After 699bf7d05c some corruptions in this vein
were prevented, but the additional error checks could also trigger
spuriously. Examples of such errors are:
ERROR: found xmin ... from before relfrozenxid ...
and
ERROR: found multixact ... from before relminmxid ...
To be caused by this bug the errors have to occur on system catalog
tables.

The two bugs are:

1) Invalidations for nailed relations were ignored, based on the
theory that the relcache entry for such tables doesn't
change. Which is largely true, except for fields like relfrozenxid
etc. This means that changes to relations vacuumed in other
sessions weren't picked up by already existing sessions. Luckily
autovacuum doesn't have particularly longrunning sessions.

2) For shared *and* nailed relations, the shared relcache init file
was never invalidated while running. That means that for such
tables (e.g. pg_authid, pg_database) it's not just already existing
sessions that are affected, but even new connections are as well.
That explains why the reports usually were about pg_authid et. al.

To fix 1), revalidate the rd_rel portion of a relcache entry when
invalid. This implies a bit of extra complexity to deal with
bootstrapping, but it's not too bad. The fix for 2) is simpler,
simply always remove both the shared and local init files.

Author: Andres Freund
Reviewed-By: Alvaro Herrera
Discussion:
https://postgr.es/m/20180525203736.crkbg36muzxrjj5e@alap3.anarazel.de
https://postgr.es/m/CAMa1XUhKSJd98JW4o9StWPrfS=11bPgG+_GDMxe25TvUY4Sugg@mail.gmail.com
https://postgr.es/m/CAKMFJucqbuoDRfxPDX39WhA3vJyxweRg_zDVXzncr6+5wOguWA@mail.gmail.com
https://postgr.es/m/CAGewt-ujGpMLQ09gXcUFMZaZsGJC98VXHEFbF-tpPB0fB13K+A@mail.gmail.com
Backpatch: 9.3-

Branch
------
REL_10_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/2ce64caaf793615d0f7a7ce4380e7255077d130c

Modified Files
--------------
src/backend/utils/cache/inval.c | 32 ++++---
src/backend/utils/cache/relcache.c | 186 +++++++++++++++++++++++++------------
src/include/storage/standbydefs.h | 2 +-
3 files changed, 146 insertions(+), 74 deletions(-)

Browse pgsql-committers by date

  From Date Subject
Next Message Andres Freund 2018-06-12 18:31:10 pgsql: Fix bugs in vacuum of shared rels, by keeping their relcache ent
Previous Message Andres Freund 2018-06-12 18:31:07 pgsql: Fix bugs in vacuum of shared rels, by keeping their relcache ent