Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Alexander Lakhin <exclusion(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Date: 2024-05-09 21:56:34
Message-ID: CAAKRu_aUJ-yWy67QnytdNmvKjsAyZtVgnQpsdi-3hk_SbTL3Mg@mail.gmail.com
Lists: pgsql-bugs

On Fri, Apr 26, 2024 at 4:46 PM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> On Mon, Apr 15, 2024 at 2:52 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Hi,
> >
> > On 2024-03-29 12:05:31 -0400, Robert Haas wrote:
> > > Perhaps, for some reason I'm not grokking at the moment, preventing
> > > maybe_needed from retreating would be good enough to prevent trouble
> > > in practice, but it doesn't appear to me to be the most principled
> > > solution as of this writing.
> >
> > If we prevent maybe_needed from retreating below the OldestXmin value used for
> > cutoff, then pruning should never be less aggressive than what's needed by
> > lazy_scan_prune(). So lazy_scan_prune() shouldn't be able to encounter tuples
> > that weren't considered removable in heap_page_prune(), in < 17, nor would
> > heap_page_prune_and_freeze() have that issue.
> >
> > I think one fairly easy fix for this would be to pass OldestXmin to
> > heap_page_prune[_and_freeze](), and have heap_prune_satisfies_vacuum() first
> > check OldestXmin and only if that considers the tuple still running, use
> > GlobalVisTestIsRemovableXid(). That continues to allow us to prune more
> > aggressively, without any potential risk of GlobalVisTestIsRemovableXid()
> > being too conservative.
>
> It seems to me that in all stable versions containing the retry loop
> (from 8523492d4e34), the following can theoretically happen, causing
> an infinite loop in lazy_scan_prune():
>
> 1) tuple's xmin and xmax are older than VacuumCutoffs->OldestXmin
> 2) heap_page_prune()-> heap_prune_satisfies_vacuum()->
> HeapTupleSatisfiesVacuumHorizon() returns HEAPTUPLE_RECENTLY_DEAD
> 3) in GlobalVisTestIsRemovableXid(), dead_after is between
> GlobalVisState->maybe_needed and definitely_needed, causing
> GlobalVisState to be recomputed
> 4) GlobalVisState->maybe_needed moves backwards
> 5) tuple is not removable because dead_after is now newer than maybe_needed
> 6) in lazy_scan_prune(), HeapTupleSatisfiesVacuum() returns
> HEAPTUPLE_DEAD because dead_after is older than OldestXmin
> 7) infinite loop
>
> One way to fix this (as mentioned earlier in this thread) is to
> backport 1ccc1e05ae because it eliminates the retry loop. In our above
> example, lazy_scan_prune() reuses the tuple's HEAPTUPLE_RECENTLY_DEAD
> HTSV_Result instead of recomputing it. We'd have to rule out any of
> the data corruption claims about that commit to justify this, but I am
> not yet convinced that 1ccc1e05ae could cause any problems with
> relfrozenxid advancement.
>
> Andres' idea of passing VacuumCutoffs->OldestXmin to heap_page_prune()
> makes sense. We could add a member to PruneState, oldest_xmin, and
> initialize it to InvalidTransactionId for on-access pruning and to
> VacuumCutoffs->OldestXmin for vacuum. Then, in
> heap_prune_satisfies_vacuum(), we could add this if branch:
>
> if (TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
>
> to here:
>
> -    if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
> +    if (TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
> +        res = HEAPTUPLE_DEAD;
> +    else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
>          res = HEAPTUPLE_DEAD;
>      else if (OldSnapshotThresholdActive())
> This seems like a more targeted fix than backpatching 1ccc1e05ae.
>
> I plan to attempt to write a test case that repros this scenario
> (infinite loop on stable branch) next week.
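
For context, the retry loop in question looks roughly like this on
14/15 (a paraphrased sketch of lazy_scan_prune(), not verbatim source):

retry:

    /* prune using the (movable) GlobalVisState horizon */
    tuples_deleted = heap_page_prune(rel, buf, vistest, ...);

    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ...
        /* recheck using the fixed OldestXmin cutoff */
        res = HeapTupleSatisfiesVacuum(&tuple, vacrel->OldestXmin, buf);
        if (unlikely(res == HEAPTUPLE_DEAD))
        {
            /*
             * Pruning was expected to remove this tuple. If
             * GlobalVisState->maybe_needed has slipped behind
             * OldestXmin, it never will, and we retry forever.
             */
            goto retry;
        }
        ...
    }

If heap_page_prune() judges a tuple RECENTLY_DEAD (per maybe_needed)
while HeapTupleSatisfiesVacuum() judges it DEAD (per OldestXmin), the
goto never terminates.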

I can repro the hang on 14 and 15 with the following:

I start a primary with the following options (the latter two only take
effect on the replica, which inherits them via the basebackup):

wal_level = replica
hot_standby_feedback = on
primary_conninfo = 'host=localhost port=6432 dbname=postgres'

Then I take a basebackup and start the replica.
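
For example, roughly (the replica's port and the file locations are
arbitrary here):

    initdb -D primary
    # add the three settings above to primary/postgresql.conf
    pg_ctl -D primary -o '-p 6432' -l primary.log start

    pg_basebackup -D replica -p 6432 -X stream
    touch replica/standby.signal
    pg_ctl -D replica -o '-p 6433' -l replica.log start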

Then I connect with psql five times (sessions A, B, and C on the
primary and sessions A and B on the replica) and perform the following
steps.

step 1:
PRIMARY SESSION A:
--
CREATE TABLE foo(a INT) WITH (autovacuum_enabled=false);
CREATE INDEX foo_idx ON foo(a);
INSERT INTO foo VALUES (1);
UPDATE foo SET a = 100 WHERE a = 1;

DROP TABLESPACE IF EXISTS foo_tablespace;
CREATE TABLESPACE foo_tablespace LOCATION 'somelocation';

step 2:
REPLICA SESSION A:
--
-- Point primary_conninfo at a bogus port and kill the walreceiver so
-- the replica disconnects and stops sending hot_standby_feedback.
ALTER SYSTEM SET primary_conninfo = 'host=localhost port=9999 dbname=postgres';
SELECT pg_reload_conf();
SELECT pg_terminate_backend((select pid from pg_stat_activity where
backend_type = 'walreceiver'));
-- Hold an old snapshot open with a cursor.
BEGIN;
DECLARE c1 CURSOR FOR SELECT * FROM foo;
FETCH FORWARD FROM c1;

step 3:
PRIMARY SESSION A:
--
INSERT INTO foo VALUES (99);
UPDATE foo SET a = 100 WHERE a = 99;

step 4:
PRIMARY SESSION B:
--
BEGIN;
DECLARE c1 CURSOR FOR SELECT * FROM foo;
FETCH FORWARD FROM c1;

step 5:
PRIMARY SESSION C:
--
BEGIN;
-- Modifying an index on foo queues a relcache invalidation and takes a
-- lock that will block VACUUM when it opens foo's indexes.
ALTER INDEX foo_idx SET TABLESPACE foo_tablespace;

step 6:
PRIMARY SESSION A:
--
-- Computes OldestXmin in vacuum_set_xid_limits(), then blocks on
-- session C's lock when opening foo's indexes.
VACUUM (FREEZE) foo;

step 7:
PRIMARY SESSION B:
--
COMMIT;

step 8:
REPLICA SESSION B:
--
-- Restore the real primary_conninfo; the walreceiver reconnects and
-- hot_standby_feedback begins pushing the primary's horizon back on
-- behalf of the cursor from step 2.
ALTER SYSTEM SET primary_conninfo = 'host=localhost port=6432 dbname=postgres';
SELECT pg_reload_conf();

step 9:
PRIMARY SESSION C:
--
-- Release the lock; VACUUM proceeds, processes the queued
-- invalidation, rebuilds the catalog snapshot, and updates its
-- GlobalVisState.
COMMIT;

step 10:
REPLICA SESSION A:
--
COMMIT;

This infinitely loops in lazy_scan_prune() on 14 and 15. On 16 and
master, everything is normal (for this specific scenario).
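
While it loops, the VACUUM backend is CPU-bound retrying the same page,
so it shows no wait event and pg_stat_progress_vacuum stops advancing.
One way to see it stuck, for example:

    SELECT pid, wait_event, state, query
    FROM pg_stat_activity
    WHERE query LIKE 'VACUUM%';

    SELECT phase, heap_blks_scanned, heap_blks_total
    FROM pg_stat_progress_vacuum;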

The steps roughly follow what I wrote in my previous email.

The difference between 16/master and 14/15 is that in 14/15,
vacuum_set_xid_limits() is called before opening the table's indexes.
A relcache invalidation caused by modifying an index on the table
being vacuumed can therefore force us to rebuild the catalog snapshot
between vacuum_set_xid_limits() and eventually pruning the tuples.
Rebuilding the catalog snapshot may set RecentXmin to a value
different from ComputeXidHorizonsResultLastXmin
(ComputeXidHorizonsResultLastXmin is set to RecentXmin during
vacuum_set_xid_limits()). This gives GlobalVisTestShouldUpdate() the
opportunity to return true while pruning RECENTLY_DEAD tuples.

See this line in GlobalVisTestShouldUpdate():
return RecentXmin != ComputeXidHorizonsResultLastXmin;
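
In context, that function is shaped like this (paraphrased from
procarray.c, abbreviated):

    static bool
    GlobalVisTestShouldUpdate(GlobalVisState *state)
    {
        ...
        /*
         * If no snapshot has been computed since the last horizon
         * computation, RecentXmin still equals
         * ComputeXidHorizonsResultLastXmin and we skip updating the
         * GlobalVisState. A catalog snapshot rebuild in between makes
         * them differ, so the state is recomputed -- and, with
         * hot_standby_feedback, maybe_needed can move backwards.
         */
        return RecentXmin != ComputeXidHorizonsResultLastXmin;
    }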

On 16 and master, AFAICT, RecentXmin will always equal
ComputeXidHorizonsResultLastXmin in GlobalVisTestShouldUpdate() when
called through heap_prune_satisfies_vacuum(). So, even if a wal sender
pushes the horizon backwards on the primary, vacuum's
GlobalVisState->maybe_needed won't be pushed back to a value lower
than OldestXmin.
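
Spelled out, the targeted fix quoted above would leave
heap_prune_satisfies_vacuum() shaped roughly like this (a sketch, not a
tested patch; oldest_xmin is the proposed new PruneState member, set to
InvalidTransactionId for on-access pruning):

    static HTSV_Result
    heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup,
                                Buffer buffer)
    {
        HTSV_Result res;
        TransactionId dead_after;

        res = HeapTupleSatisfiesVacuumHorizon(tup, buffer, &dead_after);

        if (res != HEAPTUPLE_RECENTLY_DEAD)
            return res;

        /*
         * First honor the fixed cutoff vacuum has already promised to
         * respect; only consult the (movable) GlobalVisState if the
         * fixed cutoff says the tuple must stay.
         */
        if (TransactionIdIsValid(prstate->oldest_xmin) &&
            TransactionIdPrecedes(dead_after, prstate->oldest_xmin))
            res = HEAPTUPLE_DEAD;
        else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
            res = HEAPTUPLE_DEAD;

        /* (old_snapshot_threshold handling elided) */

        return res;
    }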

I haven't yet written the fix and tested it. I want to investigate a
bit more and write more detailed notes on the repro steps. I also want
to fully convince myself this isn't possible on master or 16.

- Melanie
