Re: Incorrect snapshots while promoting hot standby node when 2PC is used

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject:	Re: Incorrect snapshots while promoting hot standby node when 2PC is used
Date:	2021-05-04 17:13:37
Message-ID:	20210504171337.o2fathpgatalkvm2@alap3.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2021-05-04 12:32:34 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > Michael Paquier (running locally I think), and subsequently Thomas Munro
> > (noticing [1]), privately reported that they noticed an assertion failure in
> > GetSnapshotData(). Both reasonably were wondering if that's related to the
> > snapshot scalability patches.
> > Michael reported the following assertion failure in 023_pitr_prepared_xact.pl:
> >> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2468, PID: 22901)
>
> mantid just showed a failure that looks like the same thing, at
> least it's also in 023_pitr_prepared_xact.pl:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mantid&dt=2021-05-03%2013%3A07%3A06
>
> The assertion line number is rather different though:
>
> TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2094, PID: 1163004)

I managed to hit that one as well, and it's also what fairywren hit - the
assertions at lines 2094 and 2468 are basically copies of the same check, and
which one is hit is a question of timing.

> and interestingly, this happened in a parallel worker:

I think the issue can be hit (or rather detected) whenever a transaction
builds one snapshot while in recovery, and a second one during
end-of-recovery. The parallel query here is just
2021-05-03 09:18:35.602 EDT [1162987:6] DETAIL: Failed process was running: SELECT pg_is_in_recovery() = 'f';
(parallel due to force_parallel_mode) - which of course is likely to run
during end-of-recovery.

So it does seem like the same bug of resetting the KnownAssignedXids
stuff too early.
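
To illustrate what the failing check guards, here is a minimal standalone sketch. It is not the procarray.c code: ToyBackend and toy_take_snapshot are hypothetical names, and the only part meant to mirror PostgreSQL is the circular comparison for normal XIDs. The model assumes the first snapshot of a transaction fixes its xmin, and that every later snapshot's xmin must not precede it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Circular (modulo-2^32) comparison for normal XIDs. Special XIDs are
 * not handled in this sketch.
 */
static bool
xid_precedes_or_equals(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) <= 0;
}

/* Hypothetical per-backend state, standing in for TransactionXmin/RecentXmin. */
typedef struct ToyBackend
{
    bool          have_transaction_xmin;
    TransactionId transaction_xmin;   /* xmin of the transaction's first snapshot */
    TransactionId recent_xmin;        /* xmin of the most recent snapshot */
} ToyBackend;

/*
 * Record the xmin of a newly built snapshot. Returns false in the case
 * where the real assertion would fire: the new snapshot's xmin precedes
 * the xmin the transaction already advertised.
 */
static bool
toy_take_snapshot(ToyBackend *b, TransactionId snapshot_xmin)
{
    b->recent_xmin = snapshot_xmin;
    if (!b->have_transaction_xmin)
    {
        b->transaction_xmin = snapshot_xmin;
        b->have_transaction_xmin = true;
    }
    return xid_precedes_or_equals(b->transaction_xmin, b->recent_xmin);
}
```

In this model, a first snapshot with xmin 100 followed by one with xmin 105 passes, while a later snapshot with xmin 90 makes toy_take_snapshot() return false. That is the kind of inconsistency between two snapshots of one transaction that the assertion detects; the XID values are made up for illustration.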

Greetings,

Andres Freund

In response to

Re: Incorrect snapshots while promoting hot standby node when 2PC is used at 2021-05-04 16:32:34 from Tom Lane