Re: HOT chain validation in verify_heapam()

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Himanshu Upadhyaya <upadhyaya(dot)himanshu(at)gmail(dot)com>, Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: HOT chain validation in verify_heapam()
Date: 2022-11-15 16:36:21
Message-ID: CA+TgmoY+QZLsB9m+JgmFHiVonhE_utc+6dWgYZjCwXa0-=SSng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 14, 2022 at 5:02 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2022-11-14 14:27:54 -0500, Robert Haas wrote:
> > On Wed, Nov 9, 2022 at 5:08 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > I don't really understand this logic - why can't we populate the predecessor
> > > array, if we can construct a successor entry?
> >
> > This whole thing was my idea, so let me try to explain. I think the
> > naming and comments need work, but I believe the fundamental idea may
> > be sound.
> >
> > successor[x] = y means that when we looked at line pointer x, we saw
> > that it was either a redirect to line pointer y, or else it had
> > storage and the associated tuple's CTID pointed to line pointer y.
>
> > At this point, we do not have any idea whether y is at all sane, nor we do
> > we know anything about which of x and y is larger.
>
> What do you mean with "larger" here?

Numerically bigger. As in, a redirect line pointer or CTID will most
commonly point to a tuple that appears later in the line pointer
array, because we assign offset numbers in ascending order. But we
also reuse line pointers, so it's possible for a redirect line pointer
or CTID to point backwards to a lower-number offset.

> > Furthermore, it is
> > possible that successor[x] = successor[x'] since the page might be corrupted
> > and we haven't checked otherwise.
> >
> > predecessor[y] = x means that successor[x] = y but in addition we've
> > checked that y is sane, and that x.xmax=y.xmin. If there are multiple
> > tuples for which these conditions hold, we've issued complaints about
> > all but one and entered the last into the predecessor array.
>
> As shown by the isolationtester test I just posted, this doesn't quite work
> right now. Probably fixable.
>
> I don't think we can follow non-HOT ctid chains if they're older than the xmin
> horizon, including all cases of xmin being frozen. There's just nothing
> guaranteeing that the tuples are actually "related".

Yeah, glad you caught that. I think it's clearly wrong to regard A and
B as related if B.xmin is frozen. It doesn't matter whether it's
"old-style" frozen where we actually put 2 into the xmin field, or
new-style frozen where we set hint bits. Because, even if it's
new-style frozen and the value actually stored in B.xmin is
numerically equal to A.xmax, it could be from a different epoch. If
it's from the same epoch, then something is corrupted, because when we
froze B we should have pruned A. But we have no way of knowing whether
that's the case, and shouldn't assume corruption.

> It seems like we should do a bit more validation within a chain of
> tuples. E.g. that no live tuple can follow an !DidCommit xmin?

I think this check is already present in stronger form. If we see a
!DidCommit xmin, the xmin of the next tuple in the chain not only
can't be committed, but had better be the same. See "tuple with
uncommitted xmin %u was updated to produce a tuple at offset %u with
differing xmin %u".

> I now think that the 9.4 specific reasoning is bogus in the first place. The
> patch says:
>
> * Add a line pointer offset to the predecessor array if xmax is
> * matching with xmin of next tuple (reaching via its t_ctid).
> * Prior to PostgreSQL 9.4, we actually changed the xmin to
> * FrozenTransactionId so we must add offset to predecessor
> * array(irrespective of xmax-xmin matching) if updated tuple xmin
> * is frozen, so that we can later do validation related to frozen
> * xmin. Raise corruption if we have two tuples having the same
> * predecessor.
>
> but it's simply not correct to iterate through xmin=FrozenTransactionId - as
> shown in the isolationtester test. And that's unrelated to 9.4, because we
> couldn't rely on the raw xmin value either, because even if they match, they
> could be from different epochs.

I agree completely.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2022-11-15 16:45:13 Re: archive modules
Previous Message Andrew Dunstan 2022-11-15 16:35:55 Re: Error-safe user functions