Quick Links

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject:	Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date:	2019-03-04 21:59:13
Message-ID:	CAH2-WzkETf6tfQYLg8aqtuy8omAGtg9TDH6ZACS_v7vDghR8wA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sun, Mar 3, 2019 at 10:02 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Some comments on
> v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below.
> Mostly about code comments. In general, I think a round of copy-editing
> the comments, to use simpler language, would do good. The actual code
> changes look good to me.

I'm delighted that the code looks good to you, and makes sense
overall. I worked hard to make the patch a natural adjunct to the
existing code, which wasn't easy.

> Seems confusing to first say assertively that "*bufptr contains the page
> that the new tuple unambiguously belongs to", and then immediately go on
> to list a whole bunch of exceptions. Maybe just remove "unambiguously".

This is fixed in v14 of the patch series.

> This happens very seldom, because you only get an incomplete split if
> you crash in the middle of a page split, and that should be very rare. I
> don't think we need to list more fine-grained conditions here, that just
> confuses the reader.

Fixed in v14.

> > /*
> > * _bt_useduplicatepage() -- Settle for this page of duplicates?

> So, this function is only used for legacy pg_upgraded indexes. The
> comment implies that, but doesn't actually say it.

I made that more explicit in v14.

> > /*
> > * Get tie-breaker heap TID attribute, if any. Macro works with both pivot
> > * and non-pivot tuples, despite differences in how heap TID is represented.

> > #define BTreeTupleGetHeapTID(itup) ...

I fixed up the comments above BTreeTupleGetHeapTID() significantly.

> The comment claims that "all pivot tuples must be as of BTREE_VERSION
> 4". I thought that all internal tuples are called pivot tuples, even on
> version 3.

In my mind, "pivot tuple" is a term that describes any tuple that
contains a separator key, which could apply to any nbtree version.
It's useful to have a distinct term (to not just say "separator key
tuple") because Lehman and Yao think of separator keys as separate and
distinct from downlinks. Internal page splits actually split *between*
a separator key and a downlink. So nbtree internal page splits must
split "inside a pivot tuple", leaving its separator on the left hand
side (new high key), and its downlink on the right hand side (new
minus infinity tuple).

Pivot tuples may contain a separator key and a downlink, just a
downlink, or just a separator key (sometimes this is implicit, and the
block number is garbage). I am particular about the terminology
because the pivot tuple vs. downlink vs. separator key thing causes a
lot of confusion, particularly when you're using Lehman and Yao (or
Lanin and Shasha) to understand how things work in Postgres.

We wan't to have a broad term that refers to the tuples that describe
the keyspace (pivot tuples), since it's often helpful to refer to them
collectively, without seeming to contradict Lehman and Yao.

> I think what this means to say is that this macro is only
> used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only
> have a heap TID in BTREE_VERSION 4 indexes.

My high level approach to pg_upgrade/versioning is for index scans to
*pretend* that every nbtree index (even on v2 and v3) has a heap
attribute that actually makes the keys unique. The difference is that
v4 gets to use a scantid, and actually rely on the sort order of heap
TIDs, whereas pg_upgrade'd indexes "are not allowed to look at the
heap attribute", and must never provide a scantid (they also cannot
use the !minusinfkey optimization, but this is only an optimization
that v4 indexes don't truly need). They always do the right thing
(move left) on otherwise-equal pivot tuples, since they have no
scantid.

That's why _bt_compare() can use BTreeTupleGetHeapTID() without caring
about the version the index uses. It might be NULL for a pivot tuple
in a v3 index, even though we imagine/pretend that it should have a
value set. But that doesn't matter, because higher level code knows
that !heapkeyspace indexes should never get a scantid (_bt_compare()
does Assert() that they got that detail right, though). We "have no
reason to peak", because we don't have a scantid, so index scans work
essentially the same way, regardless of the version in use.

There are a few specific cross-version things that we need think about
outside of making sure that there is never a scantid (and !minusinfkey
optimization is unused) in < v4 indexes, but these are all related to
unique indexes. "Pretending that all indexes have a heap TID" is a
very useful mental model. Nothing really changes, even though you
might guess that changing the classic "Subtree S is described by Ki <
v <= Ki+1" invariant would need to break code in
_bt_binsrch()/_bt_compare(). Just pretend that the classic invariant
was there since the Berkeley days, and don't do anything that breaks
the useful illusion on versions before v4.

> This macro (and many others in nbtree.h) is quite complicated. A static
> inline function might be easier to read.

I agree that the macros are complicated, but that seems to be because
the rules are complicated. I'd rather leave the macros in place, and
improve the commentary on the rules.

> 'xlmeta.version' is set incorrectly.

Oops. Fixed in v14.

> I find this comment difficult to read. I suggest rewriting it to:
>
> /*
> * The current Btree version is 4. That's what you'll get when you create
> * a new index.

I used your wording for this in v14, almost verbatim.

> Now that the index tuple format becomes more complicated, I feel that
> there should be some kind of an overview explaining the format. All the
> information is there, in the comments in nbtree.h, but you have to piece
> together all the details to get the overall picture. I wrote this to
> keep my head straight:

v14 uses your diagrams in nbtree.h, and expands some existing
discussion of INCLUDE indexes/non-key attributes/tuple format. Let me
know what you think.

--
Peter Geoghegan

In response to

Re: Making all nbtree entries unique by having heap TIDs participate in comparisons at 2019-03-04 06:02:13 from Heikki Linnakangas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Paul Ramsey	2019-03-04 21:59:25	Re: Allowing extensions to supply operator-/function-specific info
Previous Message	Justin Pryzby	2019-03-04 21:45:01	Re: Question about pg_upgrade from 9.2 to X.X