From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru> |
Subject: | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons |
Date: | 2019-03-04 21:59:13 |
Message-ID: | CAH2-WzkETf6tfQYLg8aqtuy8omAGtg9TDH6ZACS_v7vDghR8wA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sun, Mar 3, 2019 at 10:02 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Some comments on
> v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below.
> Mostly about code comments. In general, I think a round of copy-editing
> the comments, to use simpler language, would do good. The actual code
> changes look good to me.
I'm delighted that the code looks good to you, and makes sense
overall. I worked hard to make the patch a natural adjunct to the
existing code, which wasn't easy.
> Seems confusing to first say assertively that "*bufptr contains the page
> that the new tuple unambiguously belongs to", and then immediately go on
> to list a whole bunch of exceptions. Maybe just remove "unambiguously".
This is fixed in v14 of the patch series.
> This happens very seldom, because you only get an incomplete split if
> you crash in the middle of a page split, and that should be very rare. I
> don't think we need to list more fine-grained conditions here, that just
> confuses the reader.
Fixed in v14.
> > /*
> > * _bt_useduplicatepage() -- Settle for this page of duplicates?
> So, this function is only used for legacy pg_upgraded indexes. The
> comment implies that, but doesn't actually say it.
I made that more explicit in v14.
> > /*
> > * Get tie-breaker heap TID attribute, if any. Macro works with both pivot
> > * and non-pivot tuples, despite differences in how heap TID is represented.
> > #define BTreeTupleGetHeapTID(itup) ...
I fixed up the comments above BTreeTupleGetHeapTID() significantly.
> The comment claims that "all pivot tuples must be as of BTREE_VERSION
> 4". I thought that all internal tuples are called pivot tuples, even on
> version 3.
In my mind, "pivot tuple" is a term that describes any tuple that
contains a separator key, which could apply to any nbtree version.
It's useful to have a distinct term (to not just say "separator key
tuple") because Lehman and Yao think of separator keys as separate and
distinct from downlinks. Internal page splits actually split *between*
a separator key and a downlink. So nbtree internal page splits must
split "inside a pivot tuple", leaving its separator on the left hand
side (new high key), and its downlink on the right hand side (new
minus infinity tuple).
Pivot tuples may contain a separator key and a downlink, just a
downlink, or just a separator key (sometimes this is implicit, and the
block number is garbage). I am particular about the terminology
because the pivot tuple vs. downlink vs. separator key thing causes a
lot of confusion, particularly when you're using Lehman and Yao (or
Lanin and Shasha) to understand how things work in Postgres.
We wan't to have a broad term that refers to the tuples that describe
the keyspace (pivot tuples), since it's often helpful to refer to them
collectively, without seeming to contradict Lehman and Yao.
> I think what this means to say is that this macro is only
> used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only
> have a heap TID in BTREE_VERSION 4 indexes.
My high level approach to pg_upgrade/versioning is for index scans to
*pretend* that every nbtree index (even on v2 and v3) has a heap
attribute that actually makes the keys unique. The difference is that
v4 gets to use a scantid, and actually rely on the sort order of heap
TIDs, whereas pg_upgrade'd indexes "are not allowed to look at the
heap attribute", and must never provide a scantid (they also cannot
use the !minusinfkey optimization, but this is only an optimization
that v4 indexes don't truly need). They always do the right thing
(move left) on otherwise-equal pivot tuples, since they have no
scantid.
That's why _bt_compare() can use BTreeTupleGetHeapTID() without caring
about the version the index uses. It might be NULL for a pivot tuple
in a v3 index, even though we imagine/pretend that it should have a
value set. But that doesn't matter, because higher level code knows
that !heapkeyspace indexes should never get a scantid (_bt_compare()
does Assert() that they got that detail right, though). We "have no
reason to peak", because we don't have a scantid, so index scans work
essentially the same way, regardless of the version in use.
There are a few specific cross-version things that we need think about
outside of making sure that there is never a scantid (and !minusinfkey
optimization is unused) in < v4 indexes, but these are all related to
unique indexes. "Pretending that all indexes have a heap TID" is a
very useful mental model. Nothing really changes, even though you
might guess that changing the classic "Subtree S is described by Ki <
v <= Ki+1" invariant would need to break code in
_bt_binsrch()/_bt_compare(). Just pretend that the classic invariant
was there since the Berkeley days, and don't do anything that breaks
the useful illusion on versions before v4.
> This macro (and many others in nbtree.h) is quite complicated. A static
> inline function might be easier to read.
I agree that the macros are complicated, but that seems to be because
the rules are complicated. I'd rather leave the macros in place, and
improve the commentary on the rules.
> 'xlmeta.version' is set incorrectly.
Oops. Fixed in v14.
> I find this comment difficult to read. I suggest rewriting it to:
>
> /*
> * The current Btree version is 4. That's what you'll get when you create
> * a new index.
I used your wording for this in v14, almost verbatim.
> Now that the index tuple format becomes more complicated, I feel that
> there should be some kind of an overview explaining the format. All the
> information is there, in the comments in nbtree.h, but you have to piece
> together all the details to get the overall picture. I wrote this to
> keep my head straight:
v14 uses your diagrams in nbtree.h, and expands some existing
discussion of INCLUDE indexes/non-key attributes/tuple format. Let me
know what you think.
--
Peter Geoghegan
From | Date | Subject | |
---|---|---|---|
Next Message | Paul Ramsey | 2019-03-04 21:59:25 | Re: Allowing extensions to supply operator-/function-specific info |
Previous Message | Justin Pryzby | 2019-03-04 21:45:01 | Re: Question about pg_upgrade from 9.2 to X.X |