Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-01-04 18:42:15
Message-ID: CAH2-Wz=fnMJ8-z4iAyL2X_x-giiOg82+RCRS-PXSeW3P+OM5tQ@mail.gmail.com
Lists: pgsql-hackers

Hi Alexander,

On Fri, Jan 4, 2019 at 7:40 AM Alexander Korotkov
<a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> I'm starting to look at this patchset. I'm not ready to post a detailed
> review yet, but I have a couple of questions.

Thanks for taking a look!

> Yes, it shouldn't be too hard, but it seems like we have to keep two
> branches of code for different handling of duplicates. Is that true?

Not really. If you take a look at v9, you'll see that the approach I've
taken is to make insertion scan keys aware of which rules apply (the
"heapkeyspace" field controls this). I think that there are about 5
"if" statements for that outside of amcheck. It's pretty manageable.
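
A rough sketch of the idea in C (the struct and function names here are
illustrative, loosely modeled on the patch's description rather than
its exact code):

#include <stdbool.h>

typedef struct HeapTid
{
    unsigned int    block;
    unsigned short  offset;
} HeapTid;

typedef struct InsertionScanKey
{
    bool    heapkeyspace;   /* do heap TIDs participate in comparisons? */
    /* comparison data for the user attributes would live here */
} InsertionScanKey;

static int
compare_heap_tid(const HeapTid *a, const HeapTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Tiebreaker applied only once all user attributes compare equal.  On a
 * pre-upgrade ("!heapkeyspace") index, equal user keys simply stay
 * equal, preserving the old behavior.
 */
static int
tiebreak_on_heap_tid(const InsertionScanKey *key,
                     const HeapTid *a, const HeapTid *b)
{
    if (!key->heapkeyspace)
        return 0;
    return compare_heap_tid(a, b);
}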

I like to imagine that the existing code already has unique keys, but
nobody ever gets to look at the final attribute. It works that way most
of the time -- the only exception is insertion with user keys that
aren't already unique. Note that the way we move left on equal pivot
tuples, rather than moving right (following the equal pivot's
downlink), wasn't invented by Postgres to deal with the lack of unique
keys. It's actually part of the Lehman and Yao design itself. Almost
all of the special cases are optimizations rather than truly necessary
infrastructure.
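
To make the descent rule concrete, here's a hedged sketch using plain
integer keys in place of real index tuples (a hypothetical helper, not
_bt_binsrch() itself). In this model, downlink i covers keys strictly
below pivots[i], so an equal key sends the search to the child on the
matching pivot's left:

#include <stddef.h>

static size_t
descend_move_left_on_equal(const int *pivots, size_t npivots, int key)
{
    size_t  lo = 0;
    size_t  hi = npivots;

    /* invariant: pivots[0..lo) < key, pivots[hi..npivots) >= key */
    while (lo < hi)
    {
        size_t  mid = lo + (hi - lo) / 2;

        if (pivots[mid] < key)  /* strict: an equal pivot sends us left */
            lo = mid + 1;
        else
            hi = mid;
    }

    /*
     * Follow downlink "lo".  If pivots[lo] == key, we did NOT take the
     * matching pivot's right-hand downlink -- we descend to its left.
     */
    return lo;
}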

> I didn't get the point of this paragraph. Might it happen that the
> first right tuple is under the tuple size restriction, but the new
> pivot tuple is beyond that restriction? If so, would we get an error
> because of a too-long pivot tuple? If not, I think this needs to be
> explained better.

The v9 version of the function _bt_check_third_page() shows what it
means (comments on this will be improved in v10, too). The old limit
of 2712 bytes still applies to pivot tuples, while a new, lower limit
of 2704 bytes applies to non-pivot tuples. This difference is necessary
because, in the worst case, an extra MAXALIGN() quantum could be needed
to add a heap TID to a pivot tuple during truncation. To users, the
limit is 2704 bytes, because that's the limit that actually needs to be
enforced during insertion.
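
The arithmetic behind the two numbers, assuming 8-byte MAXALIGN and a
6-byte ItemPointerData as on typical 64-bit builds (a sketch, not the
real _bt_check_third_page()):

#include <stdio.h>

#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
    (((LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((size_t) (MAXIMUM_ALIGNOF - 1)))

int
main(void)
{
    size_t  pivot_limit = 2712;         /* historical "1/3 of a page" cap */
    size_t  tid_overhead = MAXALIGN(6); /* aligned heap TID: 8 bytes */

    /*
     * Suffix truncation may need to append one aligned heap TID to a new
     * pivot, so non-pivot tuples must leave room for it up front.
     */
    printf("pivot tuple limit:     %zu\n", pivot_limit);                /* 2712 */
    printf("non-pivot tuple limit: %zu\n", pivot_limit - tid_overhead); /* 2704 */

    return 0;
}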

We never actually say "1/3 of a page means 2704 bytes" in the docs,
since the definition was always a bit fuzzy. There will need to be a
compatibility note in the release notes, though.
--
Peter Geoghegan
