Re: WIP: Covering + unique indexes.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP: Covering + unique indexes.
Date: 2018-04-01 22:18:14
Message-ID: CAH2-WzmDjdu+=LTs8fuKn5tSWfyzzF0TMPOLa=0qnWf2=nDdhw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Apr 1, 2018 at 10:09 AM, Alexander Korotkov
<a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>> So? GIN doesn't have the same legacy at all. The GIN posting lists
>> *don't* have regular heap TID pointers at all. They started out
>> without them, and still don't have them.
>
>
> Yes, GIN never stored heap TID pointers in t_tid of index tuple. But GIN
> assumes that heap TID pointer has at most 11 significant bits during
> posting list encoding.

I think that we should avoid assuming things, unless the cost of
representing them is too high, which I don't think applies here. The
more defensive general purpose code can be, the better.

I will admit to being paranoid here. But experience suggests that
paranoia is a good thing, if it isn't too expensive. Look at the
thread on XFS + fsync() for an example of things being wrong for a
very long time without anyone realizing, and despite the best efforts
of many smart people. As far as anyone can tell, PostgreSQL on Linux +
XFS is kinda, sorta broken, and has been forever. XFS was mature
before ext4 was, and is a popular choice, and yet this is the first
we're hearing about it being kind of broken. After many years.

Look at this check that made it into my amcheck patch, that was
committed yesterday:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/amcheck/verify_nbtree.c;h=a15fe21933b9a5b8baefedaa8f38e517d6c91877;hb=7f563c09f8901f6acd72cb8fba7b1bd3cf3aca8e#l745

As it says, nbtree is surprisingly tolerant of corrupt lp_len fields.
You may find it an interesting exercise to use pg_hexedit to corrupt
many lp_len fields in an index page. What's really interesting about
this is that it doesn't appear to break anything at all! We don't get
the length from there in most cases, so reads won't break at all. I
see that we use ItemIdGetLength() in a couple of rare cases (though
even those could be avoided) during a page split. You'd be lucky to
notice a problem if lp_len fields were regularly corrupt. When you
notice, it will probably have already caused big problems.

On a similar note, I've noticed that many of my experimental B-Tree
patches (that I never find time to finish) tend to almost work quite
early on, sometimes without my really understanding why. The whole L&Y
approach of recovering from problems that were detected (detecting
concurrent page splits, and moving right) makes the code *very*
forgiving. I hope that I don't sound trite, but everyone should try to
be modest about what they *don't* know when writing complex system
software with concurrency. It is not a platitude, even though it
probably seems that way. A tiny mistake can have big consequences, so
it's very important that we have a way to easily detect them after the
fact.

> I don't think we should use assertions, because they are typically disabled
> on
> production PostgreSQL builds. But we can have some explicit check in some
> common path. In the attached patch I've such check to _bt_compare().
> Probably,
> together with amcheck, that would be sufficient.

Good idea -- a "can't happen" check in _bt_compare seems better, which
I see here:

> diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
> index 51dca64e13..fcf9832147 100644
> --- a/src/backend/access/nbtree/nbtsearch.c
> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
> if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
> return 1;
>
> + /*
> + * Check tuple has correct number of attributes.
> + */
> + if (!_bt_check_natts(rel, page, offnum))
> + {
> + ereport(ERROR,
> + (errcode(ERRCODE_INTERNAL_ERROR),
> + errmsg("tuple has wrong number of attributes in index \"%s\"",
> + RelationGetRelationName(rel))));
> + }
> +
> itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

It seems like it might be a good idea to make this accept an
IndexTuple, though, to possibly save some work. Also, perhaps this
should be an unlikely() condition, if only because it makes the intent
clearer (might actually matter in a tight loop like this too, though).

Do you store an attribute number in the "minus infinity" item (the
leftmost one of internal pages)? I guess that that should be zero,
because it's totally truncated.

> OK, thank for the explanation. I agree that check of offset is redundant
> here.

Cool.

>> The fact is that that information can go out of date almost
>> immediately, whereas high keys usually last forever. The only reason
>> that there is a heap TID in the high key is because we'd have to add
>> special code to remove it; not because it has any real value. I find
>> it very hard to imagine it being used in a forensic situation. If you
>> actually wanted to do this, the key itself is probably enough -- you
>> probably wouldn't need the TID.
>
>
> I don't know, When I wrote my own implementation of B-tree and debug
> it, I found saving hikeys "as is" to be very valuable for debugging.

I would like to see your implementation at some point. That sounds interesting.

> However, B-trees in PostgreSQL are quite mature, and probably
> don't need so much debug information.

Today, the highkey at the leaf level is an exact copy of the right
sibling's first item immediately after the split. The absence of a
usable heap TID offset (due to using it for number of attributes in
high keys) is unlikely to make it harder to locate that right
sibling's first item (to get a full heap TID), which could have moved
a lot further right after the split, or even have been removed
entirely. It could now be ambiguous where it wouldn't have been before
in the event of duplicates, but it's unlikely. And when it does
happen, it's unlikely to matter.

We can still include the heap block number, I suppose. I think of the
highkey as only having one simple job -- separating the keyspace
between siblings. We actually have a very neat choke point to check
that it does that one job -- when a high key is generated for a page
split at the leaf level. If we were doing generic suffix truncation,
we'd add a test that made sure that the high key was strictly greater
than the last item on the left, and strictly less than the first item
on the right. As I said yesterday, I don't like how we allow a highkey
to be equal to both sides of the split, which goes against L&Y, and I
think that we would at least be strict about < and > for suffix
truncation.

The highkey's actual value can be invented, provided it does this one
simple job, which needs to be assessed only once at our "neat choke
point". Everything else falls into place afterwards, since that's
where teh downlink actually comes from. You can check it during a leaf
page split while debugging (that's the neat choke point). That's why
the high key doesn't seem very interesting from a debuggability
perspective.

>> Nobody asked you to write a suffix truncation patch. That has
>> complexity above and beyond what the covering index patch needs. I
>> just expect it to be compatible with an eventual suffix truncation
>> patch, which you've now shown is quite possible. It is clearly a
>> complimentary technique.
>
>
> OK, but change of on-disk tuple format also changes what people
> see in pageinspect. Right now, they see "1" as offset for tuples in intenal
> page and hikeys. After patch, they would see some large values
> (assuming we set some of hi bits) in offset. I'm not sure it's OK.
> We probably should change display of index tuples in pageinspect.

This reminds me of a discussion I had with Robert Haas about
pageinspect + t_infomask bits. Robert thought that we should show the
composite bits as single constants, where we do that (with things like
HEAP_XMIN_FROZEN). I disagreed, saying I think that we should just
show "the bits that are on the page", while also documenting that this
situation exists in pageinspect directly.

I think something similar here. I think it's okay to just show offset,
provided it is documented. We have a number of odd things within
nbtree that I actually saw to it were documented, such as the "minus
infinity" item on internal pages, which looks odd and out of places. I
remember Tatsuo Ishii asked about it before this happened. It seems
helpful to show what's really there, and offer guidance on how to
interpret it. I actually thought carefully about many things like this
for pg_hexedit, which tries to be very consistent and logical, uses
color to suggest meaning, and so on.

Anyway, that's what I think about it, though I wouldn't really care if
I lost that particular argument and we did something special with
internal page offset in pageinspect. It seems like a matter of
opinion, or aesthetics.

> I'm sorry, I do not understand. New version of amcheck determines
> the expected number of attributes and compares that to the numer of
> attributes stored in the offset number. But I can get *expected* number of
> attributes even wihtout storing them also in the offset number...

Maybe I was confused.

> I'd like to note that I really appreciate your attention to this patch
> as well as other patches.

Thanks. I would like to thank Anastasia and you for your patience and
perseverance, despite what I see as mistakes in how this project was
manged. I really want for it to be possible for there to be more
patches in the nbtree code, because they're really needed. That was a
big part of my motivation for writing amcheck, in fact. It's tedious
to link this patch to a bigger picture about what we need to do with
nbtree in the next 5 years, but I think that that's what it will take
to get this patch in. That's my opinion.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marko Tiikkaja 2018-04-01 22:19:28 Re: Diagonal storage model
Previous Message David Fetter 2018-04-01 21:54:20 Re: Diagonal storage model