From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
Cc: | Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: WIP: Covering + unique indexes. |
Date: | 2018-03-31 05:39:40 |
Message-ID: | CAH2-WznTAqhwwS6UGQpXCqiXxG3dFxK+zsT7AHzipHKqaCL20Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Mar 30, 2018 at 4:08 PM, Alexander Korotkov
<a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>> I'll try it. But I'm afraid that it's not as easy as you expect.
>
>
> So, I have some implementation of storage of number of attributes inside
> index tuple itself. I made it as additional patch on top of previous
> patchset.
> I attach the whole patchset in order to make commitfest.cputube.org happy.
Looks like 0004-* became mangled. Can you send a version that is not
mangled, please?
> I decided not to use 13th bit of IndexTuple flags. Instead I use only high
> bit
> of offset which is also always free on regular tuples. In fact, we already
> use
> assumption that there is at most 11 significant bits of index tuple offset
> in
> GIN (see ginpostinglist.c).
So? GIN doesn't have the same legacy at all. The GIN posting lists
*don't* have regular heap TID pointers at all. They started out
without them, and still don't have them.
> Anastasia also pointed that if we're going to do on-disk changes, they
> should be compatible not only with suffix truncation, but also with
> duplicate
> compression (which was already posted in thread [1]).
I definitely agree with that, and I think that Anastasia should push
for whatever will make future nbtree enhancements easier, especially
her own pending or planned enhancements.
> However, I think
> there is no problem. We can use one of 3 free bits in offset as flag that
> it's tuple with posting list. Duplicates compression needs to store
> number of posting list items and their offset in the tuple. Free bits
> left in item pointer after reserving 2 bits (1 flag of alternative meaning
> of offset and 1 flag of posting list) is far enough for that.
The issue that I see is that we could easily make this unambiguous,
free of any context, with a tiny bit more work. Why not just do it
that way?
Maybe it won't actually matter, but I see no reason not to do it, since we can.
> However, I find following arguments against implementing this feature
> in covering indexes.
>
> * We write number of attributes present into tuple. But how to prove that
> it's correct. I add appropriate checks to amcheck. But I don't think all
> the
> users runs amcheck frequent enough. Thus, in order to be sure that it's
> correct we should check number of attributes is written correct everywhere
> in the B-tree code.
Use an assertion. Problem solved.
I agree that people aren't using amcheck all that much, but give it
time. Oracle and SQL Server have had tools like amcheck for 30+ years.
We have had amcheck for one year.
> Without that we can face the situation that we've
> introduced new on-disk representation better to further B-tree enhancements,
> but actually it's broken. And that would be much worse than nothing.
> In order to check number of attributes everywhere in the B-tree code, we
> need to actually implement significant part of suffix compression. And I
> really think we shouldn't do this as part as covering indexes patch.
I don't think that you need to do that, actually. I'm not asking you
to go to those lengths. I have only asked that you make the on-disk
representation *compatible* with a future Postgres version that has
full suffix truncation (and other such enhancements, too). I care
about the on-disk representation more than the code.
> * Offset number is used now for parent refind (see BTEntrySame() macro).
> In the attached patch, this condition is relaxed. But I don't think I
> really like
> that. This shoud be thought out very carefully...
It's safe, although I admit that that's a bit hard to see.
Specifically, look at this code in _bt_insert_parent():
/*
* Find the parent buffer and get the parent page.
*
* Oops - if we were moved right then we need to change stack item! We
* want to find parent pointing to where we are, right ? - vadim
* 05/27/97
*/
ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
Vadim doesn't seem too sure of why he did it that way. What's clear is
that the offset on all internal pages is always P_HIKEY (that is, 1),
because this is the one and only place where new IndexTuples get
generated for internal pages. That's unambiguous. So how could
BTEntrySame() truly need to care about offset? How could there ever be
an internal page offset that wasn't just P_HIKEY? You can look
yourself, using pg_hexedit or pageinspect.
The comments above BTTidSame()/BTEntrySame() are actually wrong,
including "New Comments". Vadim wanted to make TIDs part of the
keyspace [1], beginning in around 1997. The idea was that we'd have
truly unique keys by including TID, as L&Y intended, but that never
happened. Instead, we got commit 9e85183bf in 2000, which among many
other things changed the L&Y invariant to deal with duplicates. I
think that Tom should have changed BTTidSame() to not care about
offset number in that same commit from 2000.
I actually think that Vadim was correct to want to make heap TID a
unique-ifier, and that that's the best long term solution [2].
Unfortunately, the code that he committed in the late 1990s didn't
really help -- how could it help without including the *entire* heap
TID? This BTTidSame() offset thing seems to be related to some weird
logic for duplicates that Tom killed in 9e85183bf, if it ever made
sense. Note that _bt_getstackbuf(), the only code that uses
BTEntrySame(), does not look at the offset directly -- because it's
always P_HIKEY.
Anyway...
> * Now, hikeys are copied together with original t_tid's. That makes it
> possible
> to find the origin of this hikey. If we override offset in t_tid, that
> becomes not
> always possible.
....that just leaves the original high key at the leaf level, as you
say here. You're right that there is theoretically a loss of forensic
information from actually storing something in the offset at the leaf
level, and storing something interesting in the offset during the
first phase of a page split (not the second, where the aforementioned
_bt_insert_parent() function gets called). I don't think it's worth
worrying about, though.
The fact is that that information can go out of date almost
immediately, whereas high keys usually last forever. The only reason
that there is a heap TID in the high key is because we'd have to add
special code to remove it; not because it has any real value. I find
it very hard to imagine it being used in a forensic situation. If you
actually wanted to do this, the key itself is probably enough -- you
probably wouldn't need the TID.
> * When index tuple is truncated, then pageinspect probably shouldn't show
> offset for it, because it meaningless. Should it rather show number of
> attributes in separate column? Anyway that should be part of suffix
> truncation
> patch. Not part of covering indexes patch, especially added at the last
> moment.
Nobody asked you to write a suffix truncation patch. That has
complexity above and beyond what the covering index patch needs. I
just expect it to be compatible with an eventual suffix truncation
patch, which you've now shown is quite possible. It is clearly a
complimentary technique.
> * I don't really see how does covering indexes without storing number of
> index tuple attributes in the tuple itself blocks future work on suffix
> truncation.
It makes it harder. Your new version gives amcheck a way of
determining the expected number of attributes. That's the main reason
to have it, more so than the suffix truncation issue. Suffix
truncation matters a lot too, though.
> So, taking into account the arguments of above, I propose to give up with
> idea to stick covering indexes and suffix truncation features together.
> That wouldn't accelerate appearance one feature after another, but rather
> likely would RIP both of them...
I think that the thing that's more likely to kill this patch is the
fact that after the first year, it only ever got discussed in the
final CF. That's not something that happened because of my choices. I
made several offers of my time. I did not create this urgency.
[1] https://www.postgresql.org/message-id/18788.963953289@sss.pgh.pa.us
[2] https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute
--
Peter Geoghegan
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Geoghegan | 2018-03-31 05:46:10 | Re: WIP: Covering + unique indexes. |
Previous Message | Amit Kapila | 2018-03-31 05:38:37 | Re: [HACKERS][PATCH] adding simple sock check for windows |