Quick Links

Re: WIP: Covering + unique indexes.

From:	Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To:	Peter Geoghegan <pg(at)bowt(dot)ie>
Cc:	Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: WIP: Covering + unique indexes.
Date:	2018-04-04 22:09:11
Message-ID:	CAPpHfdtv7PXnt3Se4Etr94f13Q=VYBzXg5X0Vvdw7FtA9_aZ3A@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi!

Thank you for review! Revised patchset is attached.

On Wed, Apr 4, 2018 at 6:08 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Tue, Apr 3, 2018 at 7:02 AM, Alexander Korotkov
> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> > Great, I'm looking forward your feedback.
>
> I took a look at V11 (0001-Covering-core-v11.patch,
> 0002-Covering-btree-v11.patch, 0003-Covering-amcheck-v11.patch,
> 0004-Covering-natts-v11.patch) today.
>
> * What's a pivot tuple?
>
> This is the same thing as what I call a "separator key", I think --
> you're talking about the set of IndexTuples including all high keys
> (including leaf level high keys), as well as internal items
> (downlinks). I think that it's a good idea to have a standard word
> that describes this set of keys, to formalize the two categories
> (pivot tuples vs. tuples that point to the heap itself). Your word is
> just as good as mine, so we can go with that.
>

Good, let's use "pivot tuple" term.

Let's put this somewhere central. Maybe in the nbtree README, and/or
> nbtree.h. Also, verify_nbtree.c should probably get some small
> explanation of pivot tuples. offset_is_negative_infinity() is a nice
> place to mention pivot tuples, since that already has a bit of
> high-level commentary about them.
>

I've added some explanation to nbtree README, nbtree.h and
offset_is_negative_infinity().

> * Compiler warning:
>
> /home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:
> In function ‘index_create’:
> /home/pg/postgresql/root/build/../source/src/backend/catalog
> /index.c:476:45:
> warning: ‘opclassTup’ may be used uninitialized in this function
> [-Wmaybe-uninitialized]
> if (keyType == ANYELEMENTOID && opclassTup->opcintype == ANYARRAYOID)
> ^
> /home/pg/postgresql/root/build/../source/src/backend/catalog
> /index.c:332:19:
> note: ‘opclassTup’ was declared here
> Form_pg_opclass opclassTup;
> ^
>

Thank yo for pointing, fixed.

> * Your new amcheck tests should definitely use the new
> "heapallindexed" option. There were a number of bugs I can remember
> seeing in earlier versions of this patch that that would catch
> (probably not during regression tests, but let's at least do that
> much).
>

Good point. Tests with "heapallindexed" were added. I also find that it's
useful to
check both index built by sorting, and index built by insertions, because
there are
different ways of forming tuples.

* The modified amcheck contrib regression tests don't actually pass. I
> see these unexpected errors:
>
> 10037/2018-04-03 16:31:12 PDT ERROR: wrong number of index tuple
> attributes for index "bttest_multi_idx"
> 10037/2018-04-03 16:31:12 PDT DETAIL: Index tid=(290,2) points to
> index tid=(289,2) page lsn=0/162407A8.
> 10037/2018-04-03 16:31:12 PDT ERROR: wrong number of index tuple
> attributes for index "bttest_multi_idx"
> 10037/2018-04-03 16:31:12 PDT DETAIL: Index tid=(290,2) points to
> index tid=(289,2) page lsn=0/162407A8.
>

Right. Sorry, that appears that I've posted patch with non-working
regression tests.
Not they seem to pass.

* I see that we use "- 1" with attribute number, like this:
>
> > +/* Get number of attributes in B-tree index tuple */
> > +#define BtreeTupGetNAtts(itup, index) \
> > + ( \
> > + (itup)->t_info & INDEX_ALT_TID_MASK ? \
> > + ( \
> > + AssertMacro((ItemPointerGetOffsetNumber(&(itup)->t_tid) &
> BT_RESERVED_OFFSET_MASK) == 0), \
> > + ItemPointerGetOffsetNumber(&(itup)->t_tid) &
> BT_N_KEYS_OFFSET_MASK - 1 \
> > + ) \
> > + : \
> > + IndexRelationGetNumberOfAttributes(index) \
> > + )
>
> Is this left behind from before you decided to adopt
> INDEX_ALT_TID_MASK? Is it your intention here to encode
> InvalidOffsetNumber() without tripping up assertions? Or is it
> something else?
>
> Maybe we should follow the example of GinItemPointerGetOffsetNumber(),
> and use ItemPointerGetOffsetNumberNoCheck() instead of
> ItemPointerGetOffsetNumber(). What do you think? That would allow us
> to get rid of the -1 thing, which might be nice. Just because we use
> ItemPointerGetOffsetNumberNoCheck() in places that use an alternative
> offset representation does not mean we need to use it in existing
> places. If existing places had a regression tests failure because of
> this, that would probably be due to a real bug. No?
>

Ok. I've tried to remove both assertions and "+1" hack. That works
for me. However, I've to touch a lot of places, not sure if that's a
problem.

* ISTM that the "points to index tid=(289,2)" part of the message just
> shown would be a bit clearer if I didn't have to know that 2 actually
> means 1 when we talk about the pointed-to offset (yeah, it will
> probably become unclear in the future when we start using the reserved
> offset status bits, but why not make the low bits of offset
> simple/logical way?). Your new amcheck error message should spell it
> out (it should say the number of attributes indicated by the offset,
> if any) -- regardless of what we do about the "must apply - 1 to
> offset" question.
>

Right, since error is related to number of attributes, then we should report
observed number of attributes explicitly here.

* "Minus infinity" items do not have the new status bit
> INDEX_ALT_TID_MASK set in at least some cases. They should.
>
> * _bt_sortaddtup() should not do "trunctuple.t_info =
> sizeof(IndexTupleData)", since that destroys useful information. Maybe
> that's the reason for the last bug?
>
> * Ditto for _bt_pgaddtup().
>

Yes, "minus infinity" items hadn't new status bit INDEX_ALT_TID_MASK set.
And that's because errors in _bt_sortaddtup() and _bt_pgaddtup() that you
pointed. Fixed. thanks.

* Why expose _bt_pgaddtup() so that nbtsort.c/_bt_buildadd() can call
> it? The only reason we have _bt_sortaddtup() is because we cannot
> trust P_RIGHTMOST() within _bt_pgaddtup() when called in the context
> of CREATE INDEX (from nbtsort.c/_bt_buildadd()). There is no real
> change needed, because _bt_sortaddtup() knows that it's inserting on a
> non-rightmost page both without this patch, and when this patch needs
> to truncate and then add the high key back.
>
> It's clear that you can just use _bt_sortaddtup() (and leave
> _bt_pgaddtup() private) because _bt_sortaddtup() is only different to
> _bt_pgaddtup() when !P_ISLEAF(), but we only call _bt_pgaddtup() when
> P_ISLEAF(). Or have I missed something?
>

Agreed. I also see no point of exposing _bt_pgaddtup(). I've replaced it
with _bt_sortaddtup(), and it appears to work.

* For inserts, this patch performs an extra truncation step on the
> same high key that we'd use with a plain (non-covering/include) index.
> That's pretty clean. But it seems more complicated for
> nbtsort.c/_bt_buildadd(). I think that a comment should say that we
> cannot just rearrange item pointers for high key on the old page when
> we also truncate, because overwriting the P_HIKEY position ItemId with
> the old page's former final ItemId (whose tuple ended up becoming the
> first tuple on new/right page) fails to actually save any space. We
> need to truly shift around IndexTuples on the page in order to save
> space (both PageIndexTupleDelete() and PageAddItem() end up shifting
> both the ItemId array and some IndexTuple space).
>
> Also, maybe say that the performance here really isn't so bad, because
> we reclaim IndexTuple space close to the middle of the hole in the
> page with our PageIndexTupleDelete(), and then use almost the *same*
> space within PageAddItem(). There is not actually that much physical
> shifting around for IndexTuples. It turns out that it's not that
> different. (You can probably find a better, more succinct way of
> putting this -- I'm tired now.)
>

I wrote some comment there. Please, check it.

> * I suggest that you teach _bt_check_natts() to expect zero attributes
> for "minus infinity" items. It looks like amcheck contrib regression
> tests don't pass because you don't look for that (P_FIRSTDATAKEY() is
> the "minus infinity" item on internal pages).
>

Sure, thank you for catching.

> * bt_target_page_check() should also have a !P_ISLEAF() check, since
> with a covering index every tuple will have INDEX_ALT_TID_MASK. This
> should call _bt_check_natts() for each item, including the "minus
> infinity" items.
>

Yes, every item including the "minus infinity" should be checked for
number of attributes. However, I didn't get how that related to
!P_ISLEAF().
In order to check "minus infinity" item I've just pull _bt_check_natts() up
before offset_is_negative_infinity() check.

Regarding !P_ISLEAF(), I think we should check every item on both
leaf and non-leaf pages. I that is how code now works unless I'm missing
something.

* "minus infinity" items don't have the right number of attributes
> set, in at least some cases that I saw. The number matched other
> internal items, and wasn't 0 or whatever. Maybe the
> ItemPointerGetOffsetNumberNoCheck() idea would leave things so that it
> actually could be 0 safely, rather than natts + 1 as you said, which
> would be nice.)
>

Yes, "minus infinity" items didn't have number of attributes set, because
_bt_sortaddtup() and _bt_pgaddtup() didn't handle it as you pointed
above.

> * I would reorder the comment to match the order of the code:
>
> > + /*
> > + * Pivot tuples stored in non-leaf pages and hikeys of leaf pages
> should
> > + * have nkeyatts number of attributes. While regular tuples of leaf
> pages
> > + * should have natts number of attributes.
> > + */
> > + if (P_ISLEAF(opaque) && offnum >= P_FIRSTDATAKEY(opaque))
> > + return (BtreeTupGetNAtts(itup, index) == natts);
> > + else
> > + return (BtreeTupGetNAtts(itup, index) == nkeyatts);
>

Thanks for pointing. Since there are now three cases including handling of
"minus infinity" items, comment is now split to three.

* Please add BT_N_KEYS_OFFSET_MASK + INDEX_MAX_KEYS static assertion.
> Maybe add it to _bt_check_natts().
>

Done.

* README-SSI says:
>
> * The effects of page splits, overflows, consolidations, and
> removals must be carefully reviewed to ensure that predicate locks
> aren't "lost" during those operations, or kept with pages which could
> get re-used for different parts of the index.
>
> Do we need to worry about that here? I guess not, because this is just
> like having many duplicates. But a note just above the _bt_doinsert()
> call to CheckForSerializableConflictIn() might be a good idea.
>

I don't seerelations between this patchset and SSI. We just
change representation of some index tuples in pages. However,
we didn't change the the order of page modification, the order
of page lookup and so on. Yes, we change size of some tuples,
but B-tree already worked with tuples of variable sizes. So, the fact
that tuples now have different size shouldn't affect SSI. Right now,
I'm not sure if CheckForSerializableConflictIn() just above the
_bt_doinsert() is good idea. But even if so, I think that should be
a subject of separate patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
0001-Covering-core-v12.patch	application/octet-stream	109.1 KB
0002-Covering-btree-v12.patch	application/octet-stream	57.7 KB
0003-Covering-amcheck-v12.patch	application/octet-stream	5.3 KB
0004-Covering-natts-v12.patch	application/octet-stream	24.7 KB

In response to

Re: WIP: Covering + unique indexes. at 2018-04-04 03:08:32 from Peter Geoghegan

Responses

Re: WIP: Covering + unique indexes. at 2018-04-05 02:40:17 from Peter Geoghegan
Re: WIP: Covering + unique indexes. at 2018-04-05 14:02:35 from Erik Rijkers

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Thomas Munro	2018-04-04 22:14:24	Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Previous Message	Andres Freund	2018-04-04 21:44:42	Re: comments around heap_lock_tuple confus{ing,ed} around deleted tuples