Re: WIP: Covering + unique indexes.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alexander Kuzmenkov <a(dot)kuzmenkov(at)postgrespro(dot)ru>
Subject: Re: WIP: Covering + unique indexes.
Date: 2018-04-07 02:12:53
Message-ID: CAH2-Wzk=xU1dDP7B_gq0S052OHeezEtibZ-BMF=YT-W1L+LWBw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Apr 6, 2018 at 11:08 AM, Alexander Korotkov
<a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> OK, incorporated into v15. I've also added sentence about pg_upgrade
> to the commit message.

I will summarize my feelings on this patch. I endorse committing the
patch, because I think that the benefits of committing it now
noticeably outweigh the costs. I have various caveats about pushing
the patch, but these are manageable.

Costs
=====

First, there is the question of risks, or costs. I think that this
patch has a negligible chance of being problematic in a way that will
become memorable. That seems improbable because the patch only really
changes the representation of what we're calling "pivot keys" (high
keys and internal page downlinks), which is something that VACUUM
doesn't care about. I see this patch as a special case of suffix
truncation, a technique that has been around since the 1970s. Although
you have to look carefully to see it, the amount of extra complexity
is pretty small, and the only place where a critical change is made is
during leaf page splits. As long as we get that right, everything else
should fall into place. There are no risks that I can see that are
related to concurrency, or that crop up when doing an anti-wraparound
VACUUM. There may be problems, but at least they won't be *pernicious*
problems that unravel over a long period of time.

The latest amcheck enhancement, and Alexander's recent changes to the
patch to make the on-disk representation explicit (not implicit)
should change things. We now have the tools to detect any corruption
problem that I can think of. For example, if there was some subtle
reason why assessing HOT safety broke, then we'd have a way of
mechanically detecting that without having to devise a custom test
(like the test Pavan happened to be using when the bug fixed by
2aaec654 was originally discovered). The lessons that I applied to
designing amcheck were in a few cases from actual experience with real
world bugs, including that 2aaec654 bug.

I hope that it goes without saying that I've also taken reasonable
steps to address all of these risks directly, by auditing code. And,
that this remains the first line of defense.

Here are the other specific issues that I see with the patch:

* It's possible that something was missed in the optimizer. I'm not sure.

I share the intuition that very little code is actually needed there,
but I'm far from the best person to judge whether or not some subtle
detail was missed.

* This seems out of date:

> + * NOTE: It is not crucial for reliability in present, but maybe
> + * it will be that in the future. Now the purpose is just to save
> + * more space on inner pages of btree.

* CheckIndexCompatible() didn't seem to get the memo about this patch.
Maybe just a comment?

* It's possible that there are some more bugs in places like
relcache.c, or deparsing, or pg_dump, or indexcmds.c; perhaps simple
omissions, like the one I just mentioned. If there are, I don't expect
them to be particularly serious, or to make me reassess my basic
position. But there could be.

* I was wrong to suggest _bt_isequal() has an assertion against
truncation. It is called for the highkey. Suggest you weaken the
assertion, so it only applies when the offset isn't P_HIKEY on
non-rightmost page.

* Suggest adding a comment above BTStackData, about bts_btentry + offset.

* Suggest breaking BTEntrySame() into 3 lines, not 2.

* This comment needs to be updated:

/* get high key from left page == lowest key on new right page */

Suggest "get high key from left page == lower bound for new right page".

* This comment needs to be updated:

13th bit: unused

Suggest "13th bit: AM-defined meaning"

* Suggest adding a note that the use of P_HIKEY here is historical,
since it isn't used to match downlinks:

/*
* Find the parent buffer and get the parent page.
*
* Oops - if we were moved right then we need to change stack item! We
* want to find parent pointing to where we are, right ? - vadim
* 05/27/97
*/
ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);

* I'm slightly concerned that this patch subtly breaks an optimization
within _bt_preprocess_keys(), or _bt_checkkeys(). I cannot find any
evidence of that, though, and I consider it unlikely, based on the
intuition that the simple Pathkey changes in the optimizer don't
provide the executor with a truly new set of constraints for index
scans. Also, if there was a problem here, it would be in the less
serious category of problems -- those that can't really affect anyone
not using the user-visible feature.

* The docs need some more polishing. Didn't spend very much time on this at all.

Benefits
========

There is also the matter of the benefits of this patch, that I think
are considerable, and far greater than they appear. This feature is a
great way to begin to add a broad variety of enhancements to nbtree
that we really need.

* The patch makes index-only scans a lot more compelling.

There are a couple of reasons why it's better to create indexes that
index perhaps as many as 4 or 7 columns to target index-only scans in
other database systems. I think that fan-out may be the main one. The
disadvantage that we have around HOT safety compared to other systems
seem less likely to be the problem when that many columns are
involved, and yet this is something that Oracle/SQL Server people do
frequently, and Postgres people don't really do at all. This one thing
that suffix truncation improves automatically, but INCLUDE indexes can
make that general situation a lot better than truncation alone ever
could.

If you have an index where most columns are INCLUDE columns, and
compare that to an index with the same attributes that are indexed in
the conventional way, then I believe that you will have far fewer
problems with index bloat in some important cases. Apart from
everything else, this provides us with the opportunity to learn how to
mitigate index bloat problems in real world conditions, even without
INCLUDE indexes. We need to get smarter about problems with index
bloat.

* Suffix truncation works on the same principle, and is enabled by
this work. It's prerequisite to making nbtree use the classic L&Y
approach, that assumes that all items in the index are unique.

We could just add heap TID to pivot tuples today, as an "extra"
column, while sorting on TID at the leaf level. This would make TID a
first class part of the key space -- a "unique-ifier", as L&Y
intended. But doing so naively would add enormous overhead, which
would simply be unacceptable. However, once we have suffix truncation,
the overhead is eliminated in virtually all cases. We get to move to
the classic L&Y invariant, simplifying the code, and we have a solid
basis for adding "retail index tuple deletion", which I believe is
almost essentially for zheap. There is a good chance that Postgres
B-Trees are the only implementation in the world that doesn't have
truly unique keys. The design of nbtree would become a whole lot more
elegant if we could restore the classic "Ki < v <= Ki+1" invariant, as
Vadim intended over 20 years ago.

Somebody has to bite the bullet and start changing the representation
of pivot tuples to get these benefits (and many more). This seems like
an ideal place to start that process. I think that what we have here
addresses concerns from Tom [1], in particular.

The patch has been marked "Ready for Committer". While this patch is
primarily the responsibility of the committer, presumably Teodor in
this case, I will take some of the responsibility for the patch after
commit. Certainly, because I see the patch as strategically important,
I am willing to spend quite a lot of time after feature freeze, to
make sure that it is in good shape. I have a general interest in
making sure that amcheck gains acceptance as a way of validating a
complicated patch like this one after commit.

[1] https://www.postgresql.org/message-id/15195.1490988897%40sss.pgh.pa.us

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2018-04-07 02:22:20 Re: PATCH: Configurable file mode mask
Previous Message David Rowley 2018-04-07 01:55:37 Re: [HACKERS] path toward faster partition pruning