From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date: 2019-07-06 23:08:21
Message-ID: CAH2-WzkTvMkGRQxmAL84PF3TUCWPotbfR8vFY95HpgvmeuMKFA@mail.gmail.com
Lists: pgsql-hackers
On Thu, Jul 4, 2019 at 5:06 AM Anastasia Lubennikova
<a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
> The new version of the patch is attached.
> This version is even simpler than the previous one,
> thanks to the recent btree design changes and all the feedback I received.
> I consider it ready for review and testing.
I took a closer look at this patch, and have some general thoughts on
its design, as well as some specific feedback on the implementation.
Preserving the *logical contents* of B-Tree indexes that use
compression is very important -- that should not change in a way that
outside code can notice. The heap TID itself should count as logical
contents here, since we want to be able to implement retail index
tuple deletion in the future. Even without retail index tuple
deletion, amcheck's "rootdescend" verification assumes that it can
find one specific tuple (which could now just be one specific "logical
tuple") using specific key values from the heap, including the heap
tuple's heap TID. This requirement makes things a bit harder for your
patch, because you have to deal with one or two edge-cases that you
currently don't handle: insertion of new duplicates that fall inside
the min/max range of some existing posting list. That should be rare
enough in practice, so the performance penalty won't be too bad. This
probably means that code within _bt_findinsertloc() and/or
_bt_binsrch_insert() will need to think about a logical tuple as a
distinct thing from a physical tuple, though that won't be necessary
in most places.
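To make that edge case concrete, here is a rough sketch of the test
involved (my illustration only, not code from your patch -- the function
name is made up, and I'm assuming the BTreeTupleGetMinTID()/
BTreeTupleGetMaxTID() accessors that appear in the assertion I quote
further down):

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/itemptr.h"

/*
 * Sketch only: detect a new duplicate whose heap TID falls strictly
 * inside the TID range covered by an existing posting list for the
 * same key.  When this returns true, the new tuple can't simply be
 * placed before or after the physical tuple; the posting list has to
 * be split apart (or the new TID merged into it) so that the logical
 * tuples stay in heap TID order.
 */
static bool
new_dup_lands_inside_posting_list(IndexTuple posting, ItemPointer newtid)
{
    return ItemPointerCompare(newtid, BTreeTupleGetMinTID(posting)) > 0 &&
           ItemPointerCompare(newtid, BTreeTupleGetMaxTID(posting)) < 0;
}

When that happens, the insert code has to descend into the posting list
rather than treating it as an opaque unit.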
The need to "preserve the logical contents" also means that the patch
will need to recognize when indexes are not safe as a target for
compression/deduplication (maybe we should call this feature
deduplication, so it's clear how it differs from TOAST?). For
example, if we have a case-insensitive ICU collation, then it is not
okay to treat an opclass-equal pair of text strings that use the
collation as having the same value when considering merging the two
into one. You don't actually do that in the patch, but you also don't
try to deal with the fact that the two strings are equal, and
so must have their final positions determined by the heap TID column
(deduplication within _bt_compress_one_page() must respect that).
Equal-but-distinct values may be a problem that's not
worth truly fixing, but it will be necessary to store metadata about
whether or not we're willing to do deduplication in the meta page,
based on operator class and collation details. That seems like a
restriction that we're just going to have to accept, though I'm not
too worried about exactly what that will look like right now. We can
work it out later.
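Just to show the general shape (a sketch of mine, not something in your
patch -- both the function and the placeholder collation rule are made
up), the metapage flag could be derived from a per-index test along
these lines at CREATE INDEX time:

#include "postgres.h"
#include "catalog/pg_collation.h"
#include "utils/rel.h"

/*
 * Sketch only: decide whether an index is a safe target for
 * deduplication.  The per-column rule below is just a placeholder;
 * a real implementation needs an answer that comes from the opclass
 * and collation, not a blanket test like this.
 */
static bool
index_is_dedup_safe(Relation index)
{
    int         natts = IndexRelationGetNumberOfKeyAttributes(index);
    int         attno;

    for (attno = 0; attno < natts; attno++)
    {
        Oid         collation = index->rd_indcollation[attno];

        /*
         * Placeholder rule: only allow deduplication when every key
         * column uses the default collation, on the assumption that
         * opclass equality then implies interchangeable representations.
         */
        if (OidIsValid(collation) && collation != DEFAULT_COLLATION_OID)
            return false;
    }
    return true;
}

Whatever the real rule turns out to be, the result can be computed once
and remembered in the metapage, so that _bt_compress_one_page() never
has to work it out on the fly.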
I think that a lack of care about the logical contents of indexes
already causes bugs, even with "safe for compression" indexes.
For example, I can sometimes see an assertion failure
within _bt_truncate(), at the point where we check if heap TID values
are safe:
/*
* Lehman and Yao require that the downlink to the right page, which is to
* be inserted into the parent page in the second phase of a page split be
* a strict lower bound on items on the right page, and a non-strict upper
* bound for items on the left page. Assert that heap TIDs follow these
* invariants, since a heap TID value is apparently needed as a
* tiebreaker.
*/
#ifndef DEBUG_NO_TRUNCATE
Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
                          BTreeTupleGetMinTID(firstright)) < 0);
...
This bug is not that easy to see, but it will happen with a big index,
even without updates or deletes. I think that this happens because
compression can allow the "logical tuples" to be in the wrong heap TID
order when there are multiple posting lists for the same value. As I
said, I think that it's necessary to see a posting list as being
comprised of multiple logical tuples in the context of inserting new
tuples, even when you're not performing compression or splitting the
page. I also see that amcheck's bt_index_parent_check() function
fails, though bt_index_check() does not fail when I don't use any of
its extra verification options. (You haven't updated amcheck, but I
don't think that you need to update it for these basic checks to
continue to work.)
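To illustrate what I think is happening (my own example, not output from
the patch): imagine that compression builds two posting lists for the
same key value, like this --

key = 42, posting list A: TIDs (10,1) .. (11,4)
key = 42, posting list B: TIDs (10,3) .. (12,1)

The two TID ranges overlap, so the "logical tuples" are no longer in
heap TID order across the two physical tuples. If a page split then
picks a split point that falls between A and B, the left page's maximum
heap TID is greater than the right page's minimum heap TID, which is
exactly the condition that the assertion above rejects.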
Other feedback on specific things:
* A good way to assess whether or not the "logical tuple versus
physical tuple" thing works is to make sure that amcheck's
"rootdescend" verification works with a variety of indexes. As I said,
it has the same requirements for nbtree as retail index tuple deletion
will.
* _bt_findinsertloc() should not call _bt_compress_one_page() for
!heapkeyspace (version 3) indexes -- the second call to
_bt_compress_one_page() should be removed.
* Why can't compression be used on system catalog indexes? I
understand that they are not a compelling case, but we tend to do
things the same way with catalog tables and indexes unless there is a
very good reason not to (e.g. HOT, suffix truncation). I see that the
tests fail when that restriction is removed, but I don't think that
that has anything to do with system catalogs. I think that that's due
to a bug somewhere else. Why have this restriction at all?
* It looks like we could be less conservative in nbtsplitloc.c to good
effect. We know for sure that a posting list will be truncated down to
one heap TID even in the worst case, so we can safely assume that the
new high key will be a lot smaller than the firstright tuple that it
is based on when it has a posting list. We only have to keep one TID.
This will allow us to leave more tuples on the left half of the page
in certain cases, further improving space utilization.
* Don't you need to update nbtdesc.c?
* Maybe we could do compression with unique indexes when inserting
values with NULLs? Note that we now treat an insertion of a tuple with
NULLs into a unique index as if it wasn't even a unique index -- see
the "checkingunique" optimization at the beginning of _bt_doinsert().
Having many NULL values in a unique index is probably fairly common.
* It looks like amcheck's heapallindexed verification needs to have
normalization added, to avoid false positives. This situation is
specifically anticipated by existing comments above
bt_normalize_tuple(). Again, being careful about "logical versus
physical tuple" seems necessary.
* Doesn't the nbtsearch.c/_bt_readpage() code that deals with
backwards scans need to return posting lists backwards, not forwards?
It seems like a good idea to try to "preserve the logical contents"
here too, just to be conservative.
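To sketch what I have in mind there (hypothetical code, not from your
patch -- posting_list_length(), posting_list_tid(), and
record_tid_for_scan() are placeholder names for whatever accessors you
actually provide):

#include "postgres.h"
#include "access/nbtree.h"

/*
 * Sketch only: during a backwards scan, walk a posting list from its
 * last TID to its first, so that the TIDs reach the scan in descending
 * heap TID order -- mirroring the order in which _bt_readpage() already
 * returns ordinary tuples when scanning backwards.
 */
static void
save_posting_list_backwards(BTScanOpaque so, IndexTuple itup)
{
    int         ntids = posting_list_length(itup);      /* hypothetical */
    int         i;

    for (i = ntids - 1; i >= 0; i--)
    {
        ItemPointer tid = posting_list_tid(itup, i);    /* hypothetical */

        /* hand this TID to the scan before the earlier ones in the list */
        record_tid_for_scan(so, itup, tid);             /* hypothetical */
    }
}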
Within nbtsort.c:
* Is the new code in _bt_buildadd() actually needed? If so, why?
* insert_itupprev_to_page_buildadd() is only called within nbtsort.c,
and so should be static. The name also seems very long.
* add_item_to_posting() is called within both nbtsort.c and
nbtinsert.c, and so should remain non-static, but should have a less generic
(and shorter) name. (Use the usual _bt_* style instead.)
* Is nbtsort.c the right place for these functions, anyway? (Maybe,
but maybe not, IMV.)
I ran pgindent on the patch and made some small manual whitespace
adjustments; the result is attached. There are no real changes, but some of
the formatting in the original version you posted was hard to read.
Please work off this version for your next revision.
--
Peter Geoghegan
Attachment: 0001-btree_compression_pg12_v1.patch-with-pg_indent-run.patch (application/octet-stream, 55.4 KB)