Re: [PROPOSAL] Effective storage of duplicates in B-tree index.

From: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PROPOSAL] Effective storage of duplicates in B-tree index.
Date: 2015-09-03 15:35:13
Message-ID: 55E868B1.3040205@postgrespro.ru
Lists: pgsql-hackers

01.09.2015 21:23, Peter Geoghegan:
> On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova
> <a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
>> Currently, a new B-tree index tuple must be inserted for each table
>> row that we index. This can cause a page split. Because of MVCC, even
>> a unique index can contain duplicates.
>> Storing duplicates in a posting list/tree helps to avoid superfluous
>> splits.
> I'm glad someone is thinking about this, because it is certainly
> needed. I thought about working on it myself, but there is always
> something else to do. I should be able to assist with review, though.
Thank you)
>> So it seems to be very useful improvement. Of course it requires a lot of
>> changes in B-tree implementation, so I need approval from community.
>>
>> 1. Compatibility.
>> It's important to save compatibility with older index versions.
>> I'm going to change BTREE_VERSION to 3.
>> And use new (posting) features for v3, saving old implementation for v2.
>> Any objections?
> It might be better to just have a flag bit for pages that are
> compressed -- there are IIRC 8 free bits in the B-Tree page special
> area flags variable. But no real opinion on this from me, yet. You
> have plenty of bitspace to work with to mark B-Tree pages, in any
> case.
>
Hmm... if we are talking about storing duplicates in posting lists (and
trees) as in GIN, I don't see how to apply that to some pages while not
applying it to others. See some notes below.
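
For reference, I understand your flag-bit suggestion as something like
the following sketch. BTP_HAS_POSTING and the (1 << 8) value are made up
by me and don't exist in nbtree.h today; the existing BTP_* bits stop at
(1 << 7), so that would be one of the free bits you mention:

/* Hypothetical: a new flag in the B-tree special-area flags. */
#define BTP_HAS_POSTING (1 << 8)    /* page stores posting lists */

    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

    if ((opaque->btpo_flags & BTP_HAS_POSTING) != 0)
    {
        /* interpret this page's tuples as posting lists */
    }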

>> 2. There are several tricks to handle non-unique keys in B-tree.
>> More info is in the btree README (chapter "Differences to the Lehman &
>> Yao algorithm").
>> In the new version they'll become useless. Am I right?
> I think that the L&Y algorithm makes assumptions for the sake of
> simplicity, rather than because they really believed that there were
> real problems. For example, they say that deletion can occur offline
> or something along those lines, even though that's clearly
> impractical. They say that because they didn't want to write a paper
> about deletion within B-Trees, I suppose.
>
> See also, my opinion of how they claim to not need read locks [1].
> Also, note that despite the fact that the GIN README mentions "Lehman
> & Yao style right links", it doesn't actually do the L&Y trick of
> avoiding lock coupling -- the whole point of L&Y -- so that remark is
> misleading. This must be why B-Tree has much better concurrency than
> GIN in practice.

Yes, thanks for the extensive explanation.
I mean tricks such as the move-right logic in _bt_findinsertloc(), for
example.

/*----------
* If we will need to split the page to put the item on this page,
* check whether we can put the tuple somewhere to the right,
* instead. Keep scanning right until we
* (a) find a page with enough free space,
* (b) reach the last page where the tuple can legally go, or
* (c) get tired of searching.
* (c) is not flippant; it is important because if there are many
* pages' worth of equal keys, it's better to split one of the early
* pages than to scan all the way to the end of the run of equal keys
* on every insert. We implement "get tired" as a random choice,
* since stopping after scanning a fixed number of pages wouldn't work
* well (we'd never reach the right-hand side of previously split
* pages). Currently the probability of moving right is set at 0.99,
* which may seem too high to change the behavior much, but it does an
* excellent job of preventing O(N^2) behavior with many equal keys.
*----------
*/

If no two tuples share the same key, we needn't care about this at all.
In the "effective B-tree implementation" these steps could be skipped
entirely. That's why I want to change BTREE_VERSION.
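
Concretely, the step I'd like to make skippable is roughly this loop (a
simplified sketch of the current _bt_findinsertloc() logic, not a
verbatim excerpt):

/* While the chosen leaf page is full, keep moving right as long as the
 * new tuple could legally go on the right sibling, but "get tired" at
 * random with probability 0.01, as the comment above describes. */
while (PageGetFreeSpace(page) < itemsz)
{
    if (P_RIGHTMOST(lpageop) ||
        _bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
        random() <= (MAX_RANDOM_VALUE / 100))
        break;              /* split this page instead of moving right */

    /* ... lock the right sibling, release the current page, repeat ... */
}

With duplicates collapsed into a single posting list, runs of equal keys
spanning many pages would not arise in the first place.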

> So I'm really talking about a slightly
> different thing -- prefix compression, rather than handling
> duplicates. Whether or not you should do prefix compression instead of
> deduplication is certainly not clear to me, but it should be
> considered. Also, I always imagined that prefix compression would use
> the highkey as the thing that is offset for each "real" IndexTuple,
> because it's there anyway, and that's simple. However, I suppose that
> that means that duplicate handling can't really work in a way that
> makes duplicates have a fixed cost, which may be a particularly
> important property to you.

You're right, those are two different techniques.
1. Effective storage of duplicates, which I propose, works with equal
keys and allows us to eliminate the repeated keys.
Index tuples are currently stored like this:

IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key) |
IndexTupleData + Attrs (key)

If all Attrs are equal, it seems reasonable not to repeat them, so we
can store them in the following structure:

MetaData + Attrs (key) | IndexTupleData | IndexTupleData | IndexTupleData

This is a posting list. It doesn't require significant changes to the
index page layout, because we can use an ordinary IndexTupleData for the
meta information. Each IndexTupleData has a fixed size, so it's easy to
handle the posting list as an array.
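
As a minimal sketch of that array access (the function name is mine and
purely hypothetical; IndexTupleData, IndexTupleSize and t_tid are the
existing definitions):

/* Hypothetical accessor: the key is stored once in the leading "meta"
 * tuple; the duplicates follow it as a dense array of fixed-size
 * IndexTupleData entries, each carrying only its heap TID. */
static inline ItemPointer
posting_list_get_tid(IndexTuple meta, int i)
{
    char   *postings = (char *) meta + IndexTupleSize(meta);

    return &((IndexTuple) (postings + i * sizeof(IndexTupleData)))->t_tid;
}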

2. Prefix compression handles distinct keys and compresses their common
prefixes somehow. I think it would require non-trivial changes to the
B-tree index tuple representation. Furthermore, any compression implies
extra computation. At the moment I don't have a clear idea of how to
implement this technique.
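
To illustrate the extra computation: every insertion would need at least
a per-tuple step like this toy helper (purely illustrative, not proposed
code):

/* Toy illustration: length of the byte-wise common prefix of two keys.
 * Prefix compression would have to compute something like this, and
 * reverse it, on every insert and scan. */
static int
common_prefix_len(const char *a, int alen, const char *b, int blen)
{
    int     n = Min(alen, blen);
    int     i;

    for (i = 0; i < n && a[i] == b[i]; i++)
        ;
    return i;
}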

> * Currently, B-Tree must be able to store at least 3 items on each
> page, for the benefit of the L&Y algorithm. You need room for 1
> "highkey", plus 2 downlink IndexTuples. Obviously an internal B-Tree
> page is redundant if you cannot get to any child page based on the
> scanKey value differing one way or the other (so 2 downlinks are a
> sensible minimum), plus a highkey is usually needed (just not on the
> rightmost page). As you probably know, we enforce this by making sure
> every IndexTuple is no more than 1/3 of the size that will fit.
That is the point where a too-large posting list would be transformed
into a posting tree. But I think that in the first patch I'll do it
another way: simply by splitting a long posting list into two lists of
appropriate length.
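
A minimal sketch of that fallback, assuming the array layout described
above (all names are mine and hypothetical):

/* Hypothetical: split an oversized posting list in half, so that each
 * resulting tuple stays under the 1/3-of-a-page limit you mention. */
static void
posting_list_split(IndexTupleData *entries, int ntotal,
                   IndexTupleData *left, int *nleft,
                   IndexTupleData *right, int *nright)
{
    *nleft = ntotal / 2;
    *nright = ntotal - *nleft;

    memcpy(left, entries, *nleft * sizeof(IndexTupleData));
    memcpy(right, entries + *nleft, *nright * sizeof(IndexTupleData));
}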

> * Since everything is aligned within B-Tree, it's probably worth
> considering the alignment boundaries when doing prefix compression, if
> you want to go that way. We can probably imagine a world where
> alignment is not required for B-Tree, which would work on x86
> machines, but I can't see it happening soon. It isn't worth
> compressing unless it compresses enough to cross an "alignment
> boundary", where we're not actually obliged to store as much data on
> disk. This point may be obvious, not sure.

That is another reason why I have doubts about prefix compression,
whereas effective duplicate storage doesn't have this problem.
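
To make the alignment point concrete (MAXALIGN is the existing macro;
the byte counts are just an example):

/* With 8-byte MAXALIGN, shaving a key from 37 bytes down to 33 saves
 * nothing on disk, because both round up to the same boundary: */
Size    raw_size        = MAXALIGN(37);     /* 40 bytes */
Size    compressed_size = MAXALIGN(33);     /* 40 bytes -- no space saved */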

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
