From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PROPOSAL] Effective storage of duplicates in B-tree index.
Date: 2015-09-01 18:23:44
Message-ID: CAM3SWZTQrSdzB_D_hEpuZyObNmr1VXjjYmQz_Jz0bLrgniRevA@mail.gmail.com
Lists: pgsql-hackers
On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova
<a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
> Now new B-tree index tuple must be inserted for each table row that we
> index.
> It can possibly cause page split. Because of MVCC even unique index could
> contain duplicates.
> Storing duplicates in posting list/tree helps to avoid superfluous splits.
I'm glad someone is thinking about this, because it is certainly
needed. I thought about working on it myself, but there is always
something else to do. I should be able to assist with review, though.
> So it seems to be very useful improvement. Of course it requires a lot of
> changes in B-tree implementation, so I need approval from community.
>
> 1. Compatibility.
> It's important to save compatibility with older index versions.
> I'm going to change BTREE_VERSION to 3.
> And use new (posting) features for v3, saving old implementation for v2.
> Any objections?
It might be better to just have a flag bit for pages that are
compressed -- there are IIRC 8 free bits in the B-Tree page special
area flags variable. But no real opinion on this from me, yet. You
have plenty of bitspace to work with to mark B-Tree pages, in any
case.
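Just to illustrate what I mean, something along these lines (sketch
only -- BTP_COMPRESSED is a made-up name, and I'm assuming one of the
currently unused bits in btpo_flags):

#include "postgres.h"
#include "access/nbtree.h"

/* hypothetical flag bit; picks a bit no existing BTP_* flag uses */
#define BTP_COMPRESSED		(1 << 15)	/* page items use posting-list format */

static bool
_bt_page_is_compressed(Page page)
{
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	return (opaque->btpo_flags & BTP_COMPRESSED) != 0;
}

static void
_bt_mark_page_compressed(Page page)
{
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	opaque->btpo_flags |= BTP_COMPRESSED;
}

The idea being that old-format pages could continue to go through the
existing code paths, and pages would only be reorganized as they're
touched.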
> 2. There are several tricks to handle non-unique keys in B-tree.
> More info in btree readme (chapter - Differences to the Lehman & Yao
> algorithm).
> In the new version they'll become useless. Am I right?
I think that the L&Y algorithm makes assumptions for the sake of
simplicity, rather than because they really believed that there were
real problems. For example, they say that deletion can occur offline
or something along those lines, even though that's clearly
impractical. They say that because they didn't want to write a paper
about deletion within B-Trees, I suppose.
See also my opinion of how they claim to not need read locks [1].
Also, note that despite the fact that the GIN README mentions "Lehman
& Yao style right links", it doesn't actually do the L&Y trick of
avoiding lock coupling -- the whole point of L&Y -- so that remark is
misleading. This must be why B-Tree has much better concurrency than
GIN in practice.
Anyway, the way that I always imagined this would work is as a layer
"below" the current implementation. In other words, you could easily
have prefix compression with a prefix that could end at a point within
a reference IndexTuple. It could be any arbitrary point in the second
or subsequent attribute, and would not "care" about the structure of
the IndexTuple when it comes to where attributes begin and end, etc.
(although, in reality, it probably would end up caring, because of the
complexity -- not caring is the ideal only, at least to me). As
Alexander pointed out, GIN does not care about composite keys.
That seems quite different to a GIN posting list (something that I
know way less about, FYI). So I'm really talking about a slightly
different thing -- prefix compression, rather than handling
duplicates. Whether or not you should do prefix compression instead of
deduplication is certainly not clear to me, but it should be
considered. Also, I always imagined that prefix compression would use
the highkey as the thing that is offset for each "real" IndexTuple,
because it's there anyway, and that's simple. However, I suppose that
that means that duplicate handling can't really work in a way that
makes duplicates have a fixed cost, which may be a particularly
important property to you.
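To make that a little more concrete, the compressed representation
might just record how many leading bytes each "real" IndexTuple shares
with the highkey, and store only the suffix. A sketch of the comparison
part (the function name is made up, and it deliberately treats the
tuples as opaque byte strings -- that's the "not caring" I was talking
about):

#include "postgres.h"
#include "access/itup.h"

/*
 * How many leading key bytes does 'itup' share with the page highkey?
 * Sketch only: ignores alignment, varlena headers, NULL bitmaps, etc.
 */
static Size
_bt_common_prefix_len(IndexTuple highkey, IndexTuple itup)
{
	char	   *a = (char *) highkey + sizeof(IndexTupleData);
	char	   *b = (char *) itup + sizeof(IndexTupleData);
	Size		max = Min(IndexTupleSize(highkey),
						  IndexTupleSize(itup)) - sizeof(IndexTupleData);
	Size		n = 0;

	while (n < max && a[n] == b[n])
		n++;

	return n;
}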
> 3. Microvacuum.
> Killed items are marked LP_DEAD and could be deleted from separate page at
> time of insertion.
> Now it's fine, because each item corresponds with separate TID. But posting
> list implementation requires another way. I've got two ideas:
> First is to mark LP_DEAD only those tuples where all TIDs are not visible.
> Second is to add LP_DEAD flag to each TID in posting list(tree). This way
> requires a bit more space, but allows to do microvacuum of posting
> list/tree.
No real opinion on this point, except that I agree that doing
something is necessary.
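For what it's worth, if you went with the second idea, I imagine each
TID in a posting list could carry its own dead bit without any extra
space, by stealing a bit from the offset number field -- real offset
numbers never use all 16 bits with any supported block size. A sketch
(POSTING_DEAD_BIT and the function names are made up):

#include "postgres.h"
#include "storage/itemptr.h"
#include "storage/off.h"

/* the high bit of ip_posid is never used by real offset numbers */
#define POSTING_DEAD_BIT	0x8000

static void
posting_mark_dead(ItemPointer tid)
{
	tid->ip_posid |= POSTING_DEAD_BIT;
}

static bool
posting_is_dead(ItemPointer tid)
{
	return (tid->ip_posid & POSTING_DEAD_BIT) != 0;
}

static OffsetNumber
posting_get_offset(ItemPointer tid)
{
	return (OffsetNumber) (tid->ip_posid & ~POSTING_DEAD_BIT);
}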
Couple of further thoughts on this general topic:
* Currently, B-Tree must be able to store at least 3 items on each
page, for the benefit of the L&Y algorithm. You need room for 1
"highkey", plus 2 downlink IndexTuples. Obviously an internal B-Tree
page is redundant if you cannot get to any child page based on the
scanKey value differing one way or the other (so 2 downlinks are a
sensible minimum), plus a highkey is usually needed (just not on the
rightmost page). As you probably know, we enforce this by making sure
every IndexTuple is no more than 1/3 of the space that will fit on a page.
You should start thinking about how to deal with this in a world where
the physical size could actually be quite variable. The solution is
probably to simply pretend that every IndexTuple is its original size.
This applies to both prefix compression and duplicate suppression, I
suppose.
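In code terms, I imagine that looks something like enforcing the
existing limit against a tuple's original, uncompressed size, which
you'd have to carry around somewhere (sketch only; the function name
and 'origsize' are whatever you'd end up calling them):

#include "postgres.h"
#include "access/nbtree.h"
#include "utils/rel.h"

/*
 * Enforce the 1/3-of-a-page limit against the original tuple size, even
 * when the representation actually stored is smaller.
 */
static void
_bt_check_third_limit(Relation rel, Page page, Size origsize)
{
	if (origsize > BTMaxItemSize(page))
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
						(unsigned long) origsize,
						(unsigned long) BTMaxItemSize(page),
						RelationGetRelationName(rel))));
}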
* Since everything is aligned within B-Tree, it's probably worth
considering the alignment boundaries when doing prefix compression, if
you want to go that way. We can probably imagine a world where
alignment is not required for B-Tree, which would work on x86
machines, but I can't see it happening soon. It isn't worth
compressing unless it compresses enough to cross an "alignment
boundary", where we're not actually obliged to store as much data on
disk. This point may be obvious, not sure.
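Concretely, I think the test is just whether the MAXALIGN'd size
actually shrinks (the function name is made up):

#include "postgres.h"

/*
 * Compression only buys anything if the smaller size lands in a smaller
 * MAXALIGN bucket than the original, since items are stored MAXALIGN'd.
 */
static bool
compression_is_worthwhile(Size origsize, Size compressedsize)
{
	return MAXALIGN(compressedsize) < MAXALIGN(origsize);
}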
--
Peter Geoghegan