From: Thom Brown <thom(at)linux(dot)com>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Effective storage of duplicates in B-tree index.
Date: 2016-01-28 17:03:02
Message-ID: CAA-aLv5aQezrfh2iJD97R1sQ6nqMsR2QZ_YAzNfAOkkMgO8CFQ@mail.gmail.com
Lists: pgsql-hackers
On 28 January 2016 at 16:12, Anastasia Lubennikova <
a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
>
> 28.01.2016 18:12, Thom Brown:
>
> On 28 January 2016 at 14:06, Anastasia Lubennikova <
> a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
>
>>
>> 31.08.2015 10:41, Anastasia Lubennikova:
>>
>> Hi, hackers!
>> I'm going to begin work on effective storage of duplicate keys in B-tree
>> index.
>> The main idea is to implement posting lists and posting trees for B-tree
>> index pages as it's already done for GIN.
>>
>> In a nutshell, effective storage of duplicates in GIN is organised as
>> follows.
>> The index stores a single index tuple for each unique key. That index
>> tuple points to a posting list, which contains pointers to heap tuples
>> (TIDs). If too many rows share the same key, multiple pages are allocated
>> for the TIDs, and these constitute the so-called posting tree.
>> You can find wonderful detailed descriptions in gin readme
>> <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
>> and articles <http://www.cybertec.at/gin-just-an-index-type/>.
>> This also makes it possible to apply a compression algorithm to the
>> posting list/tree and significantly decrease the index size. Read more in
>> the presentation (part 1)
>> <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
>>
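The scheme described above can be put into a back-of-envelope model. This is a toy sketch with assumed byte sizes (6-byte TIDs, 8-byte keys, 8-byte per-tuple headers), not PostgreSQL's actual on-disk layout:

```python
# Toy model of GIN-style posting lists vs. one index entry per heap row.
# Assumed sizes (illustrative only, not the real on-disk format):
#   TID = 6 bytes (4-byte block number + 2-byte offset)
#   key = 8 bytes, per-tuple header = 8 bytes
TID, KEY, HDR = 6, 8, 8

def naive_size(n_keys, rows_per_key):
    # one (header + key + TID) index tuple for every heap row
    return n_keys * rows_per_key * (HDR + KEY + TID)

def posting_list_size(n_keys, rows_per_key):
    # one (header + key) tuple per unique key, then rows_per_key TIDs
    return n_keys * (HDR + KEY + rows_per_key * TID)

for dup in (1, 10, 1000):
    n = naive_size(100, dup)
    p = posting_list_size(100, dup)
    print(f"{dup:>5} dups/key: naive={n:>9} posting={p:>9} ratio={p / n:.2f}")
```

With one row per key both forms are the same size; at 1000 duplicates per key the posting form approaches TID / (HDR + KEY + TID), roughly a quarter of the naive size, which is why this kind of layout pays off most for low-cardinality keys.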
>> Currently, a new B-tree index tuple must be inserted for each table row
>> that we index, which can cause a page split. Because of MVCC, even a
>> unique index can contain duplicates.
>> Storing duplicates in a posting list/tree helps to avoid superfluous
>> splits.
>>
>>
>> I'd like to share the progress of my work, so here is a WIP patch.
>> It provides effective duplicate handling using posting lists, the same
>> way GIN does.
>>
>> Layout of the tuples on the page is changed in the following way:
>> before:
>> TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
>> (ip_blkid, ip_posid) + key
>> with patch:
>> TID (N item pointers, posting list offset) + key, TID (ip_blkid,
>> ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
>>
>> It seems that backward compatibility works well without any changes. But
>> I haven't tested it properly yet.
>>
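Under similarly rough assumptions (8 KB pages, 4-byte line pointers; the overhead figures are illustrative guesses, not PostgreSQL's exact numbers), the layout change above translates into roughly three times as many duplicates fitting on a leaf page:

```python
# Rough capacity model for the tuple layout change described above.
# Assumptions (illustrative): 8192-byte page, ~200 bytes of page/special
# headers, 4-byte line pointer per tuple, 6-byte TID, 8-byte key.
PAGE, OVERHEAD, LP, TID, KEY = 8192, 200, 4, 6, 8
usable = PAGE - OVERHEAD

# before: every duplicate is its own "TID + key" tuple (plus line pointer)
before_cap = usable // (LP + TID + KEY)

# with patch: one "TID-header + key" tuple (plus line pointer) followed by
# a posting list of plain 6-byte TIDs for the remaining duplicates
posting_cap = 1 + (usable - (LP + TID + KEY)) // TID

print(f"duplicates per page before: {before_cap}")
print(f"duplicates per page with posting list: {posting_cap}")
```

That factor is in the same ballpark as the 214 MB vs. 87 MB sizes reported in the table below for heavily duplicated keys.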
>> Here are some test results. They are obtained by test functions
>> test_btbuild and test_ginbuild, which you can find in attached sql file.
>> i - number of distinct values in the index. So i=1 means that all rows
>> have the same key, and i=10000000 means that all keys are different.
>> The other columns contain the index size (MB).
>>
>> i           B-tree Old    B-tree New     GIN
>> 1           214.234375     87.7109375     10.2109375
>> 10          214.234375     87.7109375     10.71875
>> 100         214.234375     87.4375        15.640625
>> 1000        214.234375     86.2578125     31.296875
>> 10000       214.234375     78.421875     104.3046875
>> 100000      214.234375     65.359375      49.078125
>> 1000000     214.234375     90.140625     106.8203125
>> 10000000    214.234375    214.234375     534.0625
>> You can note that the last row contains the same index size for the old
>> and new B-tree, which is quite logical: there is no compression if all
>> the keys are distinct.
>> The other cases look really nice to me.
>> Next thing to note: I haven't implemented posting list compression yet,
>> so there is still potential to decrease the size of the compressed B-tree
>> further.
>>
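For reference, the compression GIN applies to its posting lists, and which could presumably be reused here, is varbyte encoding of the deltas between sorted TIDs. A simplified sketch of the idea (my own illustration, not the actual code in src/backend/access/gin):

```python
# Simplified varbyte delta encoding of a sorted TID list, in the spirit of
# GIN's posting-list compression (not the real PostgreSQL implementation).

def tid_to_int(blk, off):
    # pack (block, offset) into one integer; offsets fit in 11 bits here
    return blk * 2048 + off

def varbyte_encode(sorted_tids):
    out = bytearray()
    prev = 0
    for blk, off in sorted_tids:
        cur = tid_to_int(blk, off)
        delta = cur - prev
        prev = cur
        while True:                    # 7 bits per byte, high bit = "more"
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                break
    return bytes(out)

# 1000 duplicates packed densely on consecutive pages: deltas are tiny,
# so most TIDs shrink from 6 uncompressed bytes to a single encoded byte.
tids = [(i // 200, i % 200 + 1) for i in range(1000)]
encoded = varbyte_encode(tids)
print(f"uncompressed: {len(tids) * 6} bytes, varbyte: {len(encoded)} bytes")
```

Because duplicates tend to sit on nearby heap pages, the deltas stay small and the encoded list is a fraction of the raw 6 bytes per TID.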
>> I'm almost sure there are still some tiny bugs and missing functions,
>> but on the whole, the patch is ready for testing.
>> I'd like to get feedback from testing the patch on some real datasets.
>> Any bug reports and suggestions are welcome.
>>
>> Here are a couple of useful queries to inspect the data inside the index
>> pages:
>> create extension pageinspect;
>> select * from bt_metap('idx');
>> select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
>> n) as bt;
>> select n, bt.* from generate_series(1,1) as n, lateral
>> bt_page_items('idx', n) as bt;
>>
>> And at last, the list of items I'm going to complete in the near future:
>> 1. Add a storage parameter 'enable_compression' for the btree access
>> method, which specifies whether the index handles duplicates; the default
>> is 'off'.
>> 2. Bring back microvacuum functionality for compressed indexes.
>> 3. Improve insertion speed. Insertions became significantly slower with
>> the compressed btree, which is obviously not what we want.
>> 4. Clean up the code and comments, add related documentation.
>>
>
> This doesn't apply cleanly against current git head. Have you caught up
> past commit 65c5fcd35?
>
>
> Thank you for the notice. New patch is attached.
>
Thanks for the quick rebase.
Okay, a quick check with pgbench:
CREATE INDEX ON pgbench_accounts(bid);
Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms
Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)
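A quick sanity check of the size ratios above (sizes in MB, as reported):

```python
# Verify the patch/master index-size percentages quoted for each scale.
sizes = {100: (214, 87), 500: (1071, 437)}
for scale, (master, patched) in sizes.items():
    print(f"scale {scale}: {patched / master:.1%}")
```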
No performance issues from what I can tell.
I'm surprised that efficiencies can't be realised beyond this point. Your
results show a sweet spot at around 1000 / 10000000, with it getting
slightly worse beyond that. I kind of expected a lot of efficiency where
all the values are the same, but perhaps that's due to my lack of
understanding regarding the way they're being stored.
Thom