From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2021-07-26 14:01:46
Message-ID: CAD21AoDy2rmw7dGaGoFqkzxsd+kFg2ttz-bV1D8i=epCXrfWWA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 26, 2021 at 1:07 AM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
>
> Hi,
>
> I've dreamed of writing a more compact structure for vacuum for three
> years, but life didn't give me the time.
>
> Let me join the friendly competition.
>
> I've bet on the HATM approach: popcount-ing bitmaps for non-empty elements.

Thank you for proposing the new idea!

>
> Novelties:
> - 32 consecutive pages are stored together in a single sparse array
>   (called a "chunk").
>   A chunk contains:
>   - its number,
>   - a 4-byte bitmap of non-empty pages,
>   - an array of non-empty page headers, 2 bytes each.
>     A page header contains the offset of the page's bitmap in the
>     bitmap container (except if there is just one dead tuple in the
>     page; then it is written into the header itself),
>   - a container of concatenated bitmaps.
>
> I.e., page metadata overhead varies from 2.4 bytes per page (32 pages
> in a single chunk) to 18 bytes per page (1 page in a single chunk).
>
> - If a page's bitmap is sparse, i.e. contains a lot of "all-zero" bytes,
>   it is compressed by removing the zero bytes and indexing with a
>   two-level bitmap index (zero bytes in the first level are in turn
>   removed using the second level). This mostly matters for 32kB pages,
>   but let it stay since it is almost free.
>
> - If a page's bitmap contains a lot of "all-one" bytes, it is inverted
>   and then encoded as sparse.
>
> - Chunks are allocated with a custom "allocator" that has no
>   per-allocation overhead. That is possible because there is no need
>   to "free" individual allocations: the allocator is freed as a whole
>   at once.
>
> - The array of pointers to chunks is also bitmap indexed. This saves
>   CPU time when not every group of 32 consecutive pages has at least
>   one dead tuple, but costs time otherwise. Therefore an additional
>   optimization is added to quickly skip the lookup for the first
>   non-empty run of chunks.
>   (Ahhh, I believe this explanation is awful.)

It sounds better than my proposal.
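
To make sure I understand the chunk layout and the popcount-based
lookup, here is a rough sketch of how I read it (the struct, field
names, and field widths below are my guesses, not taken from your
implementation):

#include "postgres.h"
#include "port/pg_bitutils.h"

/* One 32-page chunk, as I understand the description; names are mine. */
typedef struct SVTChunk
{
    uint32      chunkno;        /* chunk number (block number / 32) */
    uint32      pagebitmap;     /* bit i set if page i of the chunk has
                                 * at least one dead tuple */
    uint16      pageheaders[FLEXIBLE_ARRAY_MEMBER];
    /* one header per set bit in pagebitmap: the offset of the page's
     * bitmap in the concatenated bitmap container, or, for a page with
     * a single dead tuple, the offset number itself */
} SVTChunk;

/*
 * Look up the header for "page" (0..31 within the chunk).  Since the
 * header array has entries only for non-empty pages, its index is the
 * popcount of the bitmap bits below the page's bit.
 */
static inline bool
svt_chunk_lookup(const SVTChunk *chunk, int page, uint16 *header)
{
    uint32      bit = UINT32_C(1) << page;

    if ((chunk->pagebitmap & bit) == 0)
        return false;           /* no dead tuples on this page */
    *header = chunk->pageheaders[pg_popcount32(chunk->pagebitmap & (bit - 1))];
    return true;
}

Is that roughly right? I guess the same popcount trick is what makes
the compressed (zero-byte-removed) page bitmaps cheap to look up as
well.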
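
And if I understand the custom "allocator" correctly, it is essentially
a bump allocator over large blocks: no per-allocation header, no
individual free, everything released at once. A minimal sketch of that
idea (again, my names, not your code):

#include "postgres.h"

typedef struct SVTBlock
{
    struct SVTBlock *next;
    char       *freeptr;        /* next unused byte in data[] */
    char       *endptr;         /* end of data[] */
    char        data[FLEXIBLE_ARRAY_MEMBER];
} SVTBlock;

typedef struct SVTAllocator
{
    SVTBlock   *blocks;         /* all blocks, kept only for bulk release */
    Size        blocksize;      /* usable bytes per block */
} SVTAllocator;

static void *
svt_alloc(SVTAllocator *a, Size size)
{
    SVTBlock   *blk = a->blocks;
    void       *ret;

    size = MAXALIGN(size);      /* sketch assumes size <= a->blocksize */
    if (blk == NULL || (Size) (blk->endptr - blk->freeptr) < size)
    {
        /* start a new block and push it onto the list */
        blk = (SVTBlock *) palloc(offsetof(SVTBlock, data) + a->blocksize);
        blk->next = a->blocks;
        blk->freeptr = blk->data;
        blk->endptr = blk->data + a->blocksize;
        a->blocks = blk;
    }
    ret = blk->freeptr;
    blk->freeptr += size;
    return ret;
}

static void
svt_free_all(SVTAllocator *a)
{
    while (a->blocks != NULL)
    {
        SVTBlock   *next = a->blocks->next;

        pfree(a->blocks);
        a->blocks = next;
    }
}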

>
> Andres Freund wrote 2021-07-20 02:49:
> > Hi,
> >
> > On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
> >> BTW is the implementation of the radix tree approach available
> >> somewhere? If so I'd like to experiment with that too.
> >>
> >> >
> >> > I have toyed with implementing adaptively large radix nodes like
> >> > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
> >> > gotten it quite working.
> >>
> >> That seems like a promising approach.
> >
> > I've since implemented some, but not all of the ideas of that paper
> > (adaptive node sizes, but not the tree compression pieces).
> >
> > E.g. for
> >
> > select prepare(
> >     1000000, -- max block
> >     20,      -- # of dead tuples per page
> >     10,      -- dead tuples interval within a page
> >     1        -- page interval
> > );
> >           attach     size   shuffled   ordered
> > array      69 ms   120 MB    84.87 s    8.66 s
> > intset    173 ms    65 MB    68.82 s   11.75 s
> > rtbm      201 ms    67 MB    11.54 s    1.35 s
> > tbm       232 ms   100 MB     8.33 s    1.26 s
> > vtbm      162 ms    58 MB    10.01 s    1.22 s
> > radix      88 ms    42 MB    11.49 s    1.67 s
> >
> > and for
> > select prepare(
> >     1000000, -- max block
> >     10,      -- # of dead tuples per page
> >     1,       -- dead tuples interval within a page
> >     1        -- page interval
> > );
> >
> >           attach     size   shuffled   ordered
> > array      24 ms    60 MB     3.74 s    1.02 s
> > intset     97 ms    49 MB     3.14 s    0.75 s
> > rtbm      138 ms    36 MB     0.41 s    0.14 s
> > tbm       198 ms   101 MB     0.41 s    0.14 s
> > vtbm      118 ms    27 MB     0.39 s    0.12 s
> > radix      33 ms    10 MB     0.28 s    0.10 s
> >
> > (this is an almost unfairly good case for radix)
> >
> > Running out of time to format the results of the other test cases
> > before I have to run, unfortunately. radix uses 42 MB in both test
> > cases 3 and 4.
>
> My results (Ubuntu 20.04, Intel Core i7-1165G7):
>
> Test 1.
>
> select prepare(1000000, 10, 20, 1); -- original
>
>           attach     size   shuffled
> array      29 ms    60 MB    93.99 s
> intset     93 ms    49 MB    80.94 s
> rtbm      171 ms    67 MB    14.05 s
> tbm       238 ms   100 MB     8.36 s
> vtbm      148 ms    59 MB     9.12 s
> radix     100 ms    42 MB    11.81 s
> svtm       75 ms    29 MB     8.90 s
>
> select prepare(1000000, 20, 10, 1); -- Andres's variant
>
>           attach     size   shuffled
> array      61 ms   120 MB   111.91 s
> intset    163 ms    66 MB    85.00 s
> rtbm      236 ms    67 MB    10.72 s
> tbm       290 ms   100 MB     8.40 s
> vtbm      190 ms    59 MB     9.28 s
> radix     117 ms    42 MB    12.00 s
> svtm       98 ms    29 MB     8.77 s
>
> Test 2.
>
> select prepare(1000000, 10, 1, 1);
>
>           attach     size   shuffled
> array      31 ms    60 MB     4.68 s
> intset     97 ms    49 MB     4.03 s
> rtbm      163 ms    36 MB     0.42 s
> tbm       240 ms   100 MB     0.42 s
> vtbm      136 ms    27 MB     0.36 s
> radix      60 ms    10 MB     0.72 s
> svtm       39 ms     6 MB     0.19 s
>
> (The bad radix result is probably due to the smaller cache in the
> notebook's CPU?)
>
> Test 3.
>
> select prepare(1000000, 2, 100, 1);
>
>           attach     size   shuffled
> array       6 ms    12 MB    53.42 s
> intset     23 ms    16 MB    54.99 s
> rtbm      115 ms    38 MB     8.19 s
> tbm       186 ms   100 MB     8.37 s
> vtbm      105 ms    59 MB     9.08 s
> radix      64 ms    42 MB    10.41 s
> svtm       73 ms    10 MB     7.49 s
>
> Test 4.
>
> select prepare(1000000, 100, 1, 1);
>
>           attach     size   shuffled
> array     304 ms   600 MB    75.12 s
> intset    775 ms    98 MB    47.49 s
> rtbm      356 ms    38 MB     4.11 s
> tbm       539 ms   100 MB     4.20 s
> vtbm      493 ms    42 MB     4.44 s
> radix     263 ms    42 MB     6.05 s
> svtm      360 ms     8 MB     3.49 s
>
> Therefore the Specialized Vacuum Tid Map (svtm) always consumes the
> least memory and is usually faster.

I'll experiment with the proposed ideas, including this one, in more
scenarios and share the results tomorrow.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
