From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-06-28 13:09:59
Message-ID: CAFBsxsHWW4gX5ze46Q4aPUZGP0T9dSJaUDdq5DMT6rqyWU-W8g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> > I
> > suspect other optimizations would be worth a lot more than using AVX2:
> > - collapsing inner nodes
> > - taking care when constructing the key (more on this when we
> > integrate with VACUUM)
> > ...and a couple Andres mentioned:
> > - memory management: in
> > https://www.postgresql.org/message-id/flat/20210717194333.mr5io3zup3kxahfm%40alap3.anarazel.de
> > - node dispatch:
> > https://www.postgresql.org/message-id/20210728184139.qhvx6nbwdcvo63m6%40alap3.anarazel.de
> >
> > Therefore, I would suggest that we use SSE2 only, because:
> > - portability is very easy
> > - to avoid a performance hit from indirecting through a function pointer
>
> Okay, I'll try these optimizations and see if the performance becomes better.

FWIW, I think it's fine if we delay these until after committing a
good-enough version. The exception is key construction and I think
that deserves some attention now (more on this below).

> I've done benchmark tests while changing the node types. The code base
> is v3 patch that doesn't have the optimization you mentioned below
> (memory management and node dispatch) but I added the code to use SSE2
> for node-16 and node-32.

Great, this is helpful to visualize what's going on!
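For those following along who haven't read the patches: the SSE2 search over a node's child-key array can be sketched roughly like this (my own illustration, not the patch's actual code; assumes a GCC/Clang-style `__builtin_ctz`):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Return the index of 'key' among 16 child key bytes, or -1 if absent. */
static int
node16_search_sse2(const uint8_t *keys, uint8_t key)
{
    __m128i needle = _mm_set1_epi8((char) key);
    __m128i haystack = _mm_loadu_si128((const __m128i *) keys);
    int     mask = _mm_movemask_epi8(_mm_cmpeq_epi8(needle, haystack));

    return mask ? __builtin_ctz(mask) : -1;
}
```

A node-32 would do the same over two 16-byte chunks, which is part of why SSE2 vs. AVX2 makes little measurable difference here.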

> * sse2_4_16_48_256
> * nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
> * nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50
>
> * sse2_4_32_128_256
> * nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
> * nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1

> Observations are:
>
> In both test cases, there is not much difference between using AVX2
> and SSE2. The more node types, the more time it takes to load the
> data (see sse2_4_16_32_128_256).

Good to know. And as Andres mentioned in his PoC, more node types
would be a barrier for pointer tagging, since 32-bit platforms only
have two spare bits in the pointer.
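As a reminder of why those bits are scarce: with 4-byte-aligned node allocations, only the two low pointer bits are free, so tagging can distinguish at most four node kinds. A minimal sketch (hypothetical helper names, not from any patch):

```c
#include <stdint.h>

/* 4-byte alignment on 32-bit platforms leaves 2 spare low bits. */
#define RT_TAG_MASK ((uintptr_t) 0x3)

static inline void *
rt_tag_ptr(void *p, unsigned tag)
{
    return (void *) ((uintptr_t) p | (tag & RT_TAG_MASK));
}

static inline unsigned
rt_get_tag(void *p)
{
    return (unsigned) ((uintptr_t) p & RT_TAG_MASK);
}

static inline void *
rt_untag_ptr(void *p)
{
    return (void *) ((uintptr_t) p & ~RT_TAG_MASK);
}
```

With five or more node types, the tag no longer fits and every child access needs an extra dereference to find the node kind.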

> In dense case, since most nodes have around 100 children, the radix
> tree that has node-128 had a good figure in terms of memory usage. On

Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:

blockhi || blocklo || 9 bits of item offset

(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)
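Concretely, my reading of that scheme is something like the following sketch (not the patch's actual code; 9 bits suffice for any heap offset since MaxHeapTuplesPerPage is 291, and I assume a 64-bit leaf bitmap consuming the low 6 bits):

```c
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

#define OFFSET_BITS 9   /* enough for MaxHeapTuplesPerPage = 291 */

/* blockhi || blocklo || 9 bits of item offset */
static inline uint64_t
tid_to_key(BlockNumber blk, OffsetNumber off)
{
    return ((uint64_t) blk << OFFSET_BITS) | off;
}

/* With a 64-bit bitmap in each leaf, the low 6 bits select the bit
 * and the rest is the radix-tree search key. */
static inline uint64_t key_chunk(uint64_t key) { return key >> 6; }
static inline int      key_bit(uint64_t key)   { return (int) (key & 63); }
```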

We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.

Note some consequences: if the table has enough columns that far
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmaps in the leaves will be
sparser. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.

Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses ~1000 times more memory,
but the runtime is only 2-3 times longer. That's interesting to me.

To optimize for the sparse case, it seems to me that the key/value would be

blockhi || 9 bits of item offset || blocklo

I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases. I'm curious to hear your thoughts.
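To make the reordering concrete, here is a sketch of that alternative layout (again my own illustration; I'm assuming "blocklo" means the low byte of the block number, so that consecutive blocks with the same offset land next to each other under one leaf):

```c
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

#define OFFSET_BITS 9

/* blockhi || 9 bits of item offset || blocklo */
static inline uint64_t
tid_to_key_sparse(BlockNumber blk, OffsetNumber off)
{
    uint64_t blockhi = blk >> 8;
    uint64_t blocklo = blk & 0xFF;

    return (blockhi << (OFFSET_BITS + 8)) | ((uint64_t) off << 8) | blocklo;
}
```

Under this ordering, a run of blocks that each have (say) only offset 1 dead shares leaf bitmaps instead of each producing a nearly-empty one.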

> the other hand, the radix tree that doesn't have node-128 has better
> insertion performance. This is probably because we need to iterate
> over the 'isset' flags from the beginning of the array to find an
> empty slot when inserting new data. We do the same thing for
> node-48, but it fared better than node-128 since it holds at most 48
> entries.

I mentioned this in my diff, but for those following along: I think
we can improve that by iterating over the bytes of the bitmap; if a
byte is 0xFF, all 8 of its bits are set already, so we can skip ahead
and keep looking...
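Something along these lines (a sketch under the assumption that 'isset' is a plain byte array; hypothetical function name):

```c
#include <stdint.h>

/* Return the first clear bit in an 'isset' bitmap of nslots bits,
 * or -1 if the node is full.  Whole bytes equal to 0xFF are skipped
 * without examining their individual bits. */
static int
find_free_slot(const uint8_t *isset, int nslots)
{
    int nbytes = (nslots + 7) / 8;

    for (int i = 0; i < nbytes; i++)
    {
        if (isset[i] == 0xFF)
            continue;           /* all 8 slots taken; keep looking */

        for (int bit = 0; bit < 8; bit++)
        {
            int slot = i * 8 + bit;

            if (slot >= nslots)
                return -1;
            if ((isset[i] & (1u << bit)) == 0)
                return slot;
        }
    }
    return -1;
}
```

That turns the common "front of the array is full" case into one compare per 8 slots; a word-at-a-time variant would skip 64 slots per compare.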

> In terms of lookup performance, the results vary but I could not find
> any common pattern that makes the performance better or worse. Getting
> more statistics such as the number of each node type per tree level
> might help me.

I think that's a sign that the choice of node types might not be
terribly important for these two cases. That's good if that's true in
general -- a future performance-critical use of this code might tweak
things for itself without upsetting vacuum.

--
John Naylor
EDB: http://www.enterprisedb.com
