From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-01-11 03:12:54
Message-ID: CAFBsxsF8_ce1Vc9HTqjdh4KknPDekjMJTW6JPq99U+GhdzxuqA@mail.gmail.com
Lists: pgsql-hackers
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
wrote:
> It looks like there is no problem in terms of vacuum integration, although
> I've not fully tested it yet. TID store uses the radix tree as the main
> storage, and with the template radix tree, the data types for shared and
> non-shared will be different. TID store can have a union for the
> radix tree, and the structure would look as follows:
> /* Storage for Tids */
> union tree
> {
> local_radix_tree *local;
> shared_radix_tree *shared;
> };
We could possibly go back to using a common data type for this, but with
unused fields in each setting, as before. We would have to be more careful
of things like the 32-bit crash from a few weeks ago.
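For reference, a fleshed-out version of the quoted union might look like the sketch below. The type and field names here are stand-ins for illustration, not the patch's actual definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Stubs standing in for the two types the templated radix tree would
 * generate; the real definitions come from the template, not from here. */
typedef struct local_radix_tree  { int dummy; } local_radix_tree;
typedef struct shared_radix_tree { int dummy; } shared_radix_tree;

/* Storage for TIDs: one TidStore type whose backing tree is chosen at
 * creation time, either backend-local or in shared memory. */
typedef struct TidStore
{
    bool is_shared;
    union
    {
        local_radix_tree  *local;
        shared_radix_tree *shared;
    } tree;
} TidStore;
```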
> In the functions of TID store, we need to call either local or shared
> radix tree functions depending on whether TID store is shared or not.
> We need an if-branch for each key-value pair insertion, but I think it
> would not be a big performance problem in TID store use cases, since
> vacuum is an I/O intensive operation in many cases.
Also, the branch will be easily predicted. That was still true in earlier
patches, but with many more branches and fatter code paths.
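To make the dispatch concrete, here is a minimal sketch of the per-insertion branch. The function names are hypothetical stand-ins for the template-generated ones, with counters substituted so the routing can be exercised standalone:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Counters standing in for the template-generated set functions. */
static int local_sets, shared_sets;
static void local_rt_set(uint64_t key)  { (void) key; local_sets++; }
static void shared_rt_set(uint64_t key) { (void) key; shared_sets++; }

typedef struct TidStore { bool is_shared; } TidStore;

/* One easily-predicted branch per key-value insertion. */
static void tidstore_set(TidStore *ts, uint64_t key)
{
    if (ts->is_shared)
        shared_rt_set(key);
    else
        local_rt_set(key);
}
```

Since the flag never changes for the life of a given TidStore, the branch predictor should resolve this nearly for free, as noted above.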
> Overall, I think
> there is no problem and I'll investigate it in depth.
Okay, great. If the separate-functions approach turns out to be ugly, we
can always go back to the branching approach for shared memory. I think
we'll want to keep this as a template overall, at least to allow different
value types and to ease adding variable-length keys if someone finds a need.
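For readers unfamiliar with the template approach: like PostgreSQL's simplehash.h, the radix tree header can be included multiple times with different prefixes, generating independent types and functions each time. A toy illustration of the same token-pasting idea (the macro and names are invented for the example, not the patch's actual interface):

```c
#include <assert.h>

/* Toy version of the header-template technique: the same body is
 * instantiated twice with different prefixes, yielding two independent
 * sets of types and functions, as the templated radix tree would for
 * the local and shared cases. */
#define RT_DEFINE(PREFIX)                                          \
    typedef struct PREFIX##_tree { int nkeys; } PREFIX##_tree;     \
    static void PREFIX##_insert(PREFIX##_tree *t) { t->nkeys++; }

RT_DEFINE(local)
RT_DEFINE(shared)
```

In the real template, per-instantiation parameters (such as the value type) would also be set via macros before each inclusion.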
> Apart from that, I've been considering the lock support for shared
> radix tree. As we discussed before, the current usage (i.e., only
> parallel index vacuum) doesn't require locking support at all, so it
> would be enough to have a single lock for simplicity.
Right, that should be enough for PG16.
> If we want to
> use the shared radix tree for other use cases such as the parallel
> heap vacuum or the replacement of the hash table for shared buffers,
> we would need better lock support.
For future parallel pruning, I still think a global lock is "probably" fine
if the workers buffer in local arrays. Highly concurrent applications will
need additional work, of course.
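A sketch of what "workers buffer in local arrays" could look like: each worker collects TIDs privately and takes the single global tree lock only to flush a whole batch, so lock traffic scales with batches rather than with individual TIDs. All names and sizes here are illustrative, and a plain counter stands in for the real lock:

```c
#include <assert.h>
#include <string.h>

#define BATCH_SIZE 4

/* Stand-in for the single global lock; counts acquisitions so the
 * batching effect is visible. */
static int lock_acquisitions;
static void acquire_tree_lock(void) { lock_acquisitions++; }
static void release_tree_lock(void) { }

/* Stand-in for the shared radix tree's contents. */
static int shared_tids[64];
static int shared_ntids;

typedef struct WorkerBuf { int tids[BATCH_SIZE]; int n; } WorkerBuf;

static void worker_flush(WorkerBuf *buf)
{
    acquire_tree_lock();
    memcpy(shared_tids + shared_ntids, buf->tids, buf->n * sizeof(int));
    shared_ntids += buf->n;
    release_tree_lock();
    buf->n = 0;
}

static void worker_add(WorkerBuf *buf, int tid)
{
    buf->tids[buf->n++] = tid;
    if (buf->n == BATCH_SIZE)
        worker_flush(buf);      /* lock taken once per batch */
}
```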
> For example, if we want to support
> Optimistic Lock Coupling[1],
Interesting, from the same authors!
> we would need to change not only the node
> structure but also the logic, which would probably widen the gap
> between the code for the non-shared and shared radix trees. In this case,
> once we have a better radix tree optimized for shared case, perhaps we
> can replace the templated shared radix tree with it. I'd like to hear
> your opinion on this line.
I'm not in a position to speculate on how best to do scalable concurrency,
much less how it should coexist with the local implementation. It's
interesting that their "ROWEX" scheme gives up maintaining order in the
linear nodes.
> > One review point I'll mention: Somehow I didn't notice there is no use
> > for the "chunk" field in the rt_node type -- it's only set to zero and
> > copied when growing. What is the purpose? Removing it would allow the
> > smallest node to take up only 32 bytes with a fanout of 3, by eliminating
> > padding.
>
> Oh, I didn't notice that. The chunk field was originally used when
> redirecting the child pointer in the parent node from old to new
> (grown) node. When redirecting the pointer, since the corresponding
> chunk surely exists on the parent we can skip existence checks.
> Currently we use RT_NODE_UPDATE_INNER() for that (see
> RT_REPLACE_NODE()) but having a dedicated function to update the
> existing chunk and child pointer might improve the performance. Or
> reducing the node size by getting rid of the chunk field might be
> better.
I see. IIUC from a brief re-reading of the code, saving that chunk would
only save us from re-loading "parent->shift" from L1 cache and shifting the
key. The cycles spent doing that seem small compared to the rest of the
work involved in growing a node. Expressions like "if (idx < 0) return
false;" refer to an asserts-only variable, so in production builds, I
would hope that branch gets elided (I haven't checked).
I'm quite keen on making the smallest node padding-free (since we don't
yet have path compression or lazy path expansion), and this seems the way
to get there.
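To make the padding point concrete, here is a generic illustration; the field layout is invented for the example, not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* On a typical 64-bit ABI, a single stray byte ahead of pointer-aligned
 * members costs a whole alignment slot of padding. */
typedef struct node_with_extra
{
    uint64_t header;       /* stand-in for the packed header fields */
    uint8_t  extra;        /* a write-only byte, like the "chunk" field */
    /* padding inserted here so children[] is pointer-aligned */
    void    *children[3];  /* fanout of 3 */
} node_with_extra;

typedef struct node_without_extra
{
    uint64_t header;
    void    *children[3];
} node_without_extra;
```

Dropping the unused byte removes the padding along with it, which is how the smallest node can land exactly on 32 bytes.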
--
John Naylor
EDB: http://www.enterprisedb.com