Quick Links

Re: [PoC] Improve dead tuple storage for lazy vacuum

From:	Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PoC] Improve dead tuple storage for lazy vacuum
Date:	2022-07-05 07:33:17
Message-ID:	CAD21AoC8A0bUU+q7-k9hF7Dop3HuOUZR0x2X853oApEWErxU9A@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
> > diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
> > new file mode 100644
> > index 0000000000..bf87f932fd
> > --- /dev/null
> > +++ b/src/backend/lib/radixtree.c
> > @@ -0,0 +1,1763 @@
> > +/*-------------------------------------------------------------------------
> > + *
> > + * radixtree.c
> > + * Implementation for adaptive radix tree.
> > + *
> > + * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
> > + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
> > + * Neumann, 2013.
> > + *
> > + * There are some differences from the proposed implementation. For instance,
> > + * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit
> > + * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper.
> > + * Also, there is no support for path compression and lazy path expansion. The
> > + * radix tree supports fixed length of the key so we don't expect the tree level
> > + * wouldn't be high.
>
> I think we're going to need path compression at some point, fwiw. I'd bet on
> it being beneficial even for the tid case.
>
>
> > + * The key is a 64-bit unsigned integer and the value is a Datum.
>
> I don't think it's a good idea to define the value type to be a datum.

A datum value is convenient to represent both a pointer and a value so
I used it to avoid defining node types for inner and leaf nodes
separately. Since a datum could be 4 bytes or 8 bytes depending it
might not be good for some platforms. But what kind of aspects do you
not like the idea of using datum?

>
>
> > +/*
> > + * As we descend a radix tree, we push the node to the stack. The stack is used
> > + * at deletion.
> > + */
> > +typedef struct radix_tree_stack_data
> > +{
> > + radix_tree_node *node;
> > + struct radix_tree_stack_data *parent;
> > +} radix_tree_stack_data;
> > +typedef radix_tree_stack_data *radix_tree_stack;
>
> I think it's a very bad idea for traversal to need allocations. I really want
> to eventually use this for shared structures (eventually with lock-free
> searches at least), and needing to do allocations while traversing the tree is
> a no-go for that.
>
> Particularly given that the tree currently has a fixed depth, can't you just
> allocate this on the stack once?

Yes, we can do that.

>
> > +/*
> > + * Allocate a new node with the given node kind.
> > + */
> > +static radix_tree_node *
> > +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
> > +{
> > + radix_tree_node *newnode;
> > +
> > + newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
> > + radix_tree_node_info[kind].size);
> > + newnode->kind = kind;
> > +
> > + /* update the statistics */
> > + tree->mem_used += GetMemoryChunkSpace(newnode);
> > + tree->cnt[kind]++;
> > +
> > + return newnode;
> > +}
>
> Why are you tracking the memory usage at this level of detail? It's *much*
> cheaper to track memory usage via the memory contexts? Since they're dedicated
> for the radix tree, that ought to be sufficient?

Indeed. I'll use MemoryContextMemAllocated instead.

>
>
> > + else if (idx != n4->n.count)
> > + {
> > + /*
> > + * the key needs to be inserted in the middle of the
> > + * array, make space for the new key.
> > + */
> > + memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
> > + sizeof(uint8) * (n4->n.count - idx));
> > + memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
> > + sizeof(radix_tree_node *) * (n4->n.count - idx));
> > + }
>
> Maybe we could add a static inline helper for these memmoves? Both because
> it's repetitive (for different node types) and because the last time I looked
> gcc was generating quite bad code for this. And having to put workarounds into
> multiple places is obviously worse than having to do it in one place.

Agreed, I'll update it.

>
>
> > +/*
> > + * Insert the key with the val.
> > + *
> > + * found_p is set to true if the key already present, otherwise false, if
> > + * it's not NULL.
> > + *
> > + * XXX: do we need to support update_if_exists behavior?
> > + */
>
> Yes, I think that's needed - hence using bfm_set() instead of insert() in the
> prototype.

Agreed.

>
>
> > +void
> > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
> > +{
> > + int shift;
> > + bool replaced;
> > + radix_tree_node *node;
> > + radix_tree_node *parent = tree->root;
> > +
> > + /* Empty tree, create the root */
> > + if (!tree->root)
> > + radix_tree_new_root(tree, key, val);
> > +
> > + /* Extend the tree if necessary */
> > + if (key > tree->max_val)
> > + radix_tree_extend(tree, key);
>
> FWIW, the reason I used separate functions for these in the prototype is that
> it turns out to generate a lot better code, because it allows non-inlined
> function calls to be sibling calls - thereby avoiding the need for a dedicated
> stack frame. That's not possible once you need a palloc or such, so splitting
> off those call paths into dedicated functions is useful.

Thank you for the info. How much does using sibling call optimization
help the performance in this case? I think that these two cases are
used only a limited number of times: inserting the first key and
extending the tree.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Re: [PoC] Improve dead tuple storage for lazy vacuum at 2022-07-04 21:18:22 from Andres Freund

Responses

Re: [PoC] Improve dead tuple storage for lazy vacuum at 2022-07-05 08:09:23 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Masahiko Sawada	2022-07-05 07:33:29	Re: [PoC] Improve dead tuple storage for lazy vacuum
Previous Message	Dagfinn Ilmari Mannsåker	2022-07-05 07:31:27	Re: [PATCH] Add result_types column to pg_prepared_statements view