From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | John Naylor <john(dot)naylor(at)enterprisedb(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [PoC] Improve dead tuple storage for lazy vacuum |
Date: | 2022-07-05 07:33:17 |
Message-ID: | CAD21AoC8A0bUU+q7-k9hF7Dop3HuOUZR0x2X853oApEWErxU9A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
> > diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
> > new file mode 100644
> > index 0000000000..bf87f932fd
> > --- /dev/null
> > +++ b/src/backend/lib/radixtree.c
> > @@ -0,0 +1,1763 @@
> > +/*-------------------------------------------------------------------------
> > + *
> > + * radixtree.c
> > + * Implementation for adaptive radix tree.
> > + *
> > + * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
> > + * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
> > + * Neumann, 2013.
> > + *
> > + * There are some differences from the proposed implementation. For instance,
> > + * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit
> > + * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper.
> > + * Also, there is no support for path compression and lazy path expansion. The
> > + * radix tree supports fixed length of the key so we don't expect the tree level
> > + * wouldn't be high.
>
> I think we're going to need path compression at some point, fwiw. I'd bet on
> it being beneficial even for the tid case.
>
>
> > + * The key is a 64-bit unsigned integer and the value is a Datum.
>
> I don't think it's a good idea to define the value type to be a datum.
A datum value is convenient to represent both a pointer and a value so
I used it to avoid defining node types for inner and leaf nodes
separately. Since a datum could be 4 bytes or 8 bytes depending it
might not be good for some platforms. But what kind of aspects do you
not like the idea of using datum?
>
>
> > +/*
> > + * As we descend a radix tree, we push the node to the stack. The stack is used
> > + * at deletion.
> > + */
> > +typedef struct radix_tree_stack_data
> > +{
> > + radix_tree_node *node;
> > + struct radix_tree_stack_data *parent;
> > +} radix_tree_stack_data;
> > +typedef radix_tree_stack_data *radix_tree_stack;
>
> I think it's a very bad idea for traversal to need allocations. I really want
> to eventually use this for shared structures (eventually with lock-free
> searches at least), and needing to do allocations while traversing the tree is
> a no-go for that.
>
> Particularly given that the tree currently has a fixed depth, can't you just
> allocate this on the stack once?
Yes, we can do that.
>
> > +/*
> > + * Allocate a new node with the given node kind.
> > + */
> > +static radix_tree_node *
> > +radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
> > +{
> > + radix_tree_node *newnode;
> > +
> > + newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
> > + radix_tree_node_info[kind].size);
> > + newnode->kind = kind;
> > +
> > + /* update the statistics */
> > + tree->mem_used += GetMemoryChunkSpace(newnode);
> > + tree->cnt[kind]++;
> > +
> > + return newnode;
> > +}
>
> Why are you tracking the memory usage at this level of detail? It's *much*
> cheaper to track memory usage via the memory contexts? Since they're dedicated
> for the radix tree, that ought to be sufficient?
Indeed. I'll use MemoryContextMemAllocated instead.
>
>
> > + else if (idx != n4->n.count)
> > + {
> > + /*
> > + * the key needs to be inserted in the middle of the
> > + * array, make space for the new key.
> > + */
> > + memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
> > + sizeof(uint8) * (n4->n.count - idx));
> > + memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
> > + sizeof(radix_tree_node *) * (n4->n.count - idx));
> > + }
>
> Maybe we could add a static inline helper for these memmoves? Both because
> it's repetitive (for different node types) and because the last time I looked
> gcc was generating quite bad code for this. And having to put workarounds into
> multiple places is obviously worse than having to do it in one place.
Agreed, I'll update it.
>
>
> > +/*
> > + * Insert the key with the val.
> > + *
> > + * found_p is set to true if the key already present, otherwise false, if
> > + * it's not NULL.
> > + *
> > + * XXX: do we need to support update_if_exists behavior?
> > + */
>
> Yes, I think that's needed - hence using bfm_set() instead of insert() in the
> prototype.
Agreed.
>
>
> > +void
> > +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
> > +{
> > + int shift;
> > + bool replaced;
> > + radix_tree_node *node;
> > + radix_tree_node *parent = tree->root;
> > +
> > + /* Empty tree, create the root */
> > + if (!tree->root)
> > + radix_tree_new_root(tree, key, val);
> > +
> > + /* Extend the tree if necessary */
> > + if (key > tree->max_val)
> > + radix_tree_extend(tree, key);
>
> FWIW, the reason I used separate functions for these in the prototype is that
> it turns out to generate a lot better code, because it allows non-inlined
> function calls to be sibling calls - thereby avoiding the need for a dedicated
> stack frame. That's not possible once you need a palloc or such, so splitting
> off those call paths into dedicated functions is useful.
Thank you for the info. How much does using sibling call optimization
help the performance in this case? I think that these two cases are
used only a limited number of times: inserting the first key and
extending the tree.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
From | Date | Subject | |
---|---|---|---|
Next Message | Masahiko Sawada | 2022-07-05 07:33:29 | Re: [PoC] Improve dead tuple storage for lazy vacuum |
Previous Message | Dagfinn Ilmari Mannsåker | 2022-07-05 07:31:27 | Re: [PATCH] Add result_types column to pg_prepared_statements view |