From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2022-10-27 02:11:15
Message-ID: CAD21AoB+z3+T2EnOBVXok2aeRywrDZDFkwwEaWKZSrFz9s1+mw@mail.gmail.com
Lists: pgsql-hackers
On Wed, Oct 26, 2022 at 8:06 PM John Naylor
<john(dot)naylor(at)enterprisedb(dot)com> wrote:
>
> On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> > I've attached updated PoC patches for discussion and cfbot. From the
> > previous version, I mainly changed the following things:
> >
Thank you for the comments!
> > * Separate treatment of inner and leaf nodes
>
> Overall, this looks much better!
>
> > * Pack both the node kind and node count to an uint16 value.
>
> For this, I did mention a bitfield earlier as something we "could" do, but it wasn't clear we should. After looking again at the node types, I must not have thought through this at all. Storing one byte instead of four for the full enum is a good step, but saving one more byte usually doesn't buy anything because of padding, with a few exceptions like this example:
>
> node4: 4 + 4 + 4*8 = 40
> node4: 5 + 4+(7) + 4*8 = 48 bytes
>
> Even there, I'd rather not spend the extra cycles to access the members. And with my idea of decoupling size classes from kind, the variable-sized kinds will require another byte to store "capacity". Then, even if the kind gets encoded in a pointer tag, we'll still have 5 bytes in the base type. So I think we should assume 5 bytes from the start. (Might be 6 temporarily if I work on size decoupling first).
True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes. Encoding the kind in a pointer tag could be tricky given DSA
support, so for now I'm thinking of packing the node kind and the node
capacity class into a uint8.
>
> (Side note, if you have occasion to use bitfields again in the future, C99 has syntactic support for them, so no need to write your own shifting/masking code).
Thanks!
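For illustration, here is a minimal sketch of what I have in mind for the
node header, using the C99 bitfield syntax mentioned above to pack the kind
and the capacity (size) class into one byte. The field names and widths are
placeholders, not necessarily what the next patch will use:

    typedef struct rt_node
    {
        /* number of children (inner node) or values (leaf node) */
        uint16      count;

        /* node kind and capacity class packed into a single byte */
        uint8       kind:3;
        uint8       size_class:5;
    } rt_node;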
>
> > I've not done SIMD part seriously yet. But overall the performance
> > seems good so far. If we agree with the current approach, I think we
> > can proceed with the verification of decoupling node sizes from node
> > kind. And I'll investigate DSA support.
>
> Sounds good. I have some additional comments about v7, and after these are addressed, we can proceed independently with the above two items. Seeing the DSA work will also inform me how invasive pointer tagging will be. There will still be some performance tuning and cosmetic work, but it's getting closer.
>
I've made some progress on investigating DSA support. I've written a
draft patch for it, and the regression tests pass. I'll share it as a
separate patch for discussion along with the v8 radix tree patch.
While implementing DSA support, I realized that we may not need
pointer tagging to distinguish between a backend-local address and a
dsa_pointer. In order to get a backend-local address from a
dsa_pointer, we need to pass the dsa_area, like:

node = dsa_get_address(tree->dsa, node_dp);

As shown above, the dsa_area used by the shared radix tree is stored
in the radix_tree struct, so we can tell whether the radix tree is
shared by checking whether tree->dsa is NULL. That is, if the tree is
shared we treat a pointer to a radix tree node as a dsa_pointer, and
otherwise as a backend-local pointer. We don't need to encode anything
in the pointer.
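For example, the lookup could be hidden behind a small helper along these
lines (rt_pointer and rt_pointer_decode are placeholder names I'm using just
for this sketch):

    /*
     * Sketch: resolve a stored node pointer to a backend-local address.
     * In a shared tree the stored value is a dsa_pointer; otherwise it
     * is already a local pointer.
     */
    static inline rt_node *
    rt_pointer_decode(radix_tree *tree, rt_pointer ptr)
    {
        if (tree->dsa == NULL)
            return (rt_node *) ptr;
        else
            return (rt_node *) dsa_get_address(tree->dsa, (dsa_pointer) ptr);
    }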
> -------------------------
> 0001:
>
> +#ifndef USE_NO_SIMD
> +#include "port/pg_bitutils.h"
> +#endif
>
> Leftover from an earlier version?
>
> +static inline int vector8_find(const Vector8 v, const uint8 c);
> +static inline int vector8_find_ge(const Vector8 v, const uint8 c);
>
> Leftovers, causing compiler warnings. (Also see new variable shadow warning)
Will fix.
>
> +#else /* USE_NO_SIMD */
> + Vector8 r = 0;
> + uint8 *rp = (uint8 *) &r;
> +
> + for (Size i = 0; i < sizeof(Vector8); i++)
> + rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
> +
> + return r;
> +#endif
>
> As I mentioned a couple versions ago, this style is really awkward, and potential non-SIMD callers will be better off writing their own byte-wise loop rather than using this API. Especially since the "min" function exists only as a workaround for lack of unsigned comparison in (at least) SSE2. There is one existing function in this file with that idiom for non-assert code (for completeness), but even there, inputs of current interest to us use the uint64 algorithm.
Agreed. Will remove the non-SIMD code.
>
> 0002:
>
> + /* XXX: should not to use vector8_highbit_mask */
> + bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
>
> Hmm?
It's an outdated memo of mine; will remove.
>
> +/*
> + * Return index of the first element in chunks in the given node that is greater
> + * than or equal to 'key'. Return -1 if there is no such element.
> + */
> +static inline int
> +node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
>
> The caller must now have logic for inserting at the end:
>
> + int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
> + int16 count = NODE_GET_COUNT(n32);
> +
> + if (insertpos < 0)
> + insertpos = count; /* insert to the tail */
>
> It would be a bit more clear if node_*_search_ge() always returns the position we need (see the prototype for example). In fact, these functions are probably better named node*_get_insertpos().
Agreed.
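To illustrate the renamed contract, something like this (non-SIMD loop shown
only for clarity; the 'chunks' member name is assumed):

    /*
     * Sketch: return the position at which 'chunk' should be inserted;
     * this is node->count when all existing chunks are smaller.
     */
    static inline int
    node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
    {
        int     count = NODE_GET_COUNT(node);
        int     i;

        for (i = 0; i < count; i++)
        {
            if (node->chunks[i] >= chunk)
                break;
        }

        return i;               /* == count means insert at the tail */
    }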
>
> + if (likely(NODE_HAS_FREE_SLOT(n128)))
> + {
> + node_inner_128_insert(n128, chunk, child);
> + break;
> + }
> +
> + /* grow node from 128 to 256 */
>
> We want all the node-growing code to be pushed down to the bottom so that all branches of the hot path are close together. This provides better locality for the CPU frontend. Looking at the assembly, the above doesn't have the desired effect, so we need to write like this (also see prototype):
>
> if (unlikely( ! has-free-slot))
> grow-node;
> else
> {
> ...;
> break;
> }
> /* FALLTHROUGH */
Good point. Will change.
>
> + /* Descend the tree until a leaf node */
> + while (shift >= 0)
> + {
> + rt_node *child;
> +
> + if (NODE_IS_LEAF(node))
> + break;
> +
> + if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
> + child = rt_node_add_new_child(tree, parent, node, key);
> +
> + Assert(child);
> +
> + parent = node;
> + node = child;
> + shift -= RT_NODE_SPAN;
> + }
>
> Note that if we have to call rt_node_add_new_child(), each successive loop iteration must search it and find nothing there (the prototype had a separate function to handle this). Maybe it's not that critical yet, but something to keep in mind as we proceed. Maybe a comment about it to remind us.
Agreed. Currently rt_extend() is used to add upper nodes, but we
probably need another function to add lower nodes for this case.
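Roughly, once rt_node_search_inner() misses we could switch to a dedicated
path that keeps appending new child nodes down to the leaf level without
searching again. A sketch only (rt_extend_down, rt_alloc_init_node, and
rt_node_insert_child are placeholder names, and new nodes are assumed to
start with the smallest kind):

    static void
    rt_extend_down(radix_tree *tree, rt_node *node, uint64 key, int shift)
    {
        /* create the missing chain of nodes; no search is needed */
        while (shift > 0)
        {
            rt_node    *child;

            child = rt_alloc_init_node(tree, RT_NODE_KIND_4,
                                       shift == RT_NODE_SPAN);  /* leaf? */
            rt_node_insert_child(tree, node, key, child);
            node = child;
            shift -= RT_NODE_SPAN;
        }
        /* 'node' is now the leaf that will receive the value */
    }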
>
> + /* there is no key to delete */
> + if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
> + return false;
> +
> + /* Update the statistics */
> + tree->num_keys--;
> +
> + /*
> + * Delete the key from the leaf node and recursively delete the key in
> + * inner nodes if necessary.
> + */
> + Assert(NODE_IS_LEAF(stack[level]));
> + while (level >= 0)
> + {
> + rt_node *node = stack[level--];
> +
> + if (NODE_IS_LEAF(node))
> + rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
> + else
> + rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
> +
> + /* If the node didn't become empty, we stop deleting the key */
> + if (!NODE_IS_EMPTY(node))
> + break;
> +
> + /* The node became empty */
> + rt_free_node(tree, node);
> + }
>
> Here we call rt_node_search_leaf() twice -- once to check for existence, and once to delete. All three search calls are inlined, so this wastes space. Let's try to delete the leaf, return if not found, otherwise handle the leaf bookkeeping and loop over the inner nodes. This might require some duplication of code.
Agreed.
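So the delete path would look roughly like this (a sketch reusing the names
from v7; handling of the root is hand-waved):

    Assert(NODE_IS_LEAF(stack[level]));

    /* try to delete the key from the leaf; done if it isn't there */
    if (!rt_node_search_leaf(stack[level], key, RT_ACTION_DELETE, NULL))
        return false;

    /* Update the statistics */
    tree->num_keys--;

    /* walk up, removing entries for children that became empty */
    while (NODE_IS_EMPTY(stack[level]))
    {
        rt_free_node(tree, stack[level]);

        if (--level < 0)
            break;

        rt_node_search_inner(stack[level], key, RT_ACTION_DELETE, NULL);
    }

    return true;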
>
> +ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
>
> Spelling
Will fix.
>
> +static inline void
> +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> + uint8 *dst_chunks, rt_node **dst_children, int count)
> +{
> + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> +}
>
> gcc generates better code with something like this (but not hard-coded) at the top:
>
> if (count > 4)
> pg_unreachable();
Agreed.
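i.e. something like this (sketch; RT_NODE_4_MAX is a placeholder for however
the node-4 fanout constant ends up being spelled):

    static inline void
    chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
                              uint8 *dst_chunks, rt_node **dst_children,
                              int count)
    {
        /* hint to the compiler that 'count' is small and bounded */
        if (count > RT_NODE_4_MAX)
            pg_unreachable();

        memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
        memcpy(dst_children, src_children, sizeof(rt_node *) * count);
    }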
>
> This would have to change when we implement shrinking of nodes, but might still be useful.
>
> + if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
> + return false;
> +
> + return true;
>
> Maybe just "return rt_node_search_leaf(...)" ?
Agreed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com