Re: [PoC] Improve dead tuple storage for lazy vacuum

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2023-12-04 08:21:16
Message-ID: CANWCAZac=pBePg3rhX8nXkUuaLoiAJJLtmnCfZsPEAS4EtJ=kg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:

> Since the variable-length values support is a big deal and would be
> related to API design I'd like to discuss the API design first.

Thanks for the fine summary of the issues here.

[Swapping this back in my head]

> RT_VALUE_TYPE
> RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found);
> or for variable-length value support,
> RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
>
> If an entry already exists, return its pointer and set "found" to
> true. Otherwize, insert an empty value with sz bytes, return its
> pointer, and set "found" to false.

> ---
> bool
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> or for variable-length value support,
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
>
> If an entry already exists, update its value to 'value_p' and return
> true. Otherwise set the value and return false.

I'd have to double-check, but I think RT_SET is vestigial and I'm not
sure it has any advantage over RT_GET as I've sketched it out. I'm
pretty sure it's only there now because changing the radix tree
regression tests is much harder than changing TID store.

> Given variable-length value support, RT_GET() would have to do
> repalloc() if the existing value size is not big enough for the new
> value, but it cannot as the radix tree doesn't know the size of each
> stored value.

I think we have two choices:

- the value stores the "length". The caller would need to specify a
function to compute size from the "length" member. Note this assumes
there is an array. I think both aspects are not great.
- the value stores the "size". Callers that store an array (as
PageTableEntry's do) would compute length when they need to. This
sounds easier.

> Another idea is that the radix tree returns the pointer
> to the slot and the caller updates the value accordingly.

I did exactly this in v43 TidStore if I understood you correctly. If I
misunderstood you, can you clarify?

> But it means
> that the caller has to update the slot properly while considering the
> value size (embedded vs. single-leave value), which seems not a good
> idea.

For this optimization, callers will have to know about pointer-sized
values and treat them differently, but they don't need to know the
details about how where they are stored.

While we want to keep embedded values in the back of our minds, I
really think the details should be postponed to a follow-up commit.

> To deal with this problem, I think we can somewhat change RT_GET() API
> as follow:
>
> RT_VALUE_TYPE
> RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
>
> If the entry already exists, replace the value with a new empty value
> with sz bytes and set "found" to true. Otherwise, insert an empty
> value, return its pointer, and set "found" to false.
>
> We probably will find a better name but I use RT_INSERT() for
> discussion. RT_INSERT() returns an empty slot regardless of existing
> values. It can be used to insert a new value or to replace the value
> with a larger value.

For the case we are discussing, bitmaps, updating an existing value is
a bit tricky. We need the existing value to properly update it with
set or unset bits. This can't work in general without a lot of work
for the caller.

However, for vacuum, we have all values that we need up front. That
gives me an idea: Something like this insert API could be optimized
for "insert-only": If we only free values when we free the whole tree
at the end, that's a clear use case for David Rowley's proposed "bump
context", which would save 8 bytes per allocation and be a bit faster.
[1] (RT_GET for varlen values would use an aset context, to allow
repalloc, and nodes would continue to use slab).

[1] https://www.postgresql.org/message-id/flat/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w(at)mail(dot)gmail(dot)com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nikita Malakhov 2023-12-04 08:40:51 Re: Avoid detoast overhead when possible
Previous Message Pavel Borisov 2023-12-04 08:12:18 Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)