Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date: 2024-03-18 04:12:07
Message-ID: CAD21AoA_v8UN08dm==jRRD7cN3ObuowxUkXUbkN1eOK6486yMw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Mar 17, 2024 at 11:46 AM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Fri, Mar 15, 2024 at 4:36 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> > > > overwrite) the value, I think we should not expect that found is
> > > > always false.
> > >
> > > I find that a puzzling statement, since 1) it was designed for
> > > insert-only workloads, not actual overwrite IIRC and 2) the tests will
> > > now fail if the same block is set twice, since we just switched the
> > > tests to use a remnant of vacuum's old array. Having said that, I
> > > don't object to removing artificial barriers to using it for purposes
> > > not yet imagined, as long as test_tidstore.sql warns against that.
> >
> > I think that if it supports only insert-only workload and expects the
> > same block is set only once, it should raise an error rather than an
> > assertion. It's odd to me that the function fails only with an
> > assertion build assertions even though it actually works fine even in
> > that case.
>
> After thinking some more, I think you're right -- it's too
> heavy-handed to throw an error/assert and a public function shouldn't
> make assumptions about the caller. It's probably just a matter of
> documenting the function (and it's lack of generality), and the tests
> (which are based on the thing we're replacing).

Removed 'found' in 0003 patch.

>
> > As for test_tidstore you're right that the test code doesn't handle
> > the case where setting the same block twice. I think that there is no
> > problem in the fixed-TIDs tests, but we would need something for
> > random-TIDs tests so that we don't set the same block twice. I guess
> > it could be trivial since we can use SQL queries to generate TIDs. I'm
> > not sure how the random-TIDs tests would be like, but I think we can
> > use SELECT DISTINCT to eliminate the duplicates of block numbers to
> > use.
>
> Also, I don't think we need random blocks, since the radix tree tests
> excercise that heavily already.
>
> Random offsets is what I was thinking of (if made distinct and
> ordered), but even there the code is fairy trivial, so I don't have a
> strong feeling about it.

Agreed.

>
> > > Given the above two things, I think this function's comment needs
> > > stronger language about its limitations. Perhaps even mention that
> > > it's intended for, and optimized for, vacuum. You and I have long
> > > known that tidstore would need a separate, more complex, function to
> > > add or remove individual tids from existing entries, but it might be
> > > good to have that documented.
> >
> > Agreed.
>
> How about this:
>
> /*
> - * Set the given TIDs on the blkno to TidStore.
> + * Create or replace an entry for the given block and array of offsets
> *
> - * NB: the offset numbers in offsets must be sorted in ascending order.
> + * NB: This function is designed and optimized for vacuum's heap scanning
> + * phase, so has some limitations:
> + * - The offset numbers in "offsets" must be sorted in ascending order.
> + * - If the block number already exists, the entry will be replaced --
> + * there is no way to add or remove offsets from an entry.
> */
> void
> TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,

Looks good.

>
> I think we can stop including the debug-tid-store patch for CI now.
> That would allow getting rid of some unnecessary variables.

Agreed.

>
> + * Prepare to iterate through a TidStore. Since the radix tree is locked during
> + * the iteration, TidStoreEndIterate() needs to be called when finished.
>
> + * Concurrent updates during the iteration will be blocked when inserting a
> + * key-value to the radix tree.
>
> This is outdated. Locking is optional. The remaining real reason now
> is that TidStoreEndIterate needs to free memory. We probably need to
> say something about locking, too, but not this.

Fixed.

>
> + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> + * in one block. We return the block numbers in ascending order and the offset
> + * numbers in each result is also sorted in ascending order.
> + */
> +TidStoreIterResult *
> +TidStoreIterateNext(TidStoreIter *iter)
>
> The wording is a bit awkward.

Fixed.

>
> +/*
> + * Finish an iteration over TidStore. This needs to be called after finishing
> + * or when existing an iteration.
> + */
>
> s/existing/exiting/ ?
>
> It seems to say we need to finish after finishing. Maybe more precise wording.

Fixed.

>
> +/* Extract TIDs from the given key-value pair */
> +static void
> +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
> BlocktableEntry *page)
>
> This is a leftover from the old encoding scheme. This should really
> take a "BlockNumber blockno" not a "key", and the only call site
> should probably cast the uint64 to BlockNumber.

Fixed.

>
> + * tidstore.h
> + * Tid storage.
> + *
> + *
> + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
>
> Update year.

Updated.

>
> +typedef struct BlocktableEntry
> +{
> + uint16 nwords;
> + bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> +} BlocktableEntry;
>
> In my WIP for runtime-embeddable offsets, nwords needs to be one byte.
> That doesn't have any real-world affect on the largest offset
> encountered, and only in 32-bit builds with 32kB block size would the
> theoretical max change at all. To be precise, we could use in the
> MaxBlocktableEntrySize calculation:
>
> Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1);

I don't get this expression. Making the nwords one byte works well?
With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256
bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit
OS, respectively. One byte nwrods variable seems not to be sufficient
for both cases. Also, where does the expression "BITS_PER_BITMAPWORD *
PG_INT8_MAX - 1" come from?

>
> Tests: I never got rid of maxblkno and maxoffset, in case you wanted
> to do that. And as discussed above, maybe
>
> -- Note: The test code use an array of TIDs for verification similar
> -- to vacuum's dead item array pre-PG17. To avoid adding duplicates,
> -- each call to do_set_block_offsets() should use different block
> -- numbers.

I've added this comment on top of the .sql file.

I've attached the new patch sets. The summary of updates is:

- Squashed all updates of v72
- 0004 and 0005 are updates for test_tidstore.sql. Particularly the
0005 patch adds randomized TID tests.
- 0006 addresses review comments above.
- 0007 and 0008 patches are pgindent stuff.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v73-ART.tar.gz application/x-gzip 26.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2024-03-18 04:28:42 Re: Introduce XID age and inactive timeout based replication slot invalidation
Previous Message Amit Kapila 2024-03-18 04:07:27 Re: speed up a logical replica setup