From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | John Naylor <johncnaylorls(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> |
Subject: | Re: [PoC] Improve dead tuple storage for lazy vacuum |
Date: | 2024-01-22 03:27:36 |
Message-ID: | CAD21AoABMpjmicWE7E-9bApvFjU9A3BrUiQLvQFTmpba3bTkGg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Jan 19, 2024 at 6:48 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Fri, Jan 19, 2024 at 2:26 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Thu, Jan 18, 2024 at 1:30 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
> > > I'm not quite sure what the point of "num_items" is anymore, because
> > > it was really tied to the array in VacDeadItems. dead_items->num_items
> > > is essential to reading/writing the array correctly. If this number is
> > > wrong, the array is corrupt. There is no such requirement for the
> > > radix tree. We don't need to know the number of tids to add to it or
> > > do a lookup, or anything.
> >
> > True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking
> > we need to have the number of TIDs in a tidstore, especially in the
> > tidstore's control object.
>
> Hmm, it would be kind of sad to require explicit locking in tidstore.c
> is only for maintaining that one number at all times. Aside from the
> two ereports after an index scan / second heap pass, the only
> non-assert place where it's used is
>
> @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
> * Do index vacuuming (call each index's ambulkdelete routine), then do
> * related heap vacuuming
> */
> - if (dead_items->num_items > 0)
> + if (TidStoreNumTids(dead_items) > 0)
> lazy_vacuum(vacrel);
>
> ...and that condition can be checked by doing a single step of
> iteration to see if it shows anything. But for the ereport, my idea
> for iteration + popcount is probably quite slow.
Right.
On further thought, as you pointed out before, "num_tids" should not
be in tidstore in terms of integration with tidbitmap.c, because
tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no
longer accurate and useful. Similarly, looking at tidbitmap.c, it has
npages and nchunks but they will not be necessary in lazy vacuum use
case. Also, assuming that we support parallel heap pruning, probably
we need to somehow lock the tidstore while adding tids to the tidstore
concurrently by parallel vacuum worker. But in tidbitmap use case, we
don't need to lock the tidstore since it doesn't have multiple
writers. Given these facts, different statistics and different lock
strategies are required by different use case. So I think there are 3
options:
1. expose lock functions for tidstore and the caller manages the
statistics in the outside of tidstore. For example, in lazyvacuum.c we
would have a TidStore for tid storage as well as VacDeadItemsInfo that
has num_tids and max_bytes. Both are in LVRelState. For parallel
vacuum, we pass both to the workers via DSM and pass both to function
where the statistics are required. As for the exposed lock functions,
when adding tids to the tidstore, the caller would need to call
something like TidStoreLockExclusive(ts) that further calls
LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally.
2. add callback functions to tidstore so that the caller can do its
work while holding a lock on the tidstore. This is like the idea we
just discussed for radix tree. The caller passes a callback function
and user data to TidStoreSetBlockOffsets(), and the callback is called
after setting tids. Similar to option 1, the statistics need to be
stored in a different area.
3. keep tidstore.c and tidbitmap.c separate implementations but use
radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its
control object and doesn't have any lossy page support. On the other
hand, in tidbitmap.c we replace simplehash with radix tree. This makes
tidstore.c simple but we would end up having different data structures
for similar usage.
I think it's worth trying option 1. What do you think, John?
>
> > IIUC lpdead_items is the total number of LP_DEAD items vacuumed during
> > the whole lazy vacuum operation whereas num_items is the number of
> > LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle.
> > That is, after heap vacuum, the latter counter is reset while the
> > former counter is not.
> >
> > The latter counter is used in lazyvacuum.c as well as the ereport in
> > vac_bulkdel_one_index().
>
> Ah, of course.
>
> > Putting a copy of the key in BlocktableEntry's header is an
> > interesting idea. But the current debug code in the tidstore also
> > makes sure that the tidstore returns TIDs in the correct order during
> > an iterate operation. I think it still has a value and you can disable
> > it by removing the "#define TIDSTORE_DEBUG" line.
>
> Fair enough. I just thought it'd be less work to leave this out in
> case we change how locking is called.
>
> > > This week I tried an idea to use a callback there so that after
> > > internal unlocking, the caller received the value (or whatever else
> > > needs to happen, such as lookup an offset in the tid bitmap). I've
> > > attached a draft for that that passes radix tree tests. It's a bit
> > > awkward, but I'm guessing this would more closely match future
> > > internal atomic locking. Let me know what you think of the concept,
> > > and then do whichever way you think is best. (using v53 as the basis)
> >
> > Thank you for verifying this idea! Interesting. While it's promising
> > in terms of future atomic locking, I'm concerned it might not be easy
> > to use if radix tree APIs supports only such callback style.
>
> Yeah, it's quite awkward. It could be helped by only exposing it for
> varlen types. For simply returning "present or not" (used a lot in the
> regression tests), we could skip the callback if the data is null.
> That is all also extra stuff.
>
> > I believe
> > the caller would like to pass one more data along with val_data. For
>
> That's trivial, however, if I understand you correctly. With "void *",
> a callback can receive anything, including a struct containing
> additional pointers to elsewhere.
>
> > example, considering tidstore that has num_tids internally, it wants
> > to pass both a pointer to BlocktableEntry and a pointer to TidStore
> > itself so that it increments the counter while holding a lock.
>
> Hmm, so a callback to RT_SET also. That's interesting!
>
> Anyway, I agree it needs to be simple, since the first use doesn't
> even have multiple writers.
Right.
>
> > BTW in radixtree.h pg_attribute_unused() is used for some functions,
> > but is it for debugging purposes? I don't see why it's used only for
> > some functions.
>
> It was there to silence warnings about unused functions. I only see
> one remaining, and it's already behind a debug symbol, so we might not
> need this attribute anymore.
Okay.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Smith | 2024-01-22 03:31:04 | Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs |
Previous Message | Peter Smith | 2024-01-22 03:26:03 | Re: Add last_commit_lsn to pg_stat_database |