Re: Parallel heap vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: John Naylor <johncnaylorls(at)gmail(dot)com>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel heap vacuum
Date: 2025-02-25 01:14:19
Message-ID: CAD21AoBXGdsyh3txq1vUjUSu29r=uEzVwhrcs9-g2aWX9w02QQ@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 23, 2025 at 8:51 PM John Naylor <johncnaylorls(at)gmail(dot)com> wrote:
>
> On Tue, Feb 18, 2025 at 1:11 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > Right. I'm inclined to support only the second heap pass as the first
> > step. If we support parallelism only for the second pass, it cannot
> > help speed up freezing the entire table in emergency situations, but
> > it would be beneficial for cases where a big table has a large amount
> > of spread garbage.
>
> I started looking at the most recent patch set for this, and while
> looking back over the thread, a couple random thoughts occurred to me:
>
> On Sat, Oct 26, 2024 at 2:26 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > - Patched
> > parallel 0: 53468.734 (15990, 28296, 9465)
> > parallel 1: 35712.613 ( 8254, 23569, 3700)
>
> These measurements were done before phases I and III were
> stream-ified, but even here those phases were already the quickest
> ones for a modest number of indexes and a single worker. I think it
> would have an even bigger difference when only a small percentage of
> pages need scanning/vacuuming, because it still has to read the entire
> index. It's a bit unfortunate that the simplest heap phase to
> parallelize is also the quickest to begin with.
>
> Pre-existing issue: We don't get anywhere near 2x improvement in phase
> II for 2 parallel index scans. We've known for a while that the shared
> memory TID store has more overhead than private memory, and here that
> overhead is about the same as the entirety of phase III with a single
> worker! It may be time to look into mitigations there, independently
> of this thread.

Thank you for the comments.

I've done simple performance benchmarks and attached the results. In
this test, I prepared a 3GB table with a single integer column. The
'fraction' column means the fraction of pages with deleted rows; for
example, fraction=1000 means that one in every 1000 pages has deleted
rows. The results show the duration of each phase in addition to the
total duration (for PATCHED only) and compare the total duration
between HEAD and PATCHED.

What I can see from these results is that, unfortunately, we might not
benefit much from parallelizing phase III. Although phase III got about
a 2x speedup in the best case, the total duration improved by only
about 10%. My analysis of these results matches what John mentioned:
phase III is already the fastest phase, accounting for only ~10% of the
total execution time, and the overhead of the shared TidStore offsets
its speedup.
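
For context, the overhead comes from how the two stores are created:
the shared TidStore is backed by a DSA area and the shared radix tree,
while a local one lives in a backend-local memory context. Roughly
(from memory, so the exact arguments may differ; max_bytes is just the
memory limit we pass in):

/* backend-local dead-TID storage: radix tree in local memory */
TidStore   *dead_items = TidStoreCreateLocal(max_bytes, true);

/* parallel vacuum: the store lives in a DSA area and uses the shared
 * radix tree, which adds DSA allocation and locking on top */
TidStore   *shared_dead_items =
    TidStoreCreateShared(max_bytes, LWTRANCHE_PARALLEL_VACUUM_DSA);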

> The same commit that made the parallel scanning patch more difficult
> should also reduce the risk of having a large amount of freezing work
> at once in the first place. (There are still plenty of systemic things
> that can go wrong, of course, and it's always good if unpleasant work
> finishes faster.)

I think that vacuum would still need to scan a large number of blocks
of the table, especially when it is very large and heavily modified.
Parallel heap vacuum (only phase I) would be beneficial in cases where
autovacuum cannot catch up. And we might want to consider using
parallel heap vacuum in autovacuum as well, integrating it with the
eager freeze scan.

> I seem to recall a proposal from David Rowley to (something like)
> batch gathering xids for visibility checks during executor scans, but
> if so I can't find it in the archives. It's possible some similar work
> might speed up heap scanning in a more localized fashion.

Interesting.

> On Tue, Nov 12, 2024 at 1:16 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Mon, Nov 11, 2024 at 5:08 AM Hayato Kuroda (Fujitsu)
> > <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> > > I wanted to know whether TidStoreBeginIterateShared() was needed. IIUC, the pre-existing API,
> > > TidStoreBeginIterate(), already accepts a shared TidStore. The only difference
> > > is whether the elog(ERROR) exists, but I wonder if it benefits others. Is there another
> > > reason that lazy_vacuum_heap_rel() uses TidStoreBeginIterateShared()?
> >
> > TidStoreBeginIterateShared() is designed for multiple parallel workers
> > to iterate a shared TidStore. During an iteration, parallel workers
> > share the iteration state and iterate the underlying radix tree while
> > taking appropriate locks. Therefore, it's available only for a shared
> > TidStore. This is required to implement the parallel heap vacuum,
> > where multiple parallel workers do the iteration on the shared
> > TidStore.
> >
> > On the other hand, TidStoreBeginIterate() is designed for a single
> > process to iterate a TidStore. It accepts even a shared TidStore as
> > you mentioned, but during an iteration there is no inter-process
> > coordination such as locking. When it comes to parallel vacuum,
> > supporting TidStoreBeginIterate() on a shared TidStore is necessary to
> > cover the case where we use only parallel index vacuum but not
> > parallel heap scan/vacuum. In this case, we need to store dead tuple
> > TIDs on the shared TidStore during heap scan so parallel workers can
> > use it during index vacuum. But it's not necessary to use
> > TidStoreBeginIterateShared() because only one (leader) process does
> > heap vacuum.
>
> The takeaway I got is that the word "shared" is used for two different
> concepts. Future readers are not going to think to look back at this
> email for explanation. At the least there needs to be a different word
> for the new concept.
>
> There are quite a lot of additions to radix_tree.h and tidstore.c to
> make it work this way, including a new "control object", new locks,
> new atomic variables. I'm not sure it's really necessary. Is it
> possible to just store the "most recent block heap-vacuumed" in the
> shared vacuum state? Then each backend would lock the TID store,
> iterate starting at the next block, and unlock the store when it has
> enough blocks. Sometimes my brainstorms are unworkable for some reason
> I failed to think about, but this way seems 95% simpler -- we would
> only need to teach the existing iteration machinery to take a "start
> key".

Interesting idea. While it might be worth evaluating, given that phase
III accounts for only a small portion of the total vacuum execution
time, it might be better to switch focus to parallelizing phase I.
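
That said, to make sure I understand it, here is a rough sketch of what
I think you are suggesting. TidStoreBeginIterate(), TidStoreIterateNext(),
TidStoreEndIterate(), and TidStoreLockExclusive()/TidStoreUnlock() exist
today; the "start key" variant TidStoreBeginIterateAt(), the shared
next_block field, and the surrounding names are hypothetical, purely for
illustration:

#include "access/tidstore.h"
#include "storage/block.h"

#define BLOCK_BATCH_SIZE 32

/*
 * Hypothetical shared state: the only coordination is the next block to
 * hand out (in reality this would live in the parallel vacuum DSM segment
 * alongside the existing shared state).
 */
typedef struct VacWorkerShared
{
    BlockNumber next_block;
} VacWorkerShared;

static void
heap_vacuum_worker(VacWorkerShared *shared, TidStore *dead_items)
{
    for (;;)
    {
        BlockNumber blocks[BLOCK_BATCH_SIZE];
        int         nblocks = 0;
        TidStoreIter *iter;
        TidStoreIterResult *res;

        /* Claim the next batch of blocks under the store's lock. */
        TidStoreLockExclusive(dead_items);

        /*
         * Hypothetical "start key" variant of TidStoreBeginIterate() that
         * resumes the iteration at the given block.
         */
        iter = TidStoreBeginIterateAt(dead_items, shared->next_block);
        while (nblocks < BLOCK_BATCH_SIZE &&
               (res = TidStoreIterateNext(iter)) != NULL)
            blocks[nblocks++] = res->blkno;
        TidStoreEndIterate(iter);

        if (nblocks > 0)
            shared->next_block = blocks[nblocks - 1] + 1;

        TidStoreUnlock(dead_items);

        if (nblocks == 0)
            break;              /* no more blocks with dead TIDs */

        /* Vacuum the claimed blocks without holding the lock. */
        for (int i = 0; i < nblocks; i++)
        {
            /* ... remove the dead line pointers from blocks[i] ... */
        }
    }
}

The obvious cost is that each batch descends the tree again from the
start key, but with a reasonable batch size that should be cheap
compared with the heap I/O, and it would avoid most of the new control
object, locks, and atomic variables.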

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment: parallel_heap_vacuum_test_20250224.pdf (application/pdf, 61.4 KB)
