Re: Parallel heap vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel heap vacuum
Date: 2025-01-03 23:38:28
Message-ID: CAD21AoCZeVOQfCz6MoAJJic38M9jdiszoAP5YFuTnJPUMwPc9Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 25, 2024 at 8:52 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
>
>
> On 12/19/24 23:05, Masahiko Sawada wrote:
> > On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >>
> >> On 12/13/24 00:04, Tomas Vondra wrote:
> >>> ...
> >>>
> >>> The main difference is here:
> >>>
> >>>
> >>> master / no parallel workers:
> >>>
> >>> pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
> >>>
> >>> 1 parallel worker:
> >>>
> >>> pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
> >>>
> >>>
> >>> Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
> >>> essentially just those with deleted tuples, which is ~1/20 of pages.
> >>> That's close to the 15x speedup.
> >>>
> >>> This effect is clearest without indexes, but it does affect even runs
> >>> with indexes - having to scan the indexes makes it much less pronounced,
> >>> though. However, these indexes are pretty massive (each about the same
> >>> size as the table, so several times larger in total). Chances are it'd
> >>> be clearer on realistic data sets.
> >>>
> >>> So the question is - is this correct? And if yes, why doesn't the
> >>> regular (serial) vacuum do that?
> >>>
> >>> There are some more strange things, though. For example, how come the avg
> >>> read rate is 0.000 MB/s?
> >>>
> >>> avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
> >>>
> >>> It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
> >>> not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
> >>> pages are in shared buffers (thanks to the DELETE earlier in that session).
> >>>
> >>
> >> OK, after looking into this a bit more I think the reason is rather
> >> simple - SKIP_PAGES_THRESHOLD.
> >>
> >> With serial runs, we end up scanning all pages, because even with an
> >> update every 5000 tuples, that's still only ~25 pages apart, well within
> >> the 32-page window. So we end up skipping no pages, and scan and vacuum
> >> everything.
> >>
> >> But parallel runs have this skipping logic disabled, or rather the logic
> >> that switches to sequential scans if the gap is less than 32 pages.
> >>
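
For reference, my understanding of the serial-path decision is roughly the
following. This is only a simplified sketch, not the actual vacuumlazy.c
code; all_visible_run_length() is a hypothetical stand-in for the
visibility map lookups:

#include "postgres.h"
#include "storage/block.h"      /* BlockNumber */

/*
 * Hypothetical helper: length of the run of all-visible blocks starting at
 * blkno, capped at limit.
 */
extern BlockNumber all_visible_run_length(BlockNumber blkno,
                                          BlockNumber limit);

#define SKIP_PAGES_THRESHOLD    ((BlockNumber) 32)

static BlockNumber
next_block_to_scan(BlockNumber blkno, BlockNumber rel_nblocks)
{
    BlockNumber run = all_visible_run_length(blkno, rel_nblocks);

    /*
     * Skip only when the run of all-visible pages is long enough; shorter
     * runs are read anyway so the I/O pattern stays sequential.  With an
     * update every ~5000 tuples the gaps are ~25 pages, below the 32-page
     * threshold, so nothing gets skipped in the serial case.
     */
    if (run >= SKIP_PAGES_THRESHOLD)
        return blkno + run;     /* jump over the all-visible run */

    return blkno;               /* scan this block */
}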
> >>
> >> IMHO this raises two questions:
> >>
> >> 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
> >> sequential scans if the pages are close enough. Maybe there is a reason
> >> for this difference? Workers can reduce the difference between random
> >> and sequential I/O, similarly to prefetching. But that just means the
> >> workers should use a lower threshold, e.g. as
> >>
> >> SKIP_PAGES_THRESHOLD / nworkers
> >>
> >> or something like that? I don't see this discussed in this thread.
> >
> > Each parallel heap scan worker allocates a chunk of blocks which is
> > 8192 blocks at maximum, so we would need to use the
> > SKIP_PAGES_THRESHOLD optimization within the chunk. I agree that we
> > need to evaluate the differences anyway. Will do the benchmark test
> > and share the results.
> >
>
> Right. I don't think this really matters for small tables, and for large
> tables the chunks should be fairly large (possibly up to 8192 blocks),
> in which case we could apply SKIP_PAGES_THRESHOLD just like in the serial
> case. There might be differences at boundaries between chunks, but that
> seems like a minor / expected detail. I haven't checked if the code
> would need to change / how much.
>
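
If we go in that direction, applying the threshold within each worker's
chunk could look roughly like the following (using the same declarations as
the sketch above). Again this is just a sketch of the idea:
chunk_start/chunk_nblocks, the nworkers scaling, and scan_and_vacuum_block()
are assumptions for illustration, not what the patch currently does:

static void
scan_chunk(BlockNumber chunk_start, BlockNumber chunk_nblocks, int nworkers)
{
    /* possibly a lower per-worker threshold, as suggested above */
    BlockNumber threshold = Max(1, SKIP_PAGES_THRESHOLD / nworkers);
    BlockNumber chunk_end = chunk_start + chunk_nblocks;
    BlockNumber blkno = chunk_start;

    while (blkno < chunk_end)
    {
        /* the run is clamped to the chunk, so chunk boundaries behave a
         * bit differently from the serial case */
        BlockNumber run = all_visible_run_length(blkno, chunk_end);

        if (run >= threshold)
        {
            blkno += run;               /* skip the all-visible run */
            continue;
        }

        scan_and_vacuum_block(blkno);   /* hypothetical per-block work */
        blkno++;
    }
}

The boundary effect you mention shows up here because a run cannot extend
past chunk_end, so a skippable run straddling two chunks may end up being
scanned.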
> >>
> >> 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
> >> storage. If I can get an order of magnitude improvement (or more than
> >> that) by disabling the threshold, and just doing random I/O, maybe
> >> it's time to adjust it a bit.
> >
> > Yeah, you've started a thread for this so let's discuss it there.
> >
>
> OK. FWIW as suggested in the other thread, it doesn't seem to be merely
> a question of VACUUM performance, as not skipping pages gives vacuum the
> opportunity to do cleanup that would otherwise need to happen later.
>
> If only for this reason, I think it would be good to keep the serial and
> parallel vacuum consistent.
>

I've not evaluated the SKIP_PAGES_THRESHOLD optimization yet, but I'd like
to share the latest patch set, as cfbot reports some failures. Comments
from Kuroda-san are also incorporated in this version. I'd also like to
share the performance test results I got with the latest patches; please
see the attached PDF.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
parallel_heap_vacuum_benchmark_v6.pdf application/pdf 37.8 KB
v6-0004-raidxtree.h-support-shared-iteration.patch application/octet-stream 16.8 KB
v6-0006-radixtree.h-Add-RT_NUM_KEY-API-to-get-the-number-.patch application/octet-stream 1.9 KB
v6-0005-Support-shared-itereation-on-TidStore.patch application/octet-stream 7.1 KB
v6-0008-Support-parallel-heap-vacuum-during-lazy-vacuum.patch application/octet-stream 22.3 KB
v6-0007-Add-TidStoreNumBlocks-API-to-get-the-number-of-bl.patch application/octet-stream 1.6 KB
v6-0003-Support-parallel-heap-scan-during-lazy-vacuum.patch application/octet-stream 74.5 KB
v6-0002-Remember-the-number-of-times-parallel-index-vacuu.patch application/octet-stream 6.8 KB
v6-0001-Move-lazy-heap-scanning-related-variables-to-stru.patch application/octet-stream 27.2 KB
