From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel heap vacuum
Date: 2025-01-03 23:38:28
Message-ID: CAD21AoCZeVOQfCz6MoAJJic38M9jdiszoAP5YFuTnJPUMwPc9Q@mail.gmail.com
Lists: pgsql-hackers
On Wed, Dec 25, 2024 at 8:52 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
>
>
> On 12/19/24 23:05, Masahiko Sawada wrote:
> > On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >>
> >> On 12/13/24 00:04, Tomas Vondra wrote:
> >>> ...
> >>>
> >>> The main difference is here:
> >>>
> >>>
> >>> master / no parallel workers:
> >>>
> >>> pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
> >>>
> >>> 1 parallel worker:
> >>>
> >>> pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
> >>>
> >>>
> >>> Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
> >>> essentially just those with deleted tuples, which is ~1/20 of pages.
> >>> That's close to the 15x speedup.
> >>>
> >>> This effect is clearest without indexes, but it does affect even runs
> >>> with indexes - having to scan the indexes makes it much less pronounced,
> >>> though. However, these indexes are pretty massive (each about the same
> >>> size as the table, so multiple times larger than the table in total).
> >>> Chances are it'd be clearer on realistic data sets.
> >>>
> >>> So the question is - is this correct? And if yes, why doesn't the
> >>> regular (serial) vacuum do that?
> >>>
> >>> There are some more strange things, though. For example, how come the
> >>> avg read rate is 0.000 MB/s?
> >>>
> >>> avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
> >>>
> >>> It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
> >>> not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
> >>> pages are in shared buffers (thanks to the DELETE earlier in that session).
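
(Side note: a quick back-of-the-envelope check of those numbers, assuming
the default 8 kB block size -- just my own arithmetic, not taken from the
VACUUM code:)

#include <stdio.h>

int main(void)
{
    /* Numbers quoted above: 10001 pages scanned in roughly 0.15 seconds,
     * assuming the default 8 kB block size. */
    double pages   = 10001;
    double seconds = 0.15;
    double mib     = pages * 8192.0 / (1024.0 * 1024.0);

    printf("%.1f MiB scanned, ~%.0f MB/s\n", mib, mib / seconds);

    /* Prints roughly "78.1 MiB scanned, ~521 MB/s", i.e. in the same
     * ballpark as the reported *write* rate.  So the 0.000 MB/s read rate
     * is consistent with it being derived from buffer misses (zero here,
     * since everything was already in shared buffers), not pages scanned. */
    return 0;
}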
> >>>
> >>
> >> OK, after looking into this a bit more I think the reason is rather
> >> simple - SKIP_PAGES_THRESHOLD.
> >>
> >> With serial runs, we end up scanning all pages, because even with an
> >> update every 5000 tuples, that's still only ~25 pages apart, well within
> >> the 32-page window. So we end up skipping no pages, and scan and vacuum
> >> everything.
> >>
> >> But parallel runs have this skipping logic disabled, or rather the logic
> >> that switches to sequential scans if the gap is less than 32 pages.
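
(To make that concrete, here is a minimal standalone sketch of the skip
heuristic as I understand it -- simplified, not the actual vacuumlazy.c
code: a run of all-visible pages is skipped only if it is at least
SKIP_PAGES_THRESHOLD (32) blocks long, otherwise it is read anyway to keep
the I/O sequential.)

#include <stdbool.h>
#include <stdio.h>

#define SKIP_PAGES_THRESHOLD 32

static long
count_scanned(const bool *all_visible, long nblocks)
{
    long scanned = 0;
    long blkno = 0;

    while (blkno < nblocks)
    {
        if (!all_visible[blkno])
        {
            scanned++;              /* modified block: must be scanned */
            blkno++;
            continue;
        }

        /* measure the run of all-visible blocks starting here */
        long run = 0;
        while (blkno + run < nblocks && all_visible[blkno + run])
            run++;

        if (run < SKIP_PAGES_THRESHOLD)
            scanned += run;         /* too short to skip: read it anyway */
        /* either way, advance past the run (long runs are skipped) */
        blkno += run;
    }
    return scanned;
}

int main(void)
{
    /* toy table: one modified (not all-visible) page every ~25 blocks,
     * mimicking an update every 5000 tuples at ~200 tuples per page */
    enum { NBLOCKS = 10000 };
    static bool all_visible[NBLOCKS];

    for (long i = 0; i < NBLOCKS; i++)
        all_visible[i] = (i % 25 != 0);

    printf("scanned %ld of %d blocks\n",
           count_scanned(all_visible, NBLOCKS), NBLOCKS);
    /* prints "scanned 10000 of 10000": with gaps of only ~25 pages the
     * threshold never kicks in, so the serial case reads every page */
    return 0;
}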
> >>
> >>
> >> IMHO this raises two questions:
> >>
> >> 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
> >> sequential scans if the pages are close enough? Maybe there is a reason
> >> for this difference? Workers can reduce the difference between random
> >> and sequential I/O, similarly to prefetching. But that just means the
> >> workers should use a lower threshold, e.g. as
> >>
> >> SKIP_PAGES_THRESHOLD / nworkers
> >>
> >> or something like that? I don't see this discussed in this thread.
> >
> > Each parallel heap scan worker allocates a chunk of blocks, which is at
> > most 8192 blocks, so we would need to apply the SKIP_PAGES_THRESHOLD
> > optimization within the chunk. I agree that we need to evaluate the
> > differences anyway. Will do the benchmark test and share the results.
> >
>
> Right. I don't think this really matters for small tables, and for large
> tables the chunks should be fairly large (possibly up to 8192 blocks),
> in which case we could apply SKIP_PAGES_THRESHOLD just like in the serial
> case. There might be differences at boundaries between chunks, but that
> seems like a minor / expected detail. I haven't checked whether the code
> would need to change, or by how much.
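
(For illustration, here is a rough sketch of what applying the threshold
within a worker's chunk could look like -- hypothetical, not code from the
patch; the chunk size, table size, and visibility pattern are just
assumptions for the example.)

#include <stdbool.h>
#include <stdio.h>

#define SKIP_PAGES_THRESHOLD 32
#define CHUNK_SIZE 8192             /* assumed maximum per-worker chunk */

static long
scanned_in_chunk(const bool *all_visible, long start, long end)
{
    long scanned = 0;
    long blkno = start;

    while (blkno < end)
    {
        if (!all_visible[blkno])
        {
            scanned++;              /* modified block: must be scanned */
            blkno++;
            continue;
        }

        /* run of all-visible blocks, clipped at the chunk boundary */
        long run = 0;
        while (blkno + run < end && all_visible[blkno + run])
            run++;

        if (run < SKIP_PAGES_THRESHOLD)
            scanned += run;         /* clipped/short run: read it anyway */
        blkno += run;
    }
    return scanned;
}

int main(void)
{
    enum { NBLOCKS = 221239 };      /* table size from the test above */
    static bool all_visible[NBLOCKS];
    long scanned = 0;

    /* here a modified page every ~40 blocks, so whole runs are skippable */
    for (long i = 0; i < NBLOCKS; i++)
        all_visible[i] = (i % 40 != 0);

    for (long start = 0; start < NBLOCKS; start += CHUNK_SIZE)
    {
        long end = (start + CHUNK_SIZE < NBLOCKS) ? start + CHUNK_SIZE : NBLOCKS;
        scanned += scanned_in_chunk(all_visible, start, end);
    }

    printf("scanned %ld of %d blocks\n", scanned, NBLOCKS);
    /* only the modified blocks plus the few runs clipped short at chunk
     * boundaries get scanned, so the boundary effect stays small */
    return 0;
}

The only difference from the serial behaviour shows up where a skippable run
straddles a chunk boundary, which matches the point above that the boundary
effect should be minor.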
>
> >>
> >> 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
> >> storage. If I can get an order of magnitude improvement (or more than
> >> that) by disabling the threshold, and just doing random I/O, maybe
> >> it's time to adjust it a bit.
> >
> > Yeah, you've started a thread for this so let's discuss it there.
> >
>
> OK. FWIW as suggested in the other thread, it doesn't seem to be merely
> a question of VACUUM performance, as not skipping pages gives vacuum the
> opportunity to do cleanup that would otherwise need to happen later.
>
> If only for this reason, I think it would be good to keep the serial and
> parallel vacuum consistent.
>
I've not evaluated the SKIP_PAGES_THRESHOLD optimization yet, but I'd like
to share the latest patch set since cfbot reported some failures. Comments
from Kuroda-san are also incorporated in this version. I'd also like to
share the results of the performance tests I ran with the latest patch.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
  parallel_heap_vacuum_benchmark_v6.pdf (application/pdf, 37.8 KB)
  v6-0001-Move-lazy-heap-scanning-related-variables-to-stru.patch (application/octet-stream, 27.2 KB)
  v6-0002-Remember-the-number-of-times-parallel-index-vacuu.patch (application/octet-stream, 6.8 KB)
  v6-0003-Support-parallel-heap-scan-during-lazy-vacuum.patch (application/octet-stream, 74.5 KB)
  v6-0004-raidxtree.h-support-shared-iteration.patch (application/octet-stream, 16.8 KB)
  v6-0005-Support-shared-itereation-on-TidStore.patch (application/octet-stream, 7.1 KB)
  v6-0006-radixtree.h-Add-RT_NUM_KEY-API-to-get-the-number-.patch (application/octet-stream, 1.9 KB)
  v6-0007-Add-TidStoreNumBlocks-API-to-get-the-number-of-bl.patch (application/octet-stream, 1.6 KB)
  v6-0008-Support-parallel-heap-vacuum-during-lazy-vacuum.patch (application/octet-stream, 22.3 KB)