From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Parallel heap vacuum |
Date: | 2025-01-17 01:06:14 |
Message-ID: | CAD21AoBN1=N8ZfQgWX=C8VxLj5tr1-Qu3ABCcpuMAOmHejO-Kw@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, Jan 12, 2025 at 1:34 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Fri, Jan 3, 2025 at 3:38 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > On Wed, Dec 25, 2024 at 8:52 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> > >
> > >
> > >
> > > On 12/19/24 23:05, Masahiko Sawada wrote:
> > > > On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> > > >>
> > > >> On 12/13/24 00:04, Tomas Vondra wrote:
> > > >>> ...
> > > >>>
> > > >>> The main difference is here:
> > > >>>
> > > >>>
> > > >>> master / no parallel workers:
> > > >>>
> > > >>> pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
> > > >>>
> > > >>> 1 parallel worker:
> > > >>>
> > > >>> pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
> > > >>>
> > > >>>
> > > >>> Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
> > > >>> essentially just those with deleted tuples, which is ~1/20 of pages.
> > > >>> That's close to the 15x speedup.
> > > >>>
> > > >>> This effect is clearest without indexes, but it does affect even runs
> > > >>> with indexes - having to scan the indexes makes it much less pronounced,
> > > >>> though. However, these indexes are pretty massive (about the same size
> > > >>> as the table) - multiple times larger than the table. Chances are it'd
> > > >>> be clearer on realistic data sets.
> > > >>>
> > > >>> So the question is - is this correct? And if yes, why doesn't the
> > > >>> regular (serial) vacuum do that?
> > > >>>
> > > > >>> There are some more strange things, though. For example, how come the avg
> > > >>> read rate is 0.000 MB/s?
> > > >>>
> > > >>> avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
> > > >>>
> > > >>> It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
> > > >>> not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
> > > >>> pages are in shared buffers (thanks to the DELETE earlier in that session).
> > > >>>
> > > >>
> > > >> OK, after looking into this a bit more I think the reason is rather
> > > >> simple - SKIP_PAGES_THRESHOLD.
> > > >>
> > > >> With serial runs, we end up scanning all pages, because even with an
> > > >> update every 5000 tuples, that's still only ~25 pages apart, well within
> > > > >> the 32-page window. So we end up skipping no pages, and scan and vacuum
> > > > >> everything.
> > > >>
> > > >> But parallel runs have this skipping logic disabled, or rather the logic
> > > >> that switches to sequential scans if the gap is less than 32 pages.
> > > >>
> > > >>
> > > >> IMHO this raises two questions:
> > > >>
> > > > >> 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
> > > > >> sequential scans if the pages are close enough? Maybe there is a reason
> > > > >> for this difference? Workers can reduce the difference between random
> > > > >> and sequential I/O, similarly to prefetching. But that just means the
> > > >> workers should use a lower threshold, e.g. as
> > > >>
> > > >> SKIP_PAGES_THRESHOLD / nworkers
> > > >>
> > > >> or something like that? I don't see this discussed in this thread.
> > > >
> > > > > Each parallel heap scan worker allocates a chunk of blocks, at most
> > > > > 8192 blocks, so we would need to apply the SKIP_PAGES_THRESHOLD
> > > > > optimization within each chunk. I agree that we need to evaluate the
> > > > > differences anyway. Will do the benchmark test and share the results.
> > > >
> > >
> > > Right. I don't think this really matters for small tables, and for large
> > > tables the chunks should be fairly large (possibly up to 8192 blocks),
> > > in which case we could apply SKIP_PAGES_THRESHOLD just like in the serial
> > > case. There might be differences at boundaries between chunks, but that
> > > seems like a minor / expected detail. I haven't checked whether the code
> > > would need to change, or how much.
> > >
> > > >>
> > > >> 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
> > > >> storage. If I can get an order of magnitude improvement (or more than
> > > >> that) by disabling the threshold, and just doing random I/O, maybe
> > > >> there's time to adjust it a bit.
> > > >
> > > > Yeah, you've started a thread for this so let's discuss it there.
> > > >
> > >
> > > OK. FWIW as suggested in the other thread, it doesn't seem to be merely
> > > a question of VACUUM performance, as not skipping pages gives vacuum the
> > > opportunity to do cleanup that would otherwise need to happen later.
> > >
> > > If only for this reason, I think it would be good to keep the serial and
> > > parallel vacuum consistent.
> > >
> >
> > I've not evaluated the SKIP_PAGES_THRESHOLD optimization yet but I'd like
> > to share the latest patch set as cfbot reports some failures. Comments
> > from Kuroda-san are also incorporated in this version. Also, I'd like
> > to share the performance test results I did with the latest patch.
> >
>
> I've implemented the SKIP_PAGES_THRESHOLD optimization in parallel heap
> scan, and attached the updated patch set. I've also attached the
> performance test results comparing the v6 and v7 patch sets. There are
> no big differences across the test cases, but the v7 patch performs
> slightly better.
I've made some changes to the patch set. First, I've dropped the
addition of TidStoreNumBlocks(), since it turned out we don't
necessarily need it. I've also split the parallel lazy heap scan
patches further to make them easier to review. Feedback is very welcome.
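
For anyone following the SKIP_PAGES_THRESHOLD discussion quoted above, here is a minimal, simplified sketch of the skipping rule as described there: a run of skippable pages is actually skipped only if it is at least as long as the threshold, otherwise it is read sequentially. This is an illustration only, not code from the patch set or from vacuumlazy.c. The function name `count_scanned_pages`, the `skippable` array, and the `SKIP_THRESHOLD` macro are hypothetical, and details such as the visibility map lookup or special handling of the last page are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the 32-page window mentioned in the thread. */
#define SKIP_THRESHOLD 32

/*
 * Count how many pages a scan would read, given a per-page flag saying
 * whether the page could be skipped (e.g. it is all-visible).
 *
 * A run of skippable pages is skipped only when it is at least
 * `threshold` pages long; shorter runs are read anyway so that the
 * scan stays sequential.
 */
static size_t
count_scanned_pages(const bool *skippable, size_t npages, size_t threshold)
{
    size_t scanned = 0;
    size_t i = 0;

    while (i < npages)
    {
        if (!skippable[i])
        {
            /* Page must be processed. */
            scanned++;
            i++;
            continue;
        }

        /* Measure the run of skippable pages starting at i. */
        size_t run = 0;
        while (i + run < npages && skippable[i + run])
            run++;

        /* Gap too short to be worth skipping: read it sequentially. */
        if (run < threshold)
            scanned += run;

        i += run;
    }

    return scanned;
}
```

With one page needing work every 25 pages, every gap is shorter than 32, so this sketch counts all pages as scanned, which matches the serial behaviour described in the quoted analysis. With the threshold effectively disabled (a threshold of 1), only the pages needing work are counted.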
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment | Content-Type | Size |
---|---|---|
v8-0009-Support-parallel-heap-vacuum-during-lazy-vacuum.patch | application/octet-stream | 23.7 KB |
v8-0005-vacuumparallel.c-Support-parallel-table-vacuuming.patch | application/octet-stream | 18.0 KB |
v8-0007-raidxtree.h-support-shared-iteration.patch | application/octet-stream | 16.0 KB |
v8-0008-Support-shared-itereation-on-TidStore.patch | application/octet-stream | 7.1 KB |
v8-0004-Add-table-APIs-for-parallel-table-vacuuming.patch | application/octet-stream | 4.8 KB |
v8-0003-Move-GlobalVisState-definition-to-snapmgr_interna.patch | application/octet-stream | 9.1 KB |
v8-0006-Support-parallel-heap-scan-during-lazy-vacuum.patch | application/octet-stream | 47.6 KB |
v8-0002-Remember-the-number-of-times-parallel-index-vacuu.patch | application/octet-stream | 6.8 KB |
v8-0001-Move-lazy-heap-scan-related-variables-to-new-stru.patch | application/octet-stream | 27.3 KB |