Re: Parallel heap vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel heap vacuum
Date: 2024-12-19 22:05:30
Message-ID: CAD21AoBRpYz-e_VZGrdrqMcvv0j_0zfqSkS=a1ruWzu7KtQHgw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Dec 14, 2024 at 1:24 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 12/13/24 00:04, Tomas Vondra wrote:
> > ...
> >
> > The main difference is here:
> >
> >
> > master / no parallel workers:
> >
> > pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
> >
> > 1 parallel worker:
> >
> > pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
> >
> >
> > Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
> > essentially just those with deleted tuples, which is ~1/20 of pages.
> > That's close to the 15x speedup.
> >
> > This effect is clearest without indexes, but it does affect even runs
> > with indexes - having to scan the indexes makes it much less pronounced,
> > though. However, these indexes are pretty massive (about the same size
> > as the table) - multiple times larger than the table. Chances are it'd
> > be clearer on realistic data sets.
> >
> > So the question is - is this correct? And if yes, why doesn't the
> > regular (serial) vacuum do that?
> >
> > There are some more strange things, though. For example, how come the avg
> > read rate is 0.000 MB/s?
> >
> > avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
> >
> > It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
> > not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
> > pages are in shared buffers (thanks to the DELETE earlier in that session).
> >
>
> OK, after looking into this a bit more I think the reason is rather
> simple - SKIP_PAGES_THRESHOLD.
>
> With serial runs, we end up scanning all pages, because even with an
> update every 5000 tuples, that's still only ~25 pages apart, well within
> the 32-page window. So we end up skipping no pages, and scan and
> vacuum everything.
>
> But parallel runs have this skipping logic disabled, or rather the logic
> that switches to sequential scans if the gap is less than 32 pages.
>
>
> IMHO this raises two questions:
>
> 1) Shouldn't parallel runs use SKIP_PAGES_THRESHOLD too, i.e. switch to
> sequential scans if the pages are close enough. Maybe there is a reason
> for this difference? Workers can reduce the difference between random
> and sequential I/O, similarly to prefetching. But that just means the
> workers should use a lower threshold, e.g. as
>
> SKIP_PAGES_THRESHOLD / nworkers
>
> or something like that? I don't see this discussed in this thread.

Each parallel heap scan worker allocates a chunk of at most 8192
blocks, so we would need to apply the SKIP_PAGES_THRESHOLD
optimization within each chunk. I agree that we need to evaluate the
difference either way. I'll run the benchmark and share the results.
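
To make it concrete, here is a rough sketch of what I have in mind
(not the actual vacuum code; plan_chunk_scan() and the
threshold/nworkers scaling are made up for illustration). Within the
chunk a worker got from the parallel scan, a run of all-visible
blocks is skipped only if it is at least "threshold" blocks long, and
the threshold could be scaled down by the number of workers as you
suggested:

#include <stdbool.h>
#include <stdio.h>

#define SKIP_PAGES_THRESHOLD 32

/* hypothetical: whether the visibility map says this block is all-visible */
static bool
block_is_all_visible(const bool *vm, int blkno)
{
    return vm[blkno];
}

/*
 * Decide which blocks of the chunk [chunk_start, chunk_start + chunk_len)
 * the worker should actually read.
 */
static void
plan_chunk_scan(const bool *vm, int chunk_start, int chunk_len,
                int nworkers, bool *do_scan)
{
    int     threshold = SKIP_PAGES_THRESHOLD / (nworkers > 0 ? nworkers : 1);
    int     i = 0;

    while (i < chunk_len)
    {
        if (!block_is_all_visible(vm, chunk_start + i))
        {
            do_scan[i] = true;
            i++;
            continue;
        }

        /* measure the run of consecutive all-visible blocks starting here */
        int     run = 0;

        while (i + run < chunk_len &&
               block_is_all_visible(vm, chunk_start + i + run))
            run++;

        /* skip the run only if it is long enough to be worth a seek */
        bool    skip = (run >= threshold);

        for (int j = 0; j < run; j++)
            do_scan[i + j] = !skip;
        i += run;
    }
}

int
main(void)
{
    /* toy chunk: a modified page every 25 blocks, the rest all-visible */
    bool    vm[100];
    bool    do_scan[100];
    int     scanned = 0;

    for (int i = 0; i < 100; i++)
        vm[i] = (i % 25 != 0);

    plan_chunk_scan(vm, 0, 100, 1, do_scan);

    for (int i = 0; i < 100; i++)
        scanned += do_scan[i];

    /* gaps of ~24 all-visible pages (< 32) mean nothing gets skipped */
    printf("blocks scanned: %d of 100\n", scanned);
    return 0;
}

With the toy input above (a modified page roughly every 25 blocks, as
in your test) no run reaches the threshold, so all 100 blocks get
scanned, which matches what the serial vacuum does today.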

>
> 2) It seems the current SKIP_PAGES_THRESHOLD is awfully high for good
> storage. If I can get an order of magnitude improvement (or more than
> that) by disabling the threshold, and just doing random I/O, maybe
> it's time to adjust it a bit.

Yeah, you've started a thread for this so let's discuss it there.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
