Re: Parallel heap vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, John Naylor <johncnaylorls(at)gmail(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel heap vacuum
Date: 2025-03-23 08:45:35
Message-ID: CAD21AoAu7hifESz_4zdnFS1RpJChReTzqCLOwDd8FxzHCH6FJA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 22, 2025 at 7:16 AM Melanie Plageman
<melanieplageman(at)gmail(dot)com> wrote:
>
> On Thu, Mar 20, 2025 at 4:36 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > When testing multiple passes of table vacuuming, I found an issue.
> > With the current patch, both the leader and the parallel worker
> > processes stop phase 1 as soon as the shared TidStore size reaches
> > the limit, and then the leader resumes the parallel heap scan after
> > heap vacuuming and index vacuuming. Therefore, as I described in the
> > patch, one tricky part is that we might launch fewer workers than
> > the previous time when resuming phase 1 after phases 2 and 3. In
> > that case, since the previous parallel workers might have already
> > allocated some blocks in their chunks, newly launched workers need
> > to take over their parallel scan state. That's why the patch stores
> > the workers' ParallelBlockTableScanWorkerData in shared memory.
> > However, I found that my assumption was wrong; to take over the
> > previous scan state properly, we need to take over not only
> > ParallelBlockTableScanWorkerData but also the ReadStream state, as
> > parallel workers might have already queued some blocks for
> > look-ahead in their ReadStream. Looking at the ReadStream code, I
> > find that it's not realistic to store it in shared memory.
>
> It seems like one way to solve this would be to add functionality to
> the read stream to unpin the buffers it has in its queue without
> trying to continue calling the callback until the stream is
> exhausted.
>
> We have read_stream_reset(), but that is to restart streams that have
> already been exhausted. Exhausted streams are where the callback has
> returned InvalidBlockNumber. In the read_stream_reset() cases, the
> read stream user knows there are more blocks it would like to scan or
> that it would like to restart the scan from the beginning.
>
> Your case is that you want to stop trying to exhaust the read stream
> and just unpin the remaining buffers. As long as the worker that
> paused phase I knows exactly the last block it processed and can
> communicate this to whichever worker resumes phase I later, that
> worker can initialize vacrel->current_block to the last block
> processed.

If we use ParallelBlockTableScanDesc together with the read stream as
the patch did, we would also need to somehow rewind the number of
blocks already allocated to workers. The problem I had with that
approach was that a parallel vacuum worker allocates a new chunk of
blocks while doing look-ahead reading, which advances
ParallelBlockTableScanDescData.phs_nallocated. So even if new
functionality let us unpin the remaining buffers in the queue and a
parallel worker resumed phase 1 from the last processed block, we
would lose some blocks in the already-allocated chunks unless we also
rewound the ParallelBlockTableScanDescData and
ParallelBlockTableScanWorkerData state. And since a worker might have
already allocated multiple chunks, rewinding that scan state would not
be easy.
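
For reference, the scan state involved looks like this (from
src/include/access/relscan.h; comments paraphrased). The shared
phs_nallocated counter only ever advances, while each worker
separately tracks its position within the chunk it most recently
claimed, so there is no single value we could subtract to hand the
unread blocks back:

typedef struct ParallelBlockTableScanDescData
{
    ParallelTableScanDescData base;
    BlockNumber phs_nblocks;        /* # blocks in relation at start of scan */
    slock_t     phs_mutex;          /* mutual exclusion for setting startblock */
    BlockNumber phs_startblock;     /* starting block number */
    pg_atomic_uint64 phs_nallocated;    /* # blocks handed out to workers so far */
} ParallelBlockTableScanDescData;

typedef struct ParallelBlockTableScanWorkerData
{
    uint64      phsw_nallocated;        /* current # of blocks into the scan */
    uint32      phsw_chunk_remaining;   /* # blocks left in this chunk */
    uint32      phsw_chunk_size;        /* # blocks to claim per chunk */
} ParallelBlockTableScanWorkerData;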

Another idea is that parallel workers don't exit phase 1 until they
have consumed all pinned buffers in the queue, even if the memory
usage of the TidStore exceeds the limit. This would require adding new
functionality to the read stream to disable look-ahead reading. Since
processing those queued buffers could use a fair amount of memory
beyond the limit, we could switch into this mode when the TidStore's
memory usage reaches 70% of the limit or so. On the other hand, it
means we would not use streaming reads for the blocks processed in
this mode, which is not efficient.
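
To make it concrete, here is a rough sketch of how the trigger could
look in the lazy_scan_heap() loop. read_stream_disable_lookahead() is
purely hypothetical (nothing like it exists in read_stream.c today),
and the 70% threshold is an arbitrary placeholder:

/*
 * Hypothetical sketch only: stop queueing new look-ahead blocks once
 * the TidStore gets close to its limit, so that by the time the limit
 * is actually reached the stream's queue is already drained and
 * pausing phase 1 loses no blocks allocated to this worker.
 */
static bool
heap_vac_should_pause_phase1(LVRelState *vacrel, ReadStream *stream)
{
    size_t      limit = vacrel->dead_items_info->max_bytes;
    size_t      used = TidStoreMemoryUsage(vacrel->dead_items);

    if (used > limit * 0.7)
        read_stream_disable_lookahead(stream);  /* hypothetical API */

    /* Pause phase 1 only once the limit is actually exceeded. */
    return used > limit;
}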

>
> > One plausible solution would be that we don't use ReadStream in
> > parallel heap vacuum cases but directly use
> > table_block_parallelscan_xxx() instead. It works but we end up having
> > two different scan methods for parallel and non-parallel lazy heap
> > scan. I've implemented this idea in the attached v12 patches.
>
> One question is in which scenarios parallel vacuum phase I without
> AIO will be faster than read-AIO-ified vacuum phase I. Without AIO
> writes, I suppose it would be trivial for parallel vacuum phase I to
> be faster without using read AIO. But it's worth thinking about the
> tradeoff.

As Andres pointed out, that approach has major downsides, so we would
need to invent a way to stop and resume the read stream in the middle
of a parallel scan.
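
Just to make that concrete, something along these lines (purely
hypothetical; nothing like it exists in read_stream.h today):

/*
 * Hypothetical addition, for discussion only: pause a stream mid-scan
 * by unpinning every buffer still sitting in the look-ahead queue
 * without calling the callback again, and return those block numbers
 * to the caller so the parallel scan state (phs_nallocated /
 * phsw_chunk_remaining) can be repaired before another worker resumes
 * the scan.
 */
extern int read_stream_pause(ReadStream *stream,
                             BlockNumber *unread_blocks, int max_blocks);

Whether resuming could then simply reuse read_stream_reset() after the
parallel scan state has been fixed up is another open question.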

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
