Re: Parallel heap vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel heap vacuum
Date: 2024-10-30 17:17:49
Message-ID: CAD21AoBsQup1cnurYtkDU-GX8Ytuu1x_adUCr++bn8JeER=0mw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 25, 2024 at 12:25 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> On Tue, Oct 22, 2024 at 4:54 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > Sorry for the very late reply.
> >
> > On Tue, Jul 30, 2024 at 8:54 PM Hayato Kuroda (Fujitsu)
> > <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> > >
> > > Dear Sawada-san,
> > >
> > > > Thank you for testing!
> > >
> > > I tried to profile the vacuuming with a larger case (40 workers for the 20GB
> > > table), and the attached FlameGraph shows the result. IIUC, I could not find
> > > any bottlenecks.
> > >
> > > > > 2.
> > > > > I compared parallel heap scan and found that it does not have a compute_worker
> > > > > API. Can you clarify the reason why there is an inconsistency?
> > > > > (I feel it is intentional because the calculation logic seems to depend on the
> > > > > heap structure, so should we add the API for table scan as well?)
> > > >
> > > > There is room to consider a better API design, but yes, the reason is
> > > > that the calculation logic depends on the table AM implementation. For
> > > > example, I thought it might make sense to take the number of all-visible
> > > > pages into account when calculating the number of parallel workers, as we
> > > > don't want to launch many workers on a table where most pages are
> > > > all-visible. That heuristic might not work for other table AMs.
> > > >
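To illustrate the kind of heap-specific heuristic I had in mind above, here is
a rough sketch (not taken from the attached patches; the function name and the
pages-per-worker constant are made up):

#include "postgres.h"

#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Rough sketch only: scale the number of heap-scan workers by the number of
 * pages that are not all-visible, since all-visible pages can be skipped.
 * The 8192-pages-per-worker threshold is an arbitrary placeholder.
 */
static int
heap_vacuum_compute_workers_sketch(Relation rel, int max_workers)
{
	BlockNumber total_pages = RelationGetNumberOfBlocks(rel);
	BlockNumber all_visible;
	BlockNumber all_frozen;
	BlockNumber pages_to_scan;

	visibilitymap_count(rel, &all_visible, &all_frozen);
	pages_to_scan = (total_pages > all_visible) ? total_pages - all_visible : 0;

	if (pages_to_scan < 8192)
		return 0;				/* not worth parallelism */

	return Min(max_workers, (int) (pages_to_scan / 8192));
}
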
> > >
> > > Okay, thanks for confirming. I wanted to ask others as well.
> > >
> > >
> > > > I'm updating the patch to implement parallel heap vacuum and will
> > > > share the updated patch. It might take time as it requires implementing
> > > > shared iteration support in the radix tree.
> > >
> > > Here are other preliminary comments for the v2 patch. This does not
> > > include cosmetic ones.
> > >
> > > 1.
> > > The shared data structure PHVShared does not contain a mutex lock. Is that
> > > intentional because its fields are accessed by the leader only after the
> > > parallel workers exit?
> > >
> >
> > Yes, the fields in PHVShared are read-only for workers. Since no
> > concurrent reads/writes happen on these fields, we don't need to
> > protect them.
> >
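To spell out the pattern: the leader fills in the shared struct before
LaunchParallelWorkers(), and workers only ever read it afterwards, so no mutex
is needed. A minimal sketch (the struct fields and TOC key below are invented
for illustration; this is not the actual PHVShared definition, and the
shm_toc_estimate_* calls are omitted):

#include "postgres.h"

#include "access/parallel.h"
#include "storage/shm_toc.h"

/* Hypothetical stand-in for PHVShared, for illustration only. */
typedef struct PHVSharedSketch
{
	Oid			relid;
	bool		aggressive;
	TransactionId cutoff_xid;
} PHVSharedSketch;

#define PHV_KEY_SHARED	UINT64CONST(0xF001)		/* made-up TOC key */

/* Leader: populate the struct before any worker exists, then launch. */
static void
leader_publish_shared_state(ParallelContext *pcxt, Oid relid,
							bool aggressive, TransactionId cutoff_xid)
{
	PHVSharedSketch *shared;

	shared = (PHVSharedSketch *) shm_toc_allocate(pcxt->toc,
												  sizeof(PHVSharedSketch));
	shared->relid = relid;
	shared->aggressive = aggressive;
	shared->cutoff_xid = cutoff_xid;
	shm_toc_insert(pcxt->toc, PHV_KEY_SHARED, shared);

	LaunchParallelWorkers(pcxt);
}

/* Worker: only reads these fields, so no locking is required. */
static void
worker_attach_shared_state(shm_toc *toc)
{
	PHVSharedSketch *shared;

	shared = (PHVSharedSketch *) shm_toc_lookup(toc, PHV_KEY_SHARED, false);
	/* ... use shared->relid etc. read-only ... */
	(void) shared;
}
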
> > > 2.
> > > Per my understanding, the vacuuming goes through the steps below.
> > >
> > > a. parallel workers are launched for scanning pages
> > > b. leader waits until scans are done
> > > c. leader does vacuum alone (you may extend here...)
> > > d. parallel workers are launched again to clean up indexes
> > >
> > > If so, can we reuse the parallel workers for the cleanup? Or is that more
> > > painful engineering than it's worth?
> >
> > I hadn't thought of this idea, but I think it's possible from a
> > technical perspective. It would save the overhead of relaunching workers,
> > but I'm not sure how much it would improve performance, and I'm concerned
> > it would make the code complex. For example, different numbers of workers
> > might be required for table vacuuming and index vacuuming, so we would end
> > up increasing or decreasing the number of workers.
> >
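(FWIW, the existing parallel-context machinery does let the leader relaunch
with a different worker count between phases; vacuumparallel.c already does
roughly the following between index-vacuum passes, so worker reuse would
mostly be a bookkeeping question:)

#include "postgres.h"

#include "access/parallel.h"

/*
 * Sketch: after one phase finishes, reset the parallel context and relaunch
 * with a (possibly different) number of workers for the next phase.
 */
static void
relaunch_for_next_phase(ParallelContext *pcxt, int nworkers_for_next_phase)
{
	WaitForParallelWorkersToFinish(pcxt);

	ReinitializeParallelDSM(pcxt);
	ReinitializeParallelWorkers(pcxt, nworkers_for_next_phase);
	LaunchParallelWorkers(pcxt);
}
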
> > > 3.
> > > According to LaunchParallelWorkers(), bgw_name and bgw_type are hardcoded
> > > as "parallel worker ...". Can we extend this to improve trackability in
> > > pg_stat_activity?
> >
> > That would be a good improvement for trackability, but I think we should
> > do it in a separate patch, as it's not a problem specific to parallel
> > heap vacuum.
> >
> > >
> > > 4.
> > > I'm not a TidStore expert, but as you said, TidStoreLockExclusive() might be
> > > a bottleneck when TIDs are added to the shared TidStore. Another primitive
> > > idea of mine is to prepare per-worker TidStores (in PHVScanWorkerState or
> > > LVRelCounters?) and gather them after the heap scan. If you extend the patch
> > > so that parallel workers also do the vacuuming, the gathering may not be
> > > needed: each worker can access its own TidStore and clean up. One downside
> > > is that the memory consumption may be quite large.
> > >
> >
> > Interesting idea. Supposing we support parallel heap vacuum as well, we
> > wouldn't need locks or shared-iteration support on TidStore. I think each
> > worker would use a fraction of maintenance_work_mem. However, one downside
> > would be that we would need to check as many TidStores as there are workers
> > during index vacuuming.
>
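To be concrete about the memory budget: each backend would get a slice of
maintenance_work_mem, e.g. an even split (purely illustrative; the actual
division could be different):

#include "postgres.h"

#include "miscadmin.h"			/* for maintenance_work_mem */

/*
 * Illustration only: if each worker kept a private dead-TID store, the
 * simplest budget would be an even split of maintenance_work_mem (a GUC
 * measured in kilobytes) across the leader and the workers.
 */
static size_t
per_worker_dead_items_budget(int nworkers)
{
	return ((size_t) maintenance_work_mem * 1024) / (nworkers + 1);
}
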
> On further thought, I don't think this idea works well. Index vacuuming is
> the most time-consuming of the vacuum phases, so it would not be a good idea
> to make it slower even if we could do the parallel heap scan and heap vacuum
> without any locking. Also, merging multiple TidStores into one is not
> straightforward since the block ranges that each worker processes overlap.
>
> > FYI I've implemented the parallel heap vacuum part and am doing some
> > benchmark tests. I'll share the updated patches along with test
> > results this week.
>
> Please find the attached patches. From the previous version, I made a lot
> of changes, including bug fixes, addressing review comments, and adding
> parallel heap vacuum support. The parallel-vacuum-related infrastructure is
> implemented in vacuumparallel.c, and vacuumlazy.c now uses
> ParallelVacuumState for parallel heap scan/vacuum, index bulkdelete/cleanup,
> or both. Parallel vacuum workers are launched at the beginning of each phase
> and exit at the end of it. Since different numbers of workers could be used
> for heap scan/vacuum and for index bulkdelete/cleanup, it's possible that
> only one of the two is parallelized.
>
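In outline, the leader-side flow looks like this (the names below are
illustrative only, not the actual functions added by the patches):

#include "postgres.h"

/* Hypothetical phase markers and helpers -- not the patch's real API. */
typedef enum PhvPhase
{
	PHV_PHASE_HEAP_SCAN,
	PHV_PHASE_INDEX_BULKDEL,
	PHV_PHASE_HEAP_VACUUM
} PhvPhase;

extern void phv_launch_workers(PhvPhase phase, int nworkers); /* hypothetical */
extern void phv_run_phase(PhvPhase phase);		/* hypothetical */
extern void phv_wait_for_workers_to_exit(void); /* hypothetical */

/*
 * Workers are launched at the start of each phase and exit at its end, and
 * each phase may use a different worker count (possibly zero), so it is
 * possible that only some of the phases end up parallelized.
 */
static void
lazy_vacuum_outline(int heap_workers, int index_workers)
{
	phv_launch_workers(PHV_PHASE_HEAP_SCAN, heap_workers);
	phv_run_phase(PHV_PHASE_HEAP_SCAN); /* collect dead item TIDs */
	phv_wait_for_workers_to_exit();

	phv_launch_workers(PHV_PHASE_INDEX_BULKDEL, index_workers);
	phv_run_phase(PHV_PHASE_INDEX_BULKDEL); /* one worker per index */
	phv_wait_for_workers_to_exit();

	phv_launch_workers(PHV_PHASE_HEAP_VACUUM, heap_workers);
	phv_run_phase(PHV_PHASE_HEAP_VACUUM); /* reclaim dead line pointers */
	phv_wait_for_workers_to_exit();
}
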
> In order to implement parallel heap vacuum, I extended the radix tree and
> TidStore to support shared iteration. Shared iteration works only with a
> shared TidStore, whereas non-shared iteration works with both a local and a
> shared TidStore. For example, if a table is large and has only one index, we
> use only parallel heap scan/vacuum. In this case, we store dead item TIDs
> into a shared TidStore during the parallel heap scan, but during index
> bulk-deletion (which is not parallelized in this case) we perform a
> non-shared iteration on that shared TidStore, which is more efficient as it
> doesn't acquire any locks during the iteration.
>
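For reference, the "non-shared iteration" above is just the existing TidStore
iteration interface; a minimal sketch of how the leader would consume the
shared store during a non-parallel index bulk-deletion (assuming the current
TidStoreBeginIterate/TidStoreIterateNext/TidStoreEndIterate API; the
shared-iteration variants added by the patches are not shown):

#include "postgres.h"

#include "access/tidstore.h"

/*
 * Sketch: iterate over the dead-item TIDs block by block.  This works for
 * both local and shared TidStores and, when only the leader is iterating,
 * needs no locking.
 */
static void
iterate_dead_items(TidStore *dead_items)
{
	TidStoreIter *iter;
	TidStoreIterResult *result;

	iter = TidStoreBeginIterate(dead_items);
	while ((result = TidStoreIterateNext(iter)) != NULL)
	{
		/* process the offsets recorded for this block's dead items ... */
	}
	TidStoreEndIterate(iter);
}
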
> I've done benchmark tests with a 10GB unlogged table (created on a tmpfs
> tablespace) having 4 btree indexes, while changing the parallel degree. I
> restarted the postgres server before each run to ensure that the data is not
> already in shared memory, and I avoided disk I/O during lazy vacuum as much
> as possible. Here is a comparison between HEAD and patched (median of 5
> runs):
>
> +----------+-----------+-----------+----------------+
> | parallel |      HEAD |   patched | patched / HEAD |
> +----------+-----------+-----------+----------------+
> |        0 |  53079.53 | 53468.734 |          1.007 |
> |        1 |  48101.46 | 35712.613 |          0.742 |
> |        2 | 37767.902 | 23566.426 |          0.624 |
> |        4 | 38005.836 | 20192.055 |          0.531 |
> |        8 |  37754.47 | 18614.717 |          0.493 |
> +----------+-----------+-----------+----------------+
>
> Here are the breakdowns of the execution time of each vacuum phase (from
> left: heap scan, index bulkdel, and heap vacuum):
>
> - HEAD
> parallel 0: 53079.530 (15886, 28039, 9270)
> parallel 1: 48101.460 (15931, 23247, 9215)
> parallel 2: 37767.902 (15259, 12888, 9479)
> parallel 4: 38005.836 (16097, 12683, 9217)
> parallel 8: 37754.470 (16016, 12535, 9306)
>
> - Patched
> parallel 0: 53468.734 (15990, 28296, 9465)
> parallel 1: 35712.613 ( 8254, 23569, 3700)
> parallel 2: 23566.426 ( 6180, 12760, 3283)
> parallel 4: 20192.055 ( 4058, 12776, 2154)
> parallel 8: 18614.717 ( 2797, 13244, 1579)
>
> The index bulkdel phase saturates at 'parallel 2', as one worker is assigned
> to each index. On HEAD, there is no further performance gain beyond
> 'parallel 4'. On the other hand, with the patches it keeps getting faster at
> 'parallel 4' and 'parallel 8', since the other two phases are also done by
> parallel workers.

I've attached new version patches that fix the failures reported by cfbot. I
hope these changes make cfbot happy.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment Content-Type Size
v4-0004-Support-parallel-heap-vacuum-during-lazy-vacuum.patch application/octet-stream 11.8 KB
v4-0001-Support-parallel-heap-scan-during-lazy-vacuum.patch application/octet-stream 83.2 KB
v4-0003-Support-shared-itereation-on-TidStore.patch application/octet-stream 3.6 KB
v4-0002-raidxtree.h-support-shared-iteration.patch application/octet-stream 11.8 KB
