From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Parallel heap vacuum |
Date: | 2024-10-25 19:25:59 |
Message-ID: | CAD21AoCC7EXMOh2M10_-i77WY2qR=x5OEkBgau6KMz14a6UaVA@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Oct 22, 2024 at 4:54 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> Sorry for the very late reply.
>
> On Tue, Jul 30, 2024 at 8:54 PM Hayato Kuroda (Fujitsu)
> <kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
> >
> > Dear Sawada-san,
> >
> > > Thank you for testing!
> >
> > I tried to profile the vacuuming with a larger case (40 workers for the 20GB table),
> > and the attached FlameGraph shows the result. IIUC, I cannot find any bottlenecks.
> >
> > > > 2.
> > > > I compared parallel heap scan and found that it does not have a compute_worker
> > > > API. Can you clarify the reason why there is an inconsistency?
> > > > (I feel it is intentional because the calculation logic seems to depend on the
> > > > heap structure, so should we add the API for table scan as well?)
> > >
> > > There is room to consider a better API design, but yes, the reason is
> > > that the calculation logic depends on table AM implementation. For
> > > example, I thought it might make sense to consider taking the number
> > > of all-visible pages into account for the calculation of the number of
> > > parallel workers, as we don't want to launch many workers on a table
> > > where most pages are all-visible, which might not work for other table
> > > AMs.
> > >
> >
> > Okay, thanks for confirming. I wanted to ask others as well.
> >
> >
> > > I'm updating the patch to implement parallel heap vacuum and will
> > > share the updated patch. It might take time as it requires
> > > implementing shared iteration support in the radix tree.
> >
> > Here are other preliminary comments for the v2 patch. They do not include
> > cosmetic ones.
> >
> > 1.
> > The shared data structure PHVShared does not contain a mutex lock. Is that intentional
> > because its fields are accessed by the leader only after the parallel workers exit?
> >
>
> Yes, the fields in PHVShared are read-only for workers. Since no
> concurrent reads/writes happen on these fields we don't need to
> protect them.
>
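To illustrate that point, the shared area is conceptually just a set of
parameters that the leader fills in before launching the workers; roughly
like the following (the field list here is only illustrative, not the
exact contents of PHVShared in the patch):

    /*
     * Static information shared with parallel heap scan/vacuum workers.
     * The leader initializes everything before the workers are launched
     * and nothing is written to it afterwards, so no mutex is needed.
     * (Field names are invented for illustration.)
     */
    typedef struct PHVShared
    {
        Oid             relid;          /* target relation */
        int             elevel;         /* error reporting level */
        bool            aggressive;     /* aggressive vacuum? */

        /* cutoffs computed once by the leader */
        TransactionId   OldestXmin;
        MultiXactId     OldestMxact;
    } PHVShared;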
> > 2.
> > Per my understanding, the vacuuming goes through the steps below.
> >
> > a. parallel workers are launched to scan pages
> > b. the leader waits until the scans are done
> > c. the leader does the heap vacuum alone (you may extend here...)
> > d. parallel workers are launched again to clean up indexes
> >
> > If so, can we reuse the parallel workers for the cleanup? Or is this more
> > engineering pain than benefit?
>
> I had not thought of this idea, but I think it's possible from a
> technical perspective. It saves the overhead of relaunching workers, but
> I'm not sure how much it would improve performance, and I'm
> concerned it would make the code complex. For example, different
> numbers of workers might be required for table vacuuming and index
> vacuuming, so we would end up increasing or decreasing the number of
> workers anyway.
>
> > 3.
> > According to LaunchParallelWorkers(), the bgw_name and bgw_type are hardcoded as
> > "parallel worker ...". Can we extend this to improve trackability in
> > pg_stat_activity?
>
> It would be a good improvement for better trackability. But I think we
> should do that in a separate patch as it's not just a problem for
> parallel heap vacuum.
>
> >
> > 4.
> > I'm not an expert on TidStore, but as you said, TidStoreLockExclusive() might be a
> > bottleneck when TIDs are added to the shared TidStore. Another primitive idea of
> > mine is to prepare per-worker TidStores (in PHVScanWorkerState or LVRelCounters?)
> > and gather them after the heap scan. If you extend the patch so that parallel
> > workers also do the heap vacuuming, the gathering may not be needed: each worker
> > can access its own TidStore and clean up. One downside is that the memory
> > consumption may be quite large.
> >
>
> Interesting idea. Supposing we support parallel heap vacuum as well, we
> wouldn't need the locks, nor would we need to support shared iteration on
> TidStore. I think each worker should use a fraction of maintenance_work_mem
> in that case. However, one downside would be that we would need to check as
> many TidStores as there are workers during index vacuuming.
On further thought, I don't think this idea works well. Index vacuuming
is the most time-consuming of the vacuum phases, so it would not be a
good idea to make it slower even if we could do the parallel heap scan
and heap vacuum without any locking. Also, merging multiple TidStores
into one is not straightforward since the block ranges that each worker
processes overlap.
> FYI I've implemented the parallel heap vacuum part and am doing some
> benchmark tests. I'll share the updated patches along with test
> results this week.
Please find the attached patches. Since the previous version I have made
a lot of changes, including bug fixes, addressing review comments, and
adding parallel heap vacuum support. The parallel-vacuum-related
infrastructure is implemented in vacuumparallel.c, and vacuumlazy.c now
uses ParallelVacuumState for parallel heap scan/vacuum, index
bulkdelete/cleanup, or both. Parallel vacuum workers are launched at the
beginning of each phase and exit at the end of it. Since different
numbers of workers can be used for heap scan/vacuum and index
bulkdelete/cleanup, it's possible that only one of heap scan/vacuum and
index bulkdelete/cleanup is parallelized.
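Schematically, the flow in vacuumlazy.c now looks something like the
following (a simplified sketch; all names below are placeholders for
illustration, not the exact routines in the patch):

    /* Placeholder declarations; the patch implements this flow through
     * vacuumparallel.c and ParallelVacuumState rather than these names. */
    typedef struct ParallelVacuumState ParallelVacuumState;
    typedef enum { PHASE_SCAN_HEAP, PHASE_INDEX_BULKDEL, PHASE_VACUUM_HEAP } VacPhase;

    extern void begin_parallel_phase(ParallelVacuumState *pvs, VacPhase phase);
    extern void end_parallel_phase(ParallelVacuumState *pvs);
    extern void collect_dead_items(ParallelVacuumState *pvs);
    extern void bulkdel_indexes(ParallelVacuumState *pvs);
    extern void vacuum_heap_pages(ParallelVacuumState *pvs);

    static void
    lazy_vacuum_phases(ParallelVacuumState *pvs)
    {
        /* Phase 1: parallel heap scan collecting dead item TIDs */
        begin_parallel_phase(pvs, PHASE_SCAN_HEAP);     /* workers launch */
        collect_dead_items(pvs);
        end_parallel_phase(pvs);                        /* workers exit */

        /* Phase 2: index bulkdelete, possibly with a different worker count */
        begin_parallel_phase(pvs, PHASE_INDEX_BULKDEL);
        bulkdel_indexes(pvs);
        end_parallel_phase(pvs);

        /* Phase 3: parallel heap vacuum removing the collected dead tuples */
        begin_parallel_phase(pvs, PHASE_VACUUM_HEAP);
        vacuum_heap_pages(pvs);
        end_parallel_phase(pvs);
    }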
In order to implement parallel heap vacuum, I extended the radix tree and
tidstore to support shared iteration. Shared iteration works only with a
shared tidstore, whereas non-shared iteration works with both a local
tidstore and a shared tidstore. For example, if a table is large and has
only one index, we use only parallel heap scan/vacuum. In this case, we
store dead item TIDs in a shared tidstore during the parallel heap scan,
but during index bulk-deletion we perform a non-shared iteration on the
shared tidstore, which is more efficient as it doesn't acquire any locks
during the iteration.
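For reference, a non-shared iteration over a tidstore (local or shared)
looks roughly like this with the existing TidStore API; the shared
iteration added by the patch follows the same pattern but divides the
blocks among the participating workers (see the attached patches for its
exact interface):

    /* Minimal sketch using the existing TidStore iteration functions;
     * error handling and the per-block processing are omitted. */
    static void
    iterate_dead_items(TidStore *dead_items)
    {
        TidStoreIter       *iter;
        TidStoreIterResult *result;

        iter = TidStoreBeginIterate(dead_items);
        while ((result = TidStoreIterateNext(iter)) != NULL)
        {
            /* result->blkno is the next block containing dead item TIDs;
             * look up its offsets and process the heap page or index
             * entries here. */
        }
        TidStoreEndIterate(iter);
    }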
I've run benchmark tests with a 10GB unlogged table (created on a
tmpfs tablespace) having 4 btree indexes, while varying the parallel
degree. I restarted the postgres server before each run to ensure that
the data is not cached in shared buffers, and I avoided disk I/O during
lazy vacuum as much as possible. Here is a comparison of the total
execution time between HEAD and patched (median of 5 runs):
+----------+-----------+--------------+--------------+
| parallel | HEAD (ms) | patched (ms) | patched/HEAD |
+----------+-----------+--------------+--------------+
|        0 | 53079.530 |    53468.734 |        1.007 |
|        1 | 48101.460 |    35712.613 |        0.742 |
|        2 | 37767.902 |    23566.426 |        0.624 |
|        4 | 38005.836 |    20192.055 |        0.531 |
|        8 | 37754.470 |    18614.717 |        0.493 |
+----------+-----------+--------------+--------------+
Here are the breakdowns of the execution time of each vacuum phase, in
milliseconds (from left: heap scan, index bulkdel, and heap vacuum):
- HEAD
parallel 0: 53079.530 (15886, 28039, 9270)
parallel 1: 48101.460 (15931, 23247, 9215)
parallel 2: 37767.902 (15259, 12888, 9479)
parallel 4: 38005.836 (16097, 12683, 9217)
parallel 8: 37754.470 (16016, 12535, 9306)
- Patched
parallel 0: 53468.734 (15990, 28296, 9465)
parallel 1: 35712.613 ( 8254, 23569, 3700)
parallel 2: 23566.426 ( 6180, 12760, 3283)
parallel 4: 20192.055 ( 4058, 12776, 2154)
parallel 8: 18614.717 ( 2797, 13244, 1579)
The index bulkdel phase saturates at 'parallel 2', as one worker is
assigned to each index. On HEAD, there is therefore no further
performance gain at 'parallel 4' and 'parallel 8'. On the other hand,
the patched version gets faster even at 'parallel 4' and 'parallel 8',
since the other two phases are also done by parallel workers.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachment | Content-Type | Size |
---|---|---|
v3-0004-Support-parallel-heap-vacuum-during-lazy-vacuum.patch | application/octet-stream | 9.8 KB |
v3-0003-Support-shared-itereation-on-TidStore.patch | application/octet-stream | 3.6 KB |
v3-0001-Support-parallel-heap-scan-during-lazy-vacuum.patch | application/octet-stream | 82.7 KB |
v3-0002-raidxtree.h-support-shared-iteration.patch | application/octet-stream | 11.8 KB |