Re: Parallel heap vacuum

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel heap vacuum
Date: 2024-07-08 06:14:13
Message-ID: CAD21AoD2PR9XLaAcU92hp=SeLaPQ3AGevztttrh+uf5Ugu3H-Q@mail.gmail.com

On Fri, Jun 28, 2024 at 9:06 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > # Benchmark results
> >
> > * Test-1: parallel heap scan on the table without indexes
> >
> > I created a 20GB table, generated garbage in it, and ran vacuum while
> > varying the parallel degree:
> >
> > create unlogged table test (a int) with (autovacuum_enabled = off);
> > insert into test select generate_series(1, 600000000); --- 20GB table
> > delete from test where a % 5 = 0;
> > vacuum (verbose, parallel 0) test;
> >
> > Here are the results (total time and heap scan time):
> >
> > PARALLEL 0: 21.99 s (single process)
> > PARALLEL 1: 11.39 s
> > PARALLEL 2: 8.36 s
> > PARALLEL 3: 6.14 s
> > PARALLEL 4: 5.08 s
> >
> > * Test-2: parallel heap scan on the table with one index
> >
> > I used a table similar to the one in test case 1 but created one btree index on it:
> >
> > create unlogged table test (a int) with (autovacuum_enabled = off);
> > insert into test select generate_series(1, 600000000); --- 20GB table
> > create index on test (a);
> > delete from test where a % 5 = 0;
> > vacuum (verbose, parallel 0) test;
> >
> > I've measured the total execution time as well as the time of each
> > vacuum phase (from left: heap scan time, index vacuum time, and heap
> > vacuum time):
> >
> > PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
> > PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
> > PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
> > PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
> > PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
> >
> > Overall, I can see the parallel heap scan in lazy vacuum has decent
> > scalability; in both test-1 and test-2, the execution time of heap
> > scan got ~4x faster with 4 parallel workers. On the other hand, when
> > it comes to the total vacuum execution time, I could not see much
> > performance improvement in test-2 (45.11 vs. 35.84). Looking at the
> > results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
> > (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04),
> > and heap scan in case 2 was not as fast as in case 1 with 1 parallel
> > worker (12.75 vs. 11.39).
> >
> > I think the reason is that the shared TidStore is not very scalable
> > since we have a single lock on it. In all cases in test-1, we don't
> > use the shared TidStore since all dead tuples are removed during heap
> > pruning. So the scalability was better overall than in test-2. In the
> > parallel 0 case in test-2, we use the local TidStore, and from a
> > parallel degree of 1 in test-2, we use the shared TidStore and
> > parallel workers concurrently update it. Also, I guess that the lookup
> > performance of the local TidStore is better than the shared TidStore's
> > lookup performance because of the differences between a bump context
> > and a DSA area. I think that this difference contributed to the fact
> > that index vacuuming got slower (16.74 vs. 22.04).
> >

Thank you for the comments!

> > There are two obvious ideas to improve overall vacuum
> > execution time: (1) improve the shared TidStore scalability and (2)
> > support parallel heap vacuum. For (1), several ideas are proposed by
> > the ART authors[1]. I've not tried these ideas but they might be
> > applicable to our ART implementation. But I prefer to start with (2)
> > since it would be easier. Feedback is very welcome.
> >
>
> Starting with (2) sounds like a reasonable approach. We should study a
> few more things like (a) the performance results where there are 3-4
> indexes,

Here are the results with 4 indexes, restarting the server before the
benchmark (total time, then heap scan, index vacuum, and heap vacuum
times):

PARALLEL 0: 115.48 s (32.76, 64.46, 18.24)
PARALLEL 1: 74.88 s (17.11, 44.43, 13.25)
PARALLEL 2: 71.15 s (14.13, 44.82, 12.12)
PARALLEL 3: 46.78 s (10.74, 24.50, 11.43)
PARALLEL 4: 46.42 s (8.95, 24.96, 12.39) (launched 4 workers for heap
scan and 3 workers for index vacuum)
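
As a rough reading aid (not part of the patch or the benchmark script), the per-phase speedups relative to PARALLEL 0 can be computed from the numbers above. This is a minimal sketch; the names RESULTS_4_INDEXES, PHASES, and phase_speedups are made up for illustration.

```python
# Timings in seconds, copied from the 4-index results above:
# degree -> (total, heap scan, index vacuum, heap vacuum)
RESULTS_4_INDEXES = {
    0: (115.48, 32.76, 64.46, 18.24),
    1: (74.88, 17.11, 44.43, 13.25),
    2: (71.15, 14.13, 44.82, 12.12),
    3: (46.78, 10.74, 24.50, 11.43),
    4: (46.42, 8.95, 24.96, 12.39),
}
PHASES = ("total", "heap scan", "index vacuum", "heap vacuum")


def phase_speedups(results, baseline=0):
    """Return degree -> tuple of (baseline time / time) for each column."""
    base = results[baseline]
    return {
        degree: tuple(b / t for b, t in zip(base, times))
        for degree, times in results.items()
    }


if __name__ == "__main__":
    for degree, ratios in sorted(phase_speedups(RESULTS_4_INDEXES).items()):
        cells = ", ".join(f"{name} {r:.2f}x" for name, r in zip(PHASES, ratios))
        print(f"PARALLEL {degree}: {cells}")
```

At PARALLEL 4 this works out to roughly 2.5x for the total, 3.7x for the heap scan, 2.6x for index vacuum, and 1.5x for heap vacuum.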

> (b) What is the reason for performance improvement seen with
> only heap scans. We normally get benefits of parallelism because of
> using multiple CPUs but parallelizing scans (I/O) shouldn't give much
> benefits. Is it possible that you are seeing benefits because most of
> the data is either in shared_buffers or in memory? We can probably try
> vacuuming tables by restarting the nodes to ensure the data is not in
> memory.

I think it depends on the storage performance. FYI I use an EC2
instance (m6id.metal).

I've rerun the same benchmark script (table with no indexes),
restarting the server before executing the vacuum, and here are the
results:

PARALLEL 0: 32.75 s
PARALLEL 1: 17.46 s
PARALLEL 2: 13.41 s
PARALLEL 3: 10.31 s
PARALLEL 4: 8.48 s
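
To relate this to question (b), here is a minimal sketch comparing the heap scan speedup in the earlier no-index run (quoted above, without a restart) against this run after a restart. The dictionary and function names are made up for illustration, and the earlier run may have been measured with an earlier patch version, so the comparison is indicative only.

```python
# Timings in seconds for the no-index table, copied from this thread:
# parallel degree -> vacuum time
WITHOUT_RESTART = {0: 21.99, 1: 11.39, 2: 8.36, 3: 6.14, 4: 5.08}
AFTER_RESTART = {0: 32.75, 1: 17.46, 2: 13.41, 3: 10.31, 4: 8.48}


def scan_speedup(times, baseline=0):
    """Return degree -> (baseline time / time at that degree)."""
    return {degree: times[baseline] / t for degree, t in times.items()}


if __name__ == "__main__":
    before = scan_speedup(WITHOUT_RESTART)
    after = scan_speedup(AFTER_RESTART)
    for degree in sorted(before):
        print(
            f"PARALLEL {degree}: "
            f"{before[degree]:.2f}x without restart, "
            f"{after[degree]:.2f}x after restart"
        )
```

At PARALLEL 4 the speedup is roughly 4.3x without a restart and roughly 3.9x after a restart, so in these numbers the scan still scales when the data has to be read from storage.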

For both restart tests above (with 4 indexes and with no index), I used
the updated patch that I just submitted[1].

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAWHHnCg9OvtoEJnnvCc-3isyOyAGn%2B2KYoSXEv%3DvXauw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
