Re: Parallel heap vacuum

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parallel heap vacuum
Date: 2024-12-12 23:04:02
Message-ID: 48f932af-f9d6-4dbe-bea6-cec5349e1937@vondra.me
Lists: pgsql-hackers

On 12/9/24 19:47, Tomas Vondra wrote:
> Hi,
>
> Thanks for working on this. I took a quick look at this today, to do
> some basic review. I plan to do a bunch of testing, but that's mostly to
> get a better idea of what kind of improvements to expect - the initial
> results look quite nice and sensible.
>

I worked on the benchmarks/testing, mostly to get an idea of how
effective this vacuum parallelism is. But I noticed something a bit
weird ...

Attached is a bash script I used for the testing - it measures vacuum
with varying numbers of indexes, numbers of deleted tuples, WAL logging,
etc. And it does that on both master and patched builds, with different
numbers of vacuum workers.

It does expect databases "test-small-logged" and "test-small-unlogged",
initialized like this:

create [unlogged] table test_vacuum (a bigint)
with (autovacuum_enabled=off);

insert into test_vacuum select i from generate_series(1,100000000) s(i);

create index idx_0 on test_vacuum (a);
create index idx_1 on test_vacuum (a);
create index idx_2 on test_vacuum (a);
create index idx_3 on test_vacuum (a);
create index idx_4 on test_vacuum (a);

That's a ~2GB table with a bunch of indexes. Not massive, not tiny.

I wanted to run this on larger datasets too, but for now I have the
small dataset.

One of the things the tests change is the fraction of pages with deleted
rows. The DELETE has

... WHERE mod(a, M) = 0

where "a" is the bigint column with sequential values. There are ~230
rows per page, so M determines what fraction of the pages gets a DELETE.
With M=100 each page gets ~2 deleted rows, with M=500 we get a page with
a delete, then a clean page, and so on. Similarly for M=1000 and M=5000.
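
Roughly, each measured combination then boils down to a step like this
(just a sketch, here for M=5000 and a single worker - the exact steps
are in the attached script):

-- delete one row in every 5000, i.e. roughly one row per ~22 pages
delete from test_vacuum where mod(a, 5000) = 0;

-- time the vacuum, here requesting a single parallel worker
vacuum (verbose, parallel 1) test_vacuum;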

Attached are results.csv with the raw data, and a PDF showing the
difference between the master and patched builds with a varying number
of workers. The columns on the right show timing relative to master
(with no parallel workers). Green means "faster" and red would mean
"slower" (but there are no such cases). 50% means "half the time",
i.e. "twice as fast".

And for M=100 and M=500 the results look quite sensible. But for higher
values of M (i.e. smaller fraction of the table DELETED) things get a
bit strange, especially for the runs with 0 indexes.

Consider, for example, these runs from the i5 machine with M=5000:

           master            patched (workers)
  indexes       0       0       1       2       3       4       6       8
  -----------------------------------------------------------------------
        0    2.58    2.75    0.17    0.19    0.16    0.24    0.20    0.19

On master it takes 2.58s, and on patched build (0 workers) it's ~2.75s,
so about the same (single run, so the difference is just noise).

But then with 1 worker it drops to 0.17s. That's ~15x faster, but we
only added one worker, so the best we could expect is 2x. Either there's
a bug that skips some work, or the master code is horribly inefficient.

The reason for the difference is this - on master, the vacuum verbose
log looks like this:

---
INFO: vacuuming "test.public.test_vacuum"
INFO: finished vacuuming "test.public.test_vacuum": index scans: 0
pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
tuples: 10000 removed, 49590000 remain, 0 are dead but not yet removable
removable cutoff: 20088, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
avg read rate: 642.429 MB/s, avg write rate: 30.650 MB/s
buffer usage: 231616 hits, 210965 reads, 10065 dirtied
WAL usage: 30058 records, 10065 full page images, 72101687 bytes
system usage: CPU: user: 2.29 s, system: 0.27 s, elapsed: 2.56 s
---

and on patched with no parallelism it's almost the same:

---
INFO: vacuuming "test.public.test_vacuum"
INFO: finished vacuuming "test.public.test_vacuum": index scans: 0
pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)
tuples: 10000 removed, 49570000 remain, 0 are dead but not yet removable
removable cutoff: 20094, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
avg read rate: 602.557 MB/s, avg write rate: 28.748 MB/s
buffer usage: 231620 hits, 210961 reads, 10065 dirtied
WAL usage: 30058 records, 10065 full page images, 71578455 bytes
system usage: CPU: user: 2.42 s, system: 0.30 s, elapsed: 2.73 s
---

But then for vacuum (parallel 1) it changes like this:

---
INFO: vacuuming "test.public.test_vacuum"
INFO: launched 1 parallel vacuum worker for table processing (planned: 1)
INFO: finished vacuuming "test.public.test_vacuum": index scans: 0
pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)
tuples: 10000 removed, 49137961 remain, 0 are dead but not yet removable
removable cutoff: 20107, which was 0 XIDs old when operation ended
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s
buffer usage: 25175 hits, 0 reads, 10065 dirtied
WAL usage: 30058 records, 10065 full page images, 70547639 bytes
system usage: CPU: user: 0.07 s, system: 0.02 s, elapsed: 0.14 s
---

The main difference is here:

master / no parallel workers:

pages: 0 removed, 221239 remain, 221239 scanned (100.00% of total)

1 parallel worker:

pages: 0 removed, 221239 remain, 10001 scanned (4.52% of total)

Clearly, with parallel vacuum we scan only a tiny fraction of the pages,
essentially just those with deleted tuples, which is ~1/20 of pages.
That's close to the 15x speedup.
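
Just to put rough numbers on it, using the log output and the table
above:

  221239 / 10001 scanned pages ~= 22x fewer pages to process
  2.75 s / 0.17 s              ~= 16x speedup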

This effect is clearest without indexes, but it affects the runs with
indexes too - having to scan the indexes just makes it much less
pronounced. And these indexes are pretty massive - each is about the
same size as the table, so combined they're multiple times larger than
the table itself. Chances are the effect would be clearer on more
realistic data sets.

So the question is - is this correct? And if yes, why doesn't the
regular (serial) vacuum do that?

There are a few more strange things, though. For example, how come the
avg read rate is 0.000 MB/s?

avg read rate: 0.000 MB/s, avg write rate: 525.533 MB/s

It scanned 10k pages, i.e. ~80MB of data in 0.15 seconds. Surely that's
not 0.000 MB/s? I guess it's calculated from buffer misses, and all the
pages are in shared buffers (thanks to the DELETE earlier in that session).
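
Assuming the rates really are derived from buffer reads/dirties (and the
default 8kB block size), the numbers do add up:

  read rate  =     0 reads   * 8 kB / 0.15 s  =   0 MB/s
  write rate = 10065 dirtied * 8 kB / 0.15 s ~= 525 MB/s

which matches the reported 525.533 MB/s, so the 0.000 MB/s really just
reflects zero buffer misses.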

regards

--
Tomas Vondra

Attachment            Content-Type                 Size
parallel-vacuum.sh    application/x-shellscript    3.8 KB
parallel-vacuum.sql   application/sql              335 bytes
results.csv           text/csv                     71.1 KB
