From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Kirill Reshke <reshkekirill(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date: 2025-02-19 21:54:16
Message-ID: f600202f-89e2-4567-b0b8-101b7c9b07dd@vondra.me
Lists: pgsql-hackers
Hi,
After stress-testing all the patches (which yielded no issues except for
the barrier hang in 0005, which is not for commit yet), I proceeded to
do some basic perf testing.
I simply built a bunch of GIN indexes on a database with current mailing
list archives. The database is ~23GB, and the indexes were these:
CREATE INDEX headers_jsonb_path_idx
ON messages USING gin (msg_headers jsonb_path_ops);
CREATE INDEX headers_jsonb_idx
ON messages USING gin (msg_headers);
CREATE INDEX subject_trgm_idx
ON messages USING gin (msg_subject gin_trgm_ops);
CREATE INDEX body_tsvector_idx
ON messages USING gin (msg_body_tsvector);
CREATE INDEX subject_tsvector_idx
ON messages USING gin (msg_subject_tsvector);
So the indexes are on different data types, columns of different sizes, etc. I did this on my two machines:
1) xeon - 44 cores, but old (~2016)
2) ryzen - 12 cores, brand new CPU (2024)
And I ran the CREATE INDEX with a range of worker counts (0, 1, 4, ...). The count was set using ALTER TABLE, which sets the number of workers directly, without the additional plan_create_index_workers() heuristics. There were always enough workers available to satisfy this.
The m_w_m (maintenance_work_mem) was set to 1GB for all runs, which should leave "enough" memory for up to 32 workers (plan_create_index_workers leaves at least 32MB per worker).
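For reference, each run looked roughly like this (a sketch, not the exact script; 4 workers and subject_trgm_idx are just examples):
SET maintenance_work_mem = '1GB';
-- max_parallel_maintenance_workers / max_parallel_workers were high enough
-- that the requested number of workers was always actually launched
ALTER TABLE messages SET (parallel_workers = 4);
DROP INDEX IF EXISTS subject_trgm_idx;
CREATE INDEX subject_trgm_idx
ON messages USING gin (msg_subject gin_trgm_ops);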
The results are in the attached PDF tables. I think the results are
mostly as expected ...
timing
------
For the "timing" charts, there are two colored sections. The first shows
"comparison to 0 workers" (i.e. serial build), and then "comparison to
ideal speedup" (essentially time/(N+1), where N is the number of
workers). In both cases green=good, red=bad.
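To make the "ideal speedup" baseline concrete: a hypothetical serial build taking 600 seconds would ideally finish in 600 / (4+1) = 120 seconds with 4 workers, since the leader process participates in the build as well (the 600s figure is made up, not from the attached results).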
The "patch" is the number of patch in the patch series, without the "0"
prefix. Patch "0" means "master" without patches.
How much parallelism helps depends on the column. For some columns (body_trgm, subject_trgm, subject_tsvector) it helps a lot; for others it's less beneficial. But it helps in all cases, cutting the duration (at least) in half.
On both machines the performance stops improving at ~4 workers. I guess
that's expected, and AFAICS we wouldn't really try to use more workers
for these index builds anyway.
One thing I don't quite understand is that on the ryzen machine, this
also seems to speed up patch "0" (i.e. master with no parallel builds).
At first I thought it was just random run-to-run noise, but looking at those results it doesn't seem to be the case. E.g. for body_trgm_idx it
changes from ~686 seconds to ~634 seconds. For the other columns it's
less significant, but still pretty consistent.
On the xeon machine this doesn't happen at all.
I don't have a great explanation for this, because the patch does not
modify serial builds at all. The only idea I have is a change in binary layout between builds, but that's just "I don't know" in disguise.
temporary files
---------------
The other set of charts, "temporary MB", shows the amount of temporary file data produced with each of the patches. It's not showing "patch 0" (aka master), because serial builds don't use temp files at all. The % values are relative to "patch 1".
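One simple way to observe the temp file volume (a sketch; not necessarily how the attached numbers were collected) is to log all temporary files and check the per-database counters before/after each build:
SET log_temp_files = 0;   -- log every temporary file, regardless of size
-- temp_files / temp_bytes accumulate per database, so compare before/after
SELECT temp_files, temp_bytes
FROM pg_stat_database
WHERE datname = current_database();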
The 0002 patch is the compression, and that helps a lot (though it depends on the column). 0003 is just about enforcing the memory limit; it does not affect the temporary files at all.
Then 0004 "single tuplesort" does help a lot too, sometimes cutting the
amount in half. Which makes sense, because we suddenly don't need to
shuffle data between two tuplesorts.
But the results of 0005 are a bit bizarre - it mostly undoes the 0004
benefits, for some reason. I wonder why.
Anyway, I'm mostly happy about how this performs for 0001-0003, which
are the parts I plan to push in the coming days.
regards
--
Tomas Vondra
Attachments:
  ryzen-temporary-mb.pdf   (application/pdf, 43.3 KB)
  ryzen-timing.pdf         (application/pdf, 50.1 KB)
  xeon-temporary-mb.pdf    (application/pdf, 42.6 KB)
  xeon-timing.pdf          (application/pdf, 51.7 KB)