Quick Links

Re: Parallel CREATE INDEX for GIN indexes

From:	Tomas Vondra <tomas(at)vondra(dot)me>
To:	Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc:	Kirill Reshke <reshkekirill(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Parallel CREATE INDEX for GIN indexes
Date:	2025-02-26 20:08:39
Message-ID:	7421a646-d938-4558-b001-85d217eb43a5@vondra.me
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

While working on the progress reporting, I've been looking into the
performance results, particularly why the parallelism doesn't help much
for some indexes - e.g. the index on the headers JSONB column.

CREATE INDEX headers_jsonb_idx
ON messages USING gin (msg_headers);

In this case the parallelism helps only a little bit - serial build
takes ~47 seconds, parallel builds with 1 worker (so 2 with leader)
takes ~40 seconds. Not great.

There are two reasons for this. First, the "keys" (JSONB values) are
mostly unique, with only 1 or 2 TIDs per key, which means the workers
can't really do much merging. But shifting the merges to workers is the
main benefit of parallel builds - if the merge happens in the leader
anyway, this explains the lack of speedup.

The other reason is that with JSON keys the comparisons are rather
expensive, and we're comparing a lot of keys. It occurred to me we can
work around this by comparing hashes first, and comparing the full keys
only when the hashes match. And indeed, this helps a lot (there's a very
rough PoC patch attached) - I'm seeing ~20% speedup from this, so the
parallel build runs in ~30 seconds now. Still not quite serial speedup,
but better than before.

But I think this optimization is mostly orthogonal to parallel builds,
i.e. we could do the same thing for serial builds (while accumulating
data in memory, we could do these comparisons). But it needs to be
careful about still writing the data out in the "natural" order, not
ordered by hash. The hash randomizes the pattern, making it much less
efficient for bulk inserts (it trashes the buffers, etc.). The PoC patch
for parallel builds addresses this by ignoring the hash during the final
tuplesort, the serial builds would need to do something similar.

My conclusion is this can be left as a future improvement, independent
of the parallel builds.

regards

--
Tomas Vondra

Attachment	Content-Type	Size
poc-gin-build-hashing.patch	text/x-patch	15.8 KB

In response to

Re: Parallel CREATE INDEX for GIN indexes at 2025-02-25 15:49:03 from Tomas Vondra

Responses

Re: Parallel CREATE INDEX for GIN indexes at 2025-03-03 16:16:34 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2025-02-26 20:15:19	Re: Statistics Import and Export
Previous Message	Sami Imseih	2025-02-26 20:00:53	Re: Doc: clarify possibility of ephemeral discrepancies between state and wait_event in pg_stat_activity