Quick Links

Re: Parallel CREATE INDEX for GIN indexes

From:	Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To:
Cc:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Parallel CREATE INDEX for GIN indexes
Date:	2024-07-05 19:45:31
Message-ID:	CAEze2WiTAeZe4t5wAeRN834xFBqROPmjeK2XTstNko6bbVPX=A@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, 3 Jul 2024 at 20:36, Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
>
> On Mon, 24 Jun 2024 at 02:58, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> > So I've switched this to use the regular data-type comparisons, with
> > SortSupport etc. There's a bit more cleanup remaining and testing
> > needed, but I'm not aware of any bugs.

I've hit assertion failures in my testing of the combined patches, in
AssertCheckItemPointers: it assumes it's never called when the buffer
is empty and uninitialized, but that's wrong: we don't initialize the
items array until the first tuple, which will cause the assertion to
fire. By updating the first 2 assertions of AssertCheckItemPointers, I
could get it working.

> ---
> > +++ b/src/backend/utils/sort/tuplesortvariants.c
>
> I was thinking some more about merging tuples inside the tuplesort. I
> realized that this could be implemented by allowing buffering of tuple
> writes in writetup. This would require adding a flush operation at the
> end of mergeonerun to store the final unflushed tuple on the tape, but
> that shouldn't be too expensive. This buffering, when implemented
> through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
> merge the TID lists of same-valued GIN tuples while they're getting
> stored and re-sorted, thus reducing the temporary space usage of the
> tuplesort by some amount with limited overhead for other
> non-deduplicating tuplesorts.
>
> I've not yet spent the time to get this to work though, but I'm fairly
> sure it'd use less temporary space than the current approach with the
> 2 tuplesorts, and could have lower overall CPU overhead as well
> because the number of sortable items gets reduced much earlier in the
> process.

I've now spent some time on this. Attached the original patchset, plus
2 incremental patches, the first of which implement the above design
(patch no. 8).

Local tests show it's significantly faster: for the below test case
I've seen reindex time reduced from 777455ms to 626217ms, or ~20%
improvement.

After applying the 'reduce the size of GinTuple' patch, index creation
time is down to 551514ms, or about 29% faster total. This all was
tested with a fresh stock postgres configuration.

"""
CREATE UNLOGGED TABLE testdata
AS SELECT sha256(i::text::bytea)::text
FROM generate_series(1, 15000000) i;
CREATE INDEX ON testdata USING gin (sha256 gin_trgm_ops);
"""

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachment	Content-Type	Size
v20240705-0003-Remove-the-explicit-pg_qsort-in-workers.patch	application/x-patch	10.2 KB
v20240705-0004-Compress-TID-lists-before-writing-tuples-t.patch	application/x-patch	7.9 KB
v20240705-0005-Collect-and-print-compression-stats.patch	application/x-patch	5.7 KB
v20240705-0002-Use-mergesort-in-the-leader-process.patch	application/x-patch	12.6 KB
v20240705-0001-Allow-parallel-create-for-GIN-indexes.patch	application/x-patch	61.7 KB
v20240705-0006-Enforce-memory-limit-when-combining-tuples.patch	application/x-patch	14.0 KB
v20240705-0009-Reduce-the-size-of-GinTuple-by-12-bytes.patch	application/x-patch	5.5 KB
v20240705-0007-Detect-wrap-around-in-parallel-callback.patch	application/x-patch	8.1 KB
v20240705-0008-Use-a-single-GIN-tuplesort.patch	application/x-patch	32.8 KB

In response to

Re: Parallel CREATE INDEX for GIN indexes at 2024-07-03 18:36:26 from Matthias van de Meent

Responses

Re: Parallel CREATE INDEX for GIN indexes at 2024-07-07 16:26:03 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Nathan Bossart	2024-07-05 19:50:39	Re: remove check hooks for GUCs that contribute to MaxBackends
Previous Message	Marat Bukharov	2024-07-05 19:43:56	Re: [PATCH] Add min/max aggregate functions to BYTEA