Re: Parallel CREATE INDEX for GIN indexes

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To:
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date: 2024-07-05 19:45:31
Message-ID: CAEze2WiTAeZe4t5wAeRN834xFBqROPmjeK2XTstNko6bbVPX=A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 3 Jul 2024 at 20:36, Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
>
> On Mon, 24 Jun 2024 at 02:58, Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> > So I've switched this to use the regular data-type comparisons, with
> > SortSupport etc. There's a bit more cleanup remaining and testing
> > needed, but I'm not aware of any bugs.

I've hit assertion failures in my testing of the combined patches, in
AssertCheckItemPointers: it assumes it's never called when the buffer
is empty and uninitialized, but that's wrong: we don't initialize the
items array until the first tuple, which will cause the assertion to
fire. By updating the first 2 assertions of AssertCheckItemPointers, I
could get it working.

> ---
> > +++ b/src/backend/utils/sort/tuplesortvariants.c
>
> I was thinking some more about merging tuples inside the tuplesort. I
> realized that this could be implemented by allowing buffering of tuple
> writes in writetup. This would require adding a flush operation at the
> end of mergeonerun to store the final unflushed tuple on the tape, but
> that shouldn't be too expensive. This buffering, when implemented
> through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
> merge the TID lists of same-valued GIN tuples while they're getting
> stored and re-sorted, thus reducing the temporary space usage of the
> tuplesort by some amount with limited overhead for other
> non-deduplicating tuplesorts.
>
> I've not yet spent the time to get this to work though, but I'm fairly
> sure it'd use less temporary space than the current approach with the
> 2 tuplesorts, and could have lower overall CPU overhead as well
> because the number of sortable items gets reduced much earlier in the
> process.

I've now spent some time on this. Attached the original patchset, plus
2 incremental patches, the first of which implement the above design
(patch no. 8).

Local tests show it's significantly faster: for the below test case
I've seen reindex time reduced from 777455ms to 626217ms, or ~20%
improvement.

After applying the 'reduce the size of GinTuple' patch, index creation
time is down to 551514ms, or about 29% faster total. This all was
tested with a fresh stock postgres configuration.

"""
CREATE UNLOGGED TABLE testdata
AS SELECT sha256(i::text::bytea)::text
FROM generate_series(1, 15000000) i;
CREATE INDEX ON testdata USING gin (sha256 gin_trgm_ops);
"""

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachment Content-Type Size
v20240705-0003-Remove-the-explicit-pg_qsort-in-workers.patch application/x-patch 10.2 KB
v20240705-0004-Compress-TID-lists-before-writing-tuples-t.patch application/x-patch 7.9 KB
v20240705-0005-Collect-and-print-compression-stats.patch application/x-patch 5.7 KB
v20240705-0002-Use-mergesort-in-the-leader-process.patch application/x-patch 12.6 KB
v20240705-0001-Allow-parallel-create-for-GIN-indexes.patch application/x-patch 61.7 KB
v20240705-0006-Enforce-memory-limit-when-combining-tuples.patch application/x-patch 14.0 KB
v20240705-0009-Reduce-the-size-of-GinTuple-by-12-bytes.patch application/x-patch 5.5 KB
v20240705-0007-Detect-wrap-around-in-parallel-callback.patch application/x-patch 8.1 KB
v20240705-0008-Use-a-single-GIN-tuplesort.patch application/x-patch 32.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2024-07-05 19:50:39 Re: remove check hooks for GUCs that contribute to MaxBackends
Previous Message Marat Bukharov 2024-07-05 19:43:56 Re: [PATCH] Add min/max aggregate functions to BYTEA