Re: Parallel CREATE INDEX for GIN indexes

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Kirill Reshke <reshkekirill(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date: 2025-03-07 02:08:10
Message-ID: CAEze2WhSVuRf7yQtvDSpuWsXGjUvR=KGZfrxS+5_mjq0sstR2Q@mail.gmail.com

On Tue, 4 Mar 2025 at 20:50, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> I pushed the two smaller parts today.
>
> Here's the remaining two parts, to keep cfbot happy. I don't expect to
> get these into PG18, though.

As promised on- and off-list, here's the 0001 patch, polished, split,
and further adapted for performance.

As seen before, it reduces tempspace requirements by up to 50%. I've
not tested this against HEAD for performance.

It has been split into:

0001: Some API cleanup/changes that crept into the patch. This
removes manual length-passing from the gin tuplesort APIs, instead
relying on GinTuple's tuplen field. It's not critical for anything,
and could be ignored if so desired.
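
To illustrate the idea behind 0001 (this is not code from the patch): when a
tuple carries its own total length, callers no longer need to pass a separate
size argument. The sketch below uses a simplified stand-in, DemoTuple, rather
than the real GinTuple; the names demo_tuple_build and demo_put are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for GinTuple: a length header plus payload. */
typedef struct DemoTuple
{
    int  tuplen;    /* total size in bytes, header included */
    char data[];    /* variable-length payload */
} DemoTuple;

/* Build a tuple; the total length is recorded once, in the tuple itself. */
static DemoTuple *
demo_tuple_build(const char *payload, size_t payload_len)
{
    size_t      total = offsetof(DemoTuple, data) + payload_len;
    DemoTuple  *tup = malloc(total);

    assert(tup != NULL);
    tup->tuplen = (int) total;
    memcpy(tup->data, payload, payload_len);
    return tup;
}

/*
 * Copy a tuple into a destination buffer. There is no separate length
 * parameter: the size is read from tup->tuplen. Returns bytes written.
 */
static size_t
demo_put(char *dest, const DemoTuple *tup)
{
    memcpy(dest, tup, (size_t) tup->tuplen);
    return (size_t) tup->tuplen;
}
```

With this shape, a caller cannot pass a length that disagrees with the tuple,
which is the main benefit of dropping the explicit argument.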

0002: Tuplesort changes to allow tuplesort users to buffer and merge
tuples during the sort operation.
The patch was pulled directly from [0] (which was derived from earlier
work in this thread), is fairly easy to understand, and has no other
moving parts.

0003: Deduplication in tuplesort's flush-to-disk actions, using the
API introduced in 0002.
This improves temporary disk usage by deduplicating data even further,
for when there's a lot of duplicated data but the data has enough
distinct values to not fit in the available memory.
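
As a rough sketch of the deduplication described for 0003 (not the patch's
actual code): before a sorted in-memory run is written out, adjacent entries
with the same key can be collapsed into one entry holding all their items.
DemoEntry, DEMO_MAX_ITEMS, and demo_merge_before_flush are hypothetical names,
and the fixed-size item array is a simplification.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_ITEMS 8

/* Simplified stand-in for a GIN entry: one key plus a short item list. */
typedef struct DemoEntry
{
    int key;
    int nitems;
    int items[DEMO_MAX_ITEMS];
} DemoEntry;

/*
 * Collapse adjacent entries with equal keys, in place, in an array that is
 * already sorted by key. Items are appended in input order; a real GIN
 * implementation would need to merge TID lists in sorted order instead.
 * An entry that would overflow the item array is kept as a separate entry.
 * Returns the number of entries left to flush.
 */
static size_t
demo_merge_before_flush(DemoEntry *entries, size_t n)
{
    size_t out = 0;

    assert(entries != NULL || n == 0);
    if (n == 0)
        return 0;

    for (size_t i = 1; i < n; i++)
    {
        DemoEntry       *dst = &entries[out];
        const DemoEntry *src = &entries[i];

        if (src->key == dst->key &&
            dst->nitems + src->nitems <= DEMO_MAX_ITEMS)
        {
            /* Same key and enough room: fold src into dst. */
            for (int j = 0; j < src->nitems; j++)
                dst->items[dst->nitems++] = src->items[j];
        }
        else
        {
            /* New key (or no room): start the next output entry. */
            out++;
            if (out != i)
                entries[out] = *src;
        }
    }
    return out + 1;
}
```

The fewer distinct entries reach the flush, the less temporary disk space the
run needs, which matches the motivation given above.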

0004: Use a single tuplesort. This removes the worker-local tuplesort
in favor of only storing data in the global one.

This mainly reduces the code size and complexity of parallel GIN
builds; we already were using that global sort for various tasks.

Open questions and open items for this:
- I have not yet updated the pg_stat_progress systems or the docs.
- Maybe 0003 needs further splitting up, one for the optimizations in
GinBuffer, one for the tuplesort buffering.
- Maybe we need to trim the buffer in gin's tuplesort flush?
- Maybe we should grow the GinBuffer->items array superlinearly rather
than to the exact size requirement of the merge operation.
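
The last item, about growing the items array, can be illustrated with a small
sketch (an assumption about what "superlinearly" could look like, not code
from the patch): growing to the exact required size reallocates on every
merge, while doubling the capacity reallocates only O(log n) times.
DemoItems, demo_reserve, and demo_append are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for a growable items array. */
typedef struct DemoItems
{
    int    *items;
    size_t  nitems;
    size_t  capacity;
    size_t  nreallocs;  /* counts reallocations, for comparison only */
} DemoItems;

/*
 * Make room for `extra` more items. With superlinear == 0 the array grows
 * to exactly the required size; otherwise capacity doubles until it fits.
 */
static void
demo_reserve(DemoItems *buf, size_t extra, int superlinear)
{
    size_t  needed = buf->nitems + extra;
    size_t  newcap = needed;
    int    *p;

    if (needed <= buf->capacity)
        return;

    if (superlinear)
    {
        newcap = buf->capacity ? buf->capacity : 16;
        while (newcap < needed)
            newcap *= 2;
    }

    p = realloc(buf->items, newcap * sizeof(int));
    assert(p != NULL);
    buf->items = p;
    buf->capacity = newcap;
    buf->nreallocs++;
}

/* Append n items, growing the array according to the chosen strategy. */
static void
demo_append(DemoItems *buf, const int *src, size_t n, int superlinear)
{
    demo_reserve(buf, n, superlinear);
    for (size_t i = 0; i < n; i++)
        buf->items[buf->nitems++] = src[i];
}
```

The trade-off is memory: doubling can leave up to half of the array unused,
which interacts with the earlier question about trimming the buffer at flush.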

Apart from the complexities in 0003, I think the changes are fairly
straightforward.

I did not include 0002 from the earlier patch set, as it was WIP and its
feature explicitly conflicts with my 0004.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] https://www.postgresql.org/message-id/CAEze2WhRFzd=nvh9YevwiLjrS1j1fP85vjNCXAab=iybZ2rNKw@mail.gmail.com

Attachment Content-Type Size
v20250307-0004-Make-Gin-parallel-builds-use-a-single-tupl.patch application/octet-stream 7.0 KB
v20250307-0002-Allow-tuplesort-implementations-to-buffer-.patch application/octet-stream 6.0 KB
v20250307-0001-Remove-size-argument-from-GIN-tuplesort-in.patch application/octet-stream 5.6 KB
v20250307-0003-Merge-GinTuples-during-tuplesort-before-fl.patch application/octet-stream 29.8 KB
