Re: Parallel CREATE INDEX for GIN indexes

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Andy Fan <zhihuifan1213(at)163(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date: 2024-08-27 13:28:12
Message-ID: CAEze2WjWdkR8s4L3dQsjPkoB4mkW-X8fB7oOZr58QfiwQbpRTQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, 27 Aug 2024 at 12:15, Andy Fan <zhihuifan1213(at)163(dot)com> wrote:
>
> Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> writes:
> > And let's talk about the improvement by Matthias, namely:
> >
> > * 0008 Use a single GIN tuplesort
> > * 0009 Reduce the size of GinTuple by 12 bytes
> >
> > I haven't really seen any impact on duration - it seems more or less
> > within noise. Maybe it would be different on machines with less RAM, but
> > on my two systems it didn't really make a difference.
> >
> > It did significantly reduce the amount of temporary data written, by
> > ~40% or so. This is pretty nicely visible on the "trgm" case, which
> > generates the most temp files of the four indexes. An example from the
> > i5/32MB section looks like this:
> >
> > label      0000  0001  0002  0003  0004  0005  0006  0007  0008  0009  0010
> > ---------------------------------------------------------------------------
> > trgm / 3      0  2635  3690  3715  1177  1177  1179  1179   696   682  1016
>
> After seeing the above data, I wanted to know where the time is spent
> and why the ~40% I/O reduction doesn't yield a measurable duration
> improvement, so I ran the following test.
[...]
> ==== Here are my definitions of stages 1/2/3.
> stage 1:
> reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
>                                    ginBuildCallbackParallel, state, scan);
>
> /* write remaining accumulated entries */
> ginFlushBuildState(state, index);
>
> stage 2:
> _gin_process_worker_data(state, state->bs_worker_sort)
>
> stage 3:
> tuplesort_performsort(state->bs_sortstate);

Note that tuplesort does most of its sorting and I/O work while it is
receiving tuples, which in this case happens during
table_index_build_scan. tuplesort_performsort usually only needs to
flush the last elements of a sort that it still has in memory, so it is
generally a cheap operation bounded by maintenance work memory, and
definitely not representative of the total cost of sorting the data. It
can take longer when it has to merge sorted runs, but in my experience
those cases are quite rare.
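
For reference, this is the usual shape of that life cycle (a minimal
sketch using the existing tuplesort.h API for a btree index build;
heap, index, tid, values and isnull stand in for what the build-scan
callback provides):

    Tuplesortstate *ts = tuplesort_begin_index_btree(heap, index,
                                                     false, false,
                                                     maintenance_work_mem,
                                                     NULL, TUPLESORT_NONE);

    /* Called once per tuple from the build-scan callback; most of the
     * CPU and temp-file I/O happens here, as each filled sort buffer
     * is sorted and spilled to tape. */
    tuplesort_putindextuplevalues(ts, index, &tid, values, isnull);

    /* Flushes whatever is still in memory; merges runs only if needed. */
    tuplesort_performsort(ts);

    /* Read back tuples in sorted order, then release resources. */
    while ((itup = tuplesort_getindextuple(ts, true)) != NULL)
    {
        /* insert itup into the index */
    }
    tuplesort_end(ts);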

> But 0008 still does a lot of good stuff:
> 1. Reduces I/O usage, which would be even more useful for heavily
> I/O-bound workloads.
> 2. Simplifies the build logic by removing one stage.
> 3. Adds 'buffer-writetup' to tuplesort.c. I don't have another use
> case for it right now, but it looks like a reasonable design.

I'd imagine nbtree would like to use this too, for applying some
deduplication in the sort stage. The IO benefits are quite likely to be
worth it; a minimum space saving of 25% on duplicated key values in
tuple sorts sounds really great. And it doesn't even have to merge all
duplicates: even if you only merge 10 tuples at a time, the space
saving on those duplicates would be at least 47% on 64-bit systems.
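
For a rough illustration (my own back-of-envelope, not necessarily the
exact accounting behind the figures above): with an 8-byte key datum
and a 6-byte TID per tuple, and ignoring per-tuple header and alignment
overheads, 10 unmerged duplicates take 10 * (8 + 6) = 140 bytes, while
one merged tuple carrying a 10-entry TID list takes 8 + 10 * 6 = 68
bytes, i.e. roughly half. Headers and alignment shift the exact
percentage, but the saving stays in that ballpark.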

> I think the current blocker is whether it is safe to hack tuplesort.c.
> With my current knowledge it looks good to me, but it would be better
> to open a dedicated thread to discuss this specifically; the review
> would not take long if someone experienced in this area took a look.

I could adapt the patch for nbtree use, to see if anyone's willing to
review that?

> > Now, what's the 0010 patch about?
> >
> > For some indexes (e.g. trgm), the parallel builds help a lot, because
> > they produce a lot of temporary data and the parallel sort is a
> > substantial part of the work. But for other indexes (especially the
> > "smaller" indexes on jsonb headers), it's not that great. For example
> > for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.
> >
> > Clearly, this happens because a lot of time is spent outside the sort,
> > actually inserting data into the index.
>
> You always focus on the most important part, which inspires me a lot.
> Even in my simple testing, the "inserting data into index" stage takes
> ~40% of the time.

nbtree does sorted insertions into the tree, constructing leaf pages
one at a time and adding a separator key to the page above whenever a
leaf page fills up, thus removing the need to descend the btree for
every tuple. I imagine we could gain some performance by mirroring that
in GIN too, with the additional bonus that we'd be free to start
WAL-logging completed pages before we're done with the full index,
reducing the peak WAL throughput of GIN index creation.
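
Roughly the shape I have in mind, modeled on nbtsort.c's
_bt_buildadd(); all the gin_* helpers and types below are hypothetical,
purely to sketch the idea:

    /*
     * Keep only the rightmost page of each level open, append sorted
     * tuples to it, and finish a page whenever it fills up, pushing a
     * separator key into the (equally right-edge-only) level above.
     * Finished pages never change again, so they could be WAL-logged
     * immediately instead of all at once at the end of the build.
     */
    static void
    gin_build_add_sorted(GinBuildState *state, GinTuple *gtup)
    {
        GinPageState *leaf = state->right_edge[0];

        if (!gin_page_has_room(leaf, gtup))
        {
            gin_finish_page(state, leaf);       /* may XLOG right away */
            gin_push_separator(state, 1, gin_separator_key(leaf, gtup));
            leaf = gin_start_new_rightmost_page(state, 0);
        }

        gin_page_append(leaf, gtup);
    }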

> > I think they might be complementary. I don't think it's reasonable to
> > expect GIN's BuildAccumulator to buffer all the index tuples at the
> > same time (as I mentioned upthread: we are or should be limited by
> > work memory), but the BuildAccumulator will do a much better job at
> > combining tuples than the in-memory sort + merge-write done by
> > Tuplesort (because BA will use (much?) less memory for the same number
> > of stored values).
>
> Thank you Matthias for valuing my point! And thanks for highlighting
> the benefit that BuildAccumulator can do a better job of sorting in
> memory (I think mainly because BuildAccumulator can merge at run time
> as it accepts more tuples). But I'm still not willing to go further in
> this direction. Reasons are: a) It probably can't make a big
> difference to the final result. b) The best implementation of this
> idea would allow the user of tuplesort.c to insert pre-sorted tuples
> into a tape directly, rather than inserting them into tuplesort's
> memory and dumping them to tape without a sort.

You'd still need to keep track of sorted runs on those tapes, which is what
tuplesort.c does for us.

> However, I can't define a clean API for the former case.

I imagine a pair of tuplesort_beginsortedrun() /
tuplesort_endsortedrun() functions could help here, but I'm not 100%
sure we'd want to expose Tuplesort to non-PG sorting algorithms, as it
would be an easy way to produce incorrect results if the sort used in
tuplesort isn't exactly equivalent to the sort used by the provider of
the tuples.
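
Purely as a sketch of that (non-existent) API surface:

    /*
     * Hypothetical additions to tuplesort.h -- these do not exist
     * today; this is just the shape the proposal above suggests.
     *
     * Between begin and end, the caller promises that tuples are fed
     * in the sort's own order; tuplesort can then stream them straight
     * onto a new tape run, bypassing the in-memory sort step.
     * Violating that ordering contract is exactly the correctness
     * hazard noted above.
     */
    extern void tuplesort_beginsortedrun(Tuplesortstate *state);
    extern void tuplesort_endsortedrun(Tuplesortstate *state);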

> c) CREATE INDEX is maintenance work; improving it by 30% would be
> good, but if we only improve it by <3%, it doesn't look very
> attractive in practice.

I think improving reindex operations by 3% can be well worth the effort.

Also, do note that the current patch still does not correctly handle
[maintenance_]work_mem: every backend's BuildAccumulator uses up to
work_mem of memory, while the launched tuplesorts use an additional
maintenance_work_mem, for a total of (workers + 1) * work_mem + m_w_m
of memory usage. The available memory should instead be divided between
the tuplesort and the BuildAccumulator, but it can probably mostly go
to the BuildAccumulator if we can dump its data into the tuplesort
directly, as that would reduce the tuplesort's overall number of
operations and memory allocations. I think that once we account for
memory correctly (and with an improved write path), we'll see a
meaningfully larger performance improvement.
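
To make that concrete (my own numbers, just plugging values into the
formula above): with 3 workers, work_mem = 64MB and
maintenance_work_mem = 256MB, the build may use (3 + 1) * 64MB + 256MB
= 512MB, where a user would reasonably expect it to be bounded by
maintenance_work_mem = 256MB alone.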

> So my opinion is that if we can reach agreement on 0008, we can do a
> final review/test of the existing code (including 0009) and leave
> further improvements to dedicated patches.

As mentioned above, I think I could update the patch for a btree
implementation that also has immediate benefits, if so desired?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
