Re: Parallel CREATE INDEX for GIN indexes

From: Andy Fan <zhihuifan1213(at)163(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date: 2024-08-28 00:37:55
Message-ID: 874j754h3w.fsf@163.com
Lists: pgsql-hackers

Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> writes:

> tuplesort_performsort usually only needs to flush the last elements of
> ... In certain rare
> cases it may take a longer time as it may have to merge sorted runs,
> but those cases are quite rare in my experience.

OK, I would expect such cases to be not so rare: suppose we have hundreds
of GB of heap tuples with a 64MB or 1GB maintenance_work_mem setup; that
would probably hit this situation. I don't mean to argue about whose
experience reflects the common case, I just want to highlight the gap in
our assumptions. Thanks for sharing this; I can pay more attention to this
direction in my future work. To be clear, my setup did hit the 'mergeruns'
case.
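
To make that concrete with rough arithmetic (numbers invented purely for
illustration): sorting 200GB of index entries with maintenance_work_mem =
64MB produces on the order of 200GB / 64MB ~= 3200 initial sorted runs, so
tuplesort_performsort must run the merge; only when the whole input fits
in the sort memory can it skip 'mergeruns' entirely.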

>
>> But 0008 still does a lot of good stuff:
>> 1. Reduces IO usage, which would be even more useful for heavily
>> IO-bound workloads.
>> 2. Simplifies the building logic by removing one stage.
>> 3. Adds the 'buffer-writetup' to tuplesort.c; I don't have another use
>> case for now, but it looks like a reasonable design.
>
> I'd imagine nbtree would like to use this too, for applying some
> deduplication in the sort stage.

IIUC, the current nbtree code does the deduplication while inserting into
the tree; in your new strategy we could move it to the "sort" stage, which
looks good to me [just confirming my understanding].
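
To sketch what I am imagining (pseudo-C from me only, not actual nbtree or
tuplesort code; writetup_dedup and every helper below are hypothetical
names):

    /* Hypothetical accumulator for the posting tuple being built. */
    static PendingPosting *pending = NULL;

    /*
     * Hypothetical: deduplicate while flushing sorted tuples to a tape,
     * instead of at insert time.  Tuples arrive here in key order, so
     * equal keys are adjacent and we only need to remember the previous
     * key.
     */
    static void
    writetup_dedup(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
    {
        IndexTuple  itup = (IndexTuple) stup->tuple;

        if (pending != NULL && keys_equal(pending, itup))
        {
            append_tid(pending, itup);  /* fold this TID into the posting list */
            return;                     /* write nothing yet */
        }
        if (pending != NULL)
            write_posting_tuple(state, tape, pending); /* key changed: flush */
        pending = start_posting_tuple(itup);
    }

(keys_equal / append_tid / write_posting_tuple / start_posting_tuple stand
in for the real key comparison and posting-list construction.)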

> The IO benefits are quite likely to
> be worth it; a minimum space saving of 25% on duplicated key values in
> tuple sorts sounds real great.

Just to be clear on this: the IO benefit can only be gained when the
tuplesort's memory can't hold all the tuples, in which case
tuplesort_performsort runs 'mergeruns'; otherwise we get no benefit?

>
>> I think the current blocker is whether it is safe to hack tuplesort.c.
>> With my current knowledge it looks good to me, but it would be better to
>> open a dedicated thread to discuss this specifically; the review would
>> not take long if someone experienced in this area took a look.
>
> I could adapt the patch for nbtree use, to see if anyone's willing to
> review that?

I'm interested in it and can do some review & testing. But the key point
is that we need some authoritative reviewers willing to review it, and to
improve the odds of that, a dedicated thread would be helpful.

>> > Now, what's the 0010 patch about?
>> >
>> > For some indexes (e.g. trgm), the parallel builds help a lot, because
>> > they produce a lot of temporary data and the parallel sort is a
>> > substantial part of the work. But for other indexes (especially the
>> > "smaller" indexes on jsonb headers), it's not that great. For example
>> > for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.
>> >
>> > Clearly, this happens because a lot of time is spent outside the sort,
>> > actually inserting data into the index.
>>
>> You can always focus on the most important part, which inspires me a
>> lot; even in my simple testing, the "inserting data into index" stage
>> takes 40% of the time.
>
> nbtree does sorted insertions into the tree, constructing leaf pages
> one at a time and adding separator keys in the page above when the
> leaf page was filled, thus removing the need to descend the btree. I
> imagine we can save some performance by mirroring that in GIN too,
> with as additional bonus that we'd be free to start logging completed
> pages before we're done with the full index, reducing max WAL
> throughput in GIN index creation.

I agree this is a promising direction as well.
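
For the archives, the control flow I understand you to be describing, in
rough pseudo-C (the names are invented; the real nbtree version of this is
_bt_buildadd() in nbtsort.c):

    /* Sketch: consume entries in sorted order, filling one leaf at a time. */
    while ((itup = next_sorted_entry(sortstate)) != NULL)
    {
        if (!leaf_has_room(leaf, itup))
        {
            finish_and_log_page(leaf);  /* page is final; WAL can go out now */
            add_separator_key(parent_level, leaf_high_key(leaf));
            leaf = start_new_leaf();
        }
        add_item_to_leaf(leaf, itup);
    }

i.e. insertions never descend the tree, and each completed page can be
logged immediately instead of all pages being logged at the end.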

>> > I think they might be complementary. I don't think it's reasonable to
>> > expect GIN's BuildAccumulator to buffer all the index tuples at the
>> > same time (as I mentioned upthread: we are or should be limited by
>> > work memory), but the BuildAccumulator will do a much better job at
>> > combining tuples than the in-memory sort + merge-write done by
>> > Tuplesort (because BA will use (much?) less memory for the same number
>> > of stored values).
>>
>> Thank you Matthias for valuing my point! And thanks for highlighting
>> the benefit that BuildAccumulator can do a better job of sorting in
>> memory (I think it is mainly because BuildAccumulator can do a run-time
>> merge when accepting more tuples). But I am still not willing to go
>> further in this direction. Reasons are: a). It probably can't make a
>> big difference in the final result. b). The best implementation of this
>> idea would be allowing the user of tuplesort.c to insert pre-sorted
>> tuples into the tape directly, rather than inserting them into
>> tuplesort's memory and then dumping them onto the tape without a sort.
>
> You'd still need to keep track of sorted runs on those tapes, which is what
> tuplesort.c does for us.
>
>> However I can't define a clean API for
>> the former case.
>
> I imagine a pair of tuplesort_beginsortedrun();
> tuplesort_endsortedrun() -functions to help this.

These APIs are indeed better than the ones in my mind :) Within the range
between tuplesort_beginsortedrun() and tuplesort_endsortedrun(), we can
bypass the tuplesort's memory.
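
Roughly how I picture using that pair (to be explicit, both functions are
only imagined so far, and tuplesort_putsortedtuple / next_presorted_entry
are names I am making up here):

    tuplesort_beginsortedrun(state);

    /*
     * The caller guarantees tuples arrive already sorted (e.g. drained
     * from BuildAccumulator via ginGetBAEntry), so tuplesort can append
     * them to the current tape run without buffering or sorting them in
     * memory.
     */
    while ((entry = next_presorted_entry(accum)) != NULL)
        tuplesort_putsortedtuple(state, entry);

    tuplesort_endsortedrun(state);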

> but I'm not 100%
> sure if we'd want to expose Tuplesort to non-PG sorting algorithms, as
> it would be one easy way to create incorrect results if the sort used
> in tuplesort isn't exactly equivalent to the sort used by the provider
> of the tuples.

OK.

>
>> c). CREATE INDEX is maintenance work; improving it by 30% would be
>> good, but if we only improve it by <3%, that doesn't look very
>> attractive in practice.
>
> I think improving 3% on reindex operations can be well worth the effort.

> Also, do note that the current patch does (still) not correctly handle
> [maintenance_]work_mem: Every backend's BuildAccumulator uses up to
> work_mem of memory here, while the launched tuplesorts use an
> additional maintenance_work_mem of memory, for a total of (workers +
> 1) * work_mem + m_w_m of memory usage. The available memory should
> instead be allocated between tuplesort and BuildAccumulator, but can
> probably mostly be allocated to just BuildAccumulator if we can dump
> the data into the tuplesort directly, as it'd reduce the overall
> number of operations and memory allocations for the tuplesort. I think
> that once we correctly account for memory allocations (and an improved
> write path) we'll be able to see a meaningfully larger performance
> improvement.
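
To put numbers on that accounting: with 3 workers, work_mem = 64MB and
maintenance_work_mem = 1GB, the formula gives (3 + 1) * 64MB + 1GB =
1.25GB in total, well beyond the 1GB the user intended to grant.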

Personally I am more a fan of your "buffer writetup" idea, and less
interested in tuplesort_beginsortedrun / tuplesort_endsortedrun. My '3%'
was about the latter; I guess you understood it as the former.
>
>> So my option is if we can have agreement on 0008, then we can final
>> review/test on the existing code (including 0009), and leave further
>> improvement as a dedicated patch.
>
> As mentioned above, I think I could update the patch for a btree
> implementation that also has immediate benefits, if so desired?

If you are talking about the buffered writetup in tuplesort, then I think
that would be great, and in a dedicated thread for better exposure.

--
Best Regards
Andy Fan
