Re: Parallel CREATE INDEX for GIN indexes

From: Andy Fan <zhihuifan1213(at)163(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date: 2024-08-27 10:14:46
Message-ID: 87a5gyqnl5.fsf@163.com
Lists: pgsql-hackers

Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> writes:

> Hi,
>
> I got to do the detailed benchmarking on the latest version of the patch
> series, so here's the results. My goal was to better understand the
> impact of each patch individually - especially the two parts introduced
> by Matthias, but not only - so I ran the test on a build with each of
> the 0001-0009 patches.
>
> This is the same test I did at the very beginning, but the basic details
> are that I have a 22GB table with archives of our mailing lists (1.6M
> messages, roughly), and I build a couple different GIN indexes on
> that:
..
>

Very impressive testing!

> And let's talk about the improvement by Matthias, namely:
>
> * 0008 Use a single GIN tuplesort
> * 0009 Reduce the size of GinTuple by 12 bytes
>
> I haven't really seen any impact on duration - it seems more or less
> within noise. Maybe it would be different on machines with less RAM, but
> on my two systems it didn't really make a difference.
>
> It did significantly reduce the amount of temporary data written, by
> ~40% or so. This is pretty nicely visible on the "trgm" case, which
> generates the most temp files of the four indexes. An example from the
> i5/32MB section looks like this:
>
> label 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 0010
> ------------------------------------------------------------------------
> trgm / 3 0 2635 3690 3715 1177 1177 1179 1179 696 682 1016

After seeing the above data, I wanted to know where the time is spent and
why the ~40% I/O reduction doesn't bring a measurable duration
improvement, so I ran the following test.

create table gin_t (a int[]);
insert into gin_t select * from rand_array(30000000, 0, 100, 0, 50); [1]
select pg_prewarm('gin_t');

postgres=# create index on gin_t using gin(a);
INFO: pid: 145078, stage 1 took 44476 ms
INFO: pid: 145177, stage 1 took 44474 ms
INFO: pid: 145078, stage 2 took 2662 ms
INFO: pid: 145177, stage 2 took 2611 ms
INFO: pid: 145177, stage 3 took 240 ms
INFO: pid: 145078, stage 3 took 239 ms

CREATE INDEX
Time: 79472.135 ms (01:19.472)

From this we can see stage 1 takes 56% of the execution time, stages 2
and 3 together take ~4%, and the leader's work takes the remaining ~40%.
I think that's why we didn't see much performance improvement from 0008,
since it improves only stages 2 and 3.
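To make the breakdown explicit, the percentages follow from the INFO
timings above (taking the slower worker for each stage; treating
everything outside the workers' stages as the leader's share is my
simplifying assumption):

```python
# Back-of-the-envelope breakdown of the CREATE INDEX timings above.
total_ms = 79472            # total CREATE INDEX duration
stage1_ms = 44476           # slower worker: table_index_build_scan + flush
stage23_ms = 2662 + 240     # slower worker: stage 2 + stage 3

stage1_pct = 100 * stage1_ms / total_ms
stage23_pct = 100 * stage23_ms / total_ms
# Everything else (merging + inserting into the index) runs in the leader.
leader_pct = 100 - stage1_pct - stage23_pct

print(f"stage 1: {stage1_pct:.0f}%, stage 2+3: {stage23_pct:.0f}%, "
      f"leader: {leader_pct:.0f}%")
```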

==== Here is my definition of stages 1/2/3:
stage 1:
    reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
                                       ginBuildCallbackParallel, state, scan);

    /* write remaining accumulated entries */
    ginFlushBuildState(state, index);

stage 2:
    _gin_process_worker_data(state, state->bs_worker_sort)

stage 3:
    tuplesort_performsort(state->bs_sortstate);

But 0008 still brings a lot of good stuff:
1. It reduces I/O usage, which should matter more for heavily I/O-bound
workloads.
2. It simplifies the build logic by removing one stage.
3. It adds the 'buffer-writetup' support to tuplesort.c; I don't have
another use case for it right now, but it looks like a reasonable design.
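As I understand it, the 'buffer-writetup' idea is to batch several
serialized tuples in memory and hand them to the tape in larger chunks,
instead of one write call per tuple. A minimal sketch of that pattern
(this is NOT the tuplesort.c code; the buffer size and class names are
made up for illustration):

```python
# Sketch of buffered tuple writes: batch serialized tuples in memory and
# flush them to the underlying "tape" in larger chunks. Illustrative only;
# tuplesort.c's actual implementation differs.
TAPE_BUFFER = 8192  # hypothetical flush threshold in bytes

class BufferedTapeWriter:
    def __init__(self, tape):
        self.tape = tape          # anything with a .write(bytes) method
        self.buf = bytearray()

    def write_tuple(self, serialized: bytes):
        self.buf += serialized
        if len(self.buf) >= TAPE_BUFFER:
            self.flush()

    def flush(self):
        if self.buf:
            self.tape.write(bytes(self.buf))   # one large write
            self.buf.clear()

class CountingTape:
    """Counts how many write calls reach the 'disk'."""
    def __init__(self):
        self.writes = 0
        self.data = b""
    def write(self, chunk):
        self.writes += 1
        self.data += chunk

tape = CountingTape()
writer = BufferedTapeWriter(tape)
for _ in range(1000):
    writer.write_tuple(b"x" * 100)   # 1000 tuples of 100 bytes each
writer.flush()
# Unbuffered this would be 1000 write calls; buffered it is only 13.
```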

I think the current blocker is whether it is safe to hack tuplesort.c
this way. To my current knowledge it looks good, but it would be better
to open a dedicated thread to discuss this specifically; the review
should not take long if someone experienced in this area takes a look.

> Now, what's the 0010 patch about?
>
> For some indexes (e.g. trgm), the parallel builds help a lot, because
> they produce a lot of temporary data and the parallel sort is a
> substantial part of the work. But for other indexes (especially the
> "smaller" indexes on jsonb headers), it's not that great. For example
> for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.
>
> Clearly, this happens because a lot of time is spent outside the sort,
> actually inserting data into the index.

You always focus on the most important part, which inspires me a lot.
Even in my simple test, the "inserting data into index" stage takes
~40% of the time.

> So I was wondering if we might
> parallelize that too, and how much time would it save - 0010 is an
> experimental patch doing that. It splits the processing into 3 phases:
>
> 1. workers feeding data into tuplesort
> 2. leader finishes sort and "repartitions" the data
> 3. workers inserting their partition into index
>
> The patch is far from perfect (more a PoC) ..
>
> This does help a little bit, reducing the duration by ~15-25%. I wonder
> if this might be improved by partitioning the data differently - not by
> shuffling everything from the tuplesort into fileset (it increases the
> amount of temporary data in the charts). And also by distributing the
> data differently - right now it's a bit of a round robin, because it
> wasn't clear we know how many entries are there.
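The round-robin repartitioning described above could be sketched roughly
like this (my reading of the idea, not the 0010 code: the real patch
works on GinTuples in a shared fileset, and I assume here that all
entries for one key must land in the same partition, since two workers
should not insert into the same posting tree concurrently):

```python
# Sketch: distribute a sorted stream of (key, value) entries across N
# worker partitions round-robin, advancing to the next worker whenever a
# new key starts, so all entries for one key stay in one partition.
def repartition_round_robin(sorted_entries, nworkers):
    partitions = [[] for _ in range(nworkers)]
    current_key, target = object(), -1    # sentinel: matches no real key
    for key, value in sorted_entries:
        if key != current_key:            # new key: pick the next worker
            current_key = key
            target = (target + 1) % nworkers
        partitions[target].append((key, value))
    return partitions

entries = [("abc", 1), ("abc", 2), ("bcd", 3), ("cde", 4), ("cde", 5)]
parts = repartition_round_robin(entries, 2)
# All "abc" and "cde" entries land in partition 0, "bcd" in partition 1.
```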

Given the complexity of the existing code, I would like to focus on the
existing patches first, so I vote for doing this optimization as a
dedicated patch.

>>> and later we called 'tuplesort_performsort(state->bs_sortstate);'. Even
>>> we have some CTID merges activity in '....(1)', the tuples are still
>>> ordered, so the sort (in both tuplesort_putgintuple and
>>> 'tuplesort_performsort) are not necessary,
>>> ..
>>> If Matthias's proposal is adopted, my optimization will not be useful
>>> anymore and Matthias's proposal looks like a more natural and efficient
>>> way.
>
> I think they might be complementary. I don't think it's reasonable to
> expect GIN's BuildAccumulator to buffer all the index tuples at the
> same time (as I mentioned upthread: we are or should be limited by
> work memory), but the BuildAccumulator will do a much better job at
> combining tuples than the in-memory sort + merge-write done by
> Tuplesort (because BA will use (much?) less memory for the same number
> of stored values).

Thank you Matthias for valuing my point! And thanks for highlighting the
benefit that BuildAccumulator can do a better job of sorting in memory
(I think that is mainly because BuildAccumulator can merge at run time
as it accepts more tuples). But I am still not willing to go further in
this direction. My reasons: a) It probably can't make a big difference
in the final result. b) The best implementation of this idea would let
the user of tuplesort.c insert pre-sorted tuples directly into a tape,
rather than inserting them into tuplesort's memory and then dumping them
to tape without a sort; however, I can't define a clean API for that.
c) CREATE INDEX is maintenance work; improving it by 30% would be good,
but if we only improve it by <3%, it is not very attractive in practice.
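The run-time merge I mean can be illustrated with a toy model: an
accumulator that merges each incoming (key, TID) into a per-key list
stores each key once, while a plain sort buffer keeps every incoming
tuple until sort time. (The real BuildAccumulator uses an rbtree keyed
by the index key and is memory-limited; this dict-based version only
illustrates the memory argument.)

```python
# Toy model: why merge-on-insert (BuildAccumulator-style) can use less
# memory than buffering every tuple for a later sort (tuplesort-style).
def accumulate(stream):
    acc = {}                      # key -> merged TID list, one entry per key
    for key, tid in stream:
        acc.setdefault(key, []).append(tid)
    return acc

def buffer_then_sort(stream):
    buf = list(stream)            # every (key, tid) kept until performsort
    buf.sort()
    return buf

stream = [("word", tid) for tid in range(1000)]  # 1000 tuples, one key
merged = accumulate(stream)
buffered = buffer_then_sort(stream)
# merged stores the key once with 1000 TIDs; buffered holds 1000
# (key, tid) pairs, repeating the key 1000 times.
```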

So my opinion is: if we can reach agreement on 0008, then we can do a
final review/test of the existing code (including 0009), and leave
further improvements to a dedicated patch.

What do you think?

[1] https://www.postgresql.org/message-id/87le0iqrsu.fsf%40163.com

--
Best Regards
Andy Fan
