From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andy Fan <zhihuifan1213(at)163(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date: 2024-08-27 11:16:25
Message-ID: c2753e01-9b06-43f0-a9ae-43638ce4bedf@vondra.me
Lists: pgsql-hackers
On 8/27/24 12:14, Andy Fan wrote:
> Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> writes:
>
>> Hi,
>>
>> I got to do the detailed benchmarking on the latest version of the patch
>> series, so here's the results. My goal was to better understand the
>> impact of each patch individually - especially the two parts introduced
>> by Matthias, but not only - so I ran the test on a build with each of
>> the 0001-0009 patches.
>>
>> This is the same test I did at the very beginning, but the basic details
>> are that I have a 22GB table with archives of our mailing lists (1.6M
>> messages, roughly), and I build a couple different GIN indexes on
>> that:
> ..
>>
>
> Very impressive testing!
>
>> And let's talk about the improvement by Matthias, namely:
>>
>> * 0008 Use a single GIN tuplesort
>> * 0009 Reduce the size of GinTuple by 12 bytes
>>
>> I haven't really seen any impact on duration - it seems more or less
>> within noise. Maybe it would be different on machines with less RAM, but
>> on my two systems it didn't really make a difference.
>>
>> It did significantly reduce the amount of temporary data written, by
>> ~40% or so. This is pretty nicely visible on the "trgm" case, which
>> generates the most temp files of the four indexes. An example from the
>> i5/32MB section looks like this:
>>
>> label      0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 0010
>> ------------------------------------------------------------------
>> trgm / 3      0  2635 3690 3715 1177 1177 1179 1179  696  682 1016
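For the record, the ~40% figure follows directly from the trgm row above; a quick sanity check (values copied from the i5/32MB table, same units):

```python
# Temp data written for the trgm index, from the table above.
temp = {"0007": 1179, "0008": 696, "0009": 682}

# Relative reduction going from 0007 to 0008 (the single-tuplesort patch).
reduction = 1 - temp["0008"] / temp["0007"]
print(f"0008 cuts temp data by {reduction:.0%}")  # prints "0008 cuts temp data by 41%"
```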
>
> After seeing the above data, I wanted to know where the time is spent and
> why the ~40% IO reduction doesn't make a measurable duration improvement,
> so I did the following test.
>
> create table gin_t (a int[]);
> insert into gin_t select * from rand_array(30000000, 0, 100, 0, 50); [1]
> select pg_prewarm('gin_t');
>
> postgres=# create index on gin_t using gin(a);
> INFO: pid: 145078, stage 1 took 44476 ms
> INFO: pid: 145177, stage 1 took 44474 ms
> INFO: pid: 145078, stage 2 took 2662 ms
> INFO: pid: 145177, stage 2 took 2611 ms
> INFO: pid: 145177, stage 3 took 240 ms
> INFO: pid: 145078, stage 3 took 239 ms
>
> CREATE INDEX
> Time: 79472.135 ms (01:19.472)
>
> Then we can see that stage 1 takes 56% of the execution time, stages 2
> and 3 together take 3%, and the leader's work takes the remaining 41%.
> I think that's why we didn't see much performance improvement from 0008,
> since it improves only stages 2 and 3.
>
Yes, that makes sense. It's such a small fraction of the computation that
it can't translate into a meaningful speedup.
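For reference, the stage shares follow from quick arithmetic over the INFO timings quoted above (taking the slower worker per stage):

```python
# Stage timings (ms) from the INFO output, slower of the two workers,
# plus the total CREATE INDEX duration.
stage1, stage2, stage3, total = 44476, 2662, 240, 79472

s1 = stage1 / total              # ~56%: the parallel heap scan dominates
s23 = (stage2 + stage3) / total  # ~4%: the part 0008/0009 can improve
leader = 1.0 - s1 - s23          # ~40%: leader-side merge + index build
```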
> ==== Here is my definition for stage 1/2/3.
> stage 1:
> reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
> ginBuildCallbackParallel, state, scan);
>
> /* write remaining accumulated entries */
> ginFlushBuildState(state, index);
>
> stage 2:
> _gin_process_worker_data(state, state->bs_worker_sort)
>
> stage 3:
> tuplesort_performsort(state->bs_sortstate);
>
>
> But 0008 still does many good things:
> 1. Reduces IO usage, which would be more useful on heavily IO-bound
> workloads.
> 2. Simplifies the build logic by removing one stage.
> 3. Adds the 'buffer-writetup' to tuplesort.c; I don't have another use
> case for it right now, but it looks like a reasonable design.
>
> I think the current blocker is whether it is safe to hack tuplesort.c
> this way. With my current knowledge it looks good to me, but it would be
> better to open a dedicated thread to discuss this specifically; the
> review should not take long if someone experienced in this area takes a
> look.
>
I agree. I expressed the same impression earlier in this thread, IIRC.
>> Now, what's the 0010 patch about?
>>
>> For some indexes (e.g. trgm), the parallel builds help a lot, because
>> they produce a lot of temporary data and the parallel sort is a
>> substantial part of the work. But for other indexes (especially the
>> "smaller" indexes on jsonb headers), it's not that great. For example
>> for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.
>>
>> Clearly, this happens because a lot of time is spent outside the sort,
>> actually inserting data into the index.
>
> You always focus on the most important part, which inspires me a lot.
> Even in my simple testing, the "inserting data into index" stage takes
> 40% of the time.
>> So I was wondering if we might
>> parallelize that too, and how much time would it save - 0010 is an
>> experimental patch doing that. It splits the processing into 3 phases:
>>
>> 1. workers feeding data into tuplesort
>> 2. leader finishes sort and "repartitions" the data
>> 3. workers inserting their partition into index
>>
>> The patch is far from perfect (more a PoC) ..
>>
>> This does help a little bit, reducing the duration by ~15-25%. I wonder
>> if this might be improved by partitioning the data differently - not by
>> shuffling everything from the tuplesort into a fileset (which increases
>> the amount of temporary data in the charts). And also by distributing
>> the data differently - right now it's a bit of a round robin, because it
>> wasn't clear how many entries there are.
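To make the round-robin remark concrete, here's a hypothetical sketch (not the actual 0010 code) of dealing sorted key groups to workers when the total entry count isn't known up front:

```python
# Hypothetical sketch of the 0010 repartitioning idea: assign sorted
# key groups to N workers round-robin, since the number of entries is
# not known when partitioning starts.
def round_robin_partition(sorted_keys, nworkers):
    parts = [[] for _ in range(nworkers)]
    for i, key in enumerate(sorted_keys):
        parts[i % nworkers].append(key)
    return parts

parts = round_robin_partition(["a", "b", "c", "d", "e", "f", "g"], 3)
# each worker gets an interleaved slice of the key space, e.g.
# parts[0] == ["a", "d", "g"]
```

The downside is visible in the sketch: round-robin interleaves key ranges, so every worker inserts all over the index. Range-based partitioning would give each worker a contiguous portion, which is what "distributing the data differently" hints at.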
>
> Due to the complexity of the existing code, I would like to focus on
> the existing patches first. So I vote for doing this optimization as a
> dedicated patch.
>
I agree. Even if we decide to do these parallel inserts, it relies on
doing the parallel sort first. So it makes sense to leave that for
later, as an additional improvement.
>>>> and later we called 'tuplesort_performsort(state->bs_sortstate);'. Even
>>>> though we have some CTID merge activity in '....(1)', the tuples are
>>>> still ordered, so the sorts (in both tuplesort_putgintuple and
>>>> tuplesort_performsort) are not necessary,
>>>> ..
>>>> If Matthias's proposal is adopted, my optimization will not be useful
>>>> anymore, and Matthias's proposal looks like a more natural and
>>>> efficient way.
>>
>> I think they might be complementary. I don't think it's reasonable to
>> expect GIN's BuildAccumulator to buffer all the index tuples at the
>> same time (as I mentioned upthread: we are or should be limited by
>> work memory), but the BuildAccumulator will do a much better job at
>> combining tuples than the in-memory sort + merge-write done by
>> Tuplesort (because BA will use (much?) less memory for the same number
>> of stored values).
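As an aside, a toy model (not GIN's actual BuildAccumulator) of why run-time merging stores values more compactly than sorting individual pairs: each incoming (key, tid) pair is folded into a per-key TID list immediately, so a duplicated key is stored once rather than once per occurrence.

```python
# Toy accumulator: merge TIDs per key as tuples arrive, instead of
# buffering one (key, tid) pair per occurrence and deduplicating later.
def accumulate(pairs):
    acc = {}
    for key, tid in pairs:
        acc.setdefault(key, []).append(tid)
    return acc

acc = accumulate([("foo", 1), ("bar", 2), ("foo", 3), ("foo", 7)])
# four input pairs collapse into two entries; acc["foo"] == [1, 3, 7]
```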
>
> Thank you Matthias for valuing my point! And thanks for highlighting the
> benefit that BuildAccumulator can do a better job of sorting in memory
> (I think it is mainly because BuildAccumulator can do a run-time merge
> when accepting more tuples). But I am still not willing to go further in
> this direction. Reasons are: a) It probably can't make a big difference
> in the final result. b) The best implementation of this idea would be to
> allow the user of tuplesort.c to insert pre-sorted tuples into the tape
> directly, rather than inserting them into tuplesort's memory and dumping
> them to the tape without a sort. However, I can't define a clean API for
> the former case. c) CREATE INDEX is maintenance work; improving it by
> 30% would be good, but if we only improve it by less than 3%, it doesn't
> look very attractive in practice.
>
> So my opinion is that if we can reach agreement on 0008, then we can do
> the final review/testing of the existing code (including 0009), and
> leave further improvements for a dedicated patch.
>
> What do you think?
>
Yeah. I think we have agreement on 0001-0007. I'm a bit torn about 0008;
I had not expected to change tuplesort like this when I started working
on the patch, but I can't deny it's a massive speedup for some cases
(where the patch doesn't help otherwise). But then in other cases it
doesn't help at all, and 0010 helps. I wonder if there's a good way to
"flip" between those two approaches based on some heuristics.
regards
--
Tomas Vondra