From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date: 2024-05-02 18:22:23
Message-ID: 5b5a059d-0699-4cc2-b9a0-32053ea19a65@enterprisedb.com
Lists: pgsql-hackers
On 5/2/24 19:12, Matthias van de Meent wrote:
> On Thu, 2 May 2024 at 17:19, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> Hi,
>>
>> In PG17 we shall have parallel CREATE INDEX for BRIN indexes, and back
>> when working on that I was wondering how difficult it would be to do
>> something similar for other index types, like GIN. I even had that on
>> my list of ideas to pitch to potential contributors, as I was fairly
>> sure it's doable and reasonably isolated / well-defined.
>>
>> However, I was not aware of any takers, so a couple of days ago on a
>> slow weekend I took a stab at it. And yes, it's doable - attached is
>> what I think is a fairly complete, tested and polished version of the
>> feature. It turned out to be a bit more complex than I expected, for
>> reasons that I'll get into when discussing the patches.
>
> This is great. I've been thinking about approximately the same issue
> recently, too, but haven't had time to discuss/implement any of this
> yet. I think some solutions may even be portable to the btree parallel
> build: it also has key deduplication (though to a much smaller degree)
> and could benefit from deduplication during the scan/ssup load phase,
> rather than only during insertion.
>
Perhaps, although I'm not that familiar with the details of btree
builds, and I didn't think about it while working on this over the
past couple of days.
>> First, let's talk about the benefits - how much faster is that than the
>> single-process build we have for GIN indexes? I do have a table with the
>> archive of all our mailing lists - it's ~1.5M messages, the table is
>> ~21GB (raw dump is about 28GB). It includes simple text data (message
>> body), JSONB (headers) and tsvector (full-text on message body).
>
> Sidenote: Did you include the tsvector in the table to reduce time
> spent during index creation? I would have used an expression in the
> index definition, rather than a direct column.
>
Yes, it's a materialized column, not computed during index creation.
>> If I do CREATE INDEX with different numbers of workers (0 means serial
>> build), I get these timings (in seconds):
>
> [...]
>
>> This shows the benefits are pretty nice, depending on the opclass. For
>> most indexes it's maybe ~3-4x faster, which is nice, and I don't think
>> it's possible to do much better - the actual index inserts can happen
>> from a single process only, which is the main limit.
>
> Can we really not insert with multiple processes? It seems to me that
> GIN could be very suitable for that purpose, with its clear double
> tree structure distinction that should result in few buffer conflicts
> if different backends work on known-to-be-very-different keys.
> We'd probably need multiple read heads on the shared tuplesort, and a
> way to join the generated top-level subtrees, but I don't think that
> is impossible. Maybe it's work for a later effort, though.
>
Maybe, but I took it as a restriction that seemed too difficult to
relax (or at least that was my assumption).
> Have you tested and/or benchmarked this with multi-column GIN indexes?
>
I did test that, and I'm not aware of any bugs/issues. Performance-wise
it depends on which opclasses the columns use - the speedup for the
whole index is roughly the average of the per-column speedups (e.g. if
one column alone builds 4x faster and the other 2x faster, the combined
index ends up somewhere around 3x).
>> For some of the opclasses it can regress (like the jsonb_path_ops). I
>> don't think that's a major issue. Or more precisely, I'm not surprised
>> by it. It'd be nice to be able to disable the parallel builds in these
>> cases somehow, but I haven't thought about that.
>
> Do you know why it regresses?
>
No, but one thing that stands out is that this index is much smaller
than those for the other columns/opclasses, and the compression does
not save much (only about 5% in both phases). So I assume it's the
overhead of writing and reading several GB of data without really
gaining much from doing that.
>> I do plan to do some tests with btree_gin, but I don't expect that to
>> behave significantly differently.
>>
>> There are small variations in the index size, when built in the serial
>> way and the parallel way. It's generally within ~5-10%, and I believe
>> it's due to the serial build adding the TIDs incrementally, while the
>> build adds them in much larger chunks (possibly even in one chunk with
>> all the TIDs for the key).
>
> I assume that was '[...] while the [parallel] build adds them [...]', right?
>
Right. The parallel build adds them in larger chunks.
>> I believe the same size variation can happen
>> if the index gets built in a different way, e.g. by inserting the data
>> in a different order, etc. I did a number of tests to check if the index
>> produces the correct results, and I haven't found any issues. So I think
>> this is OK, and neither a problem nor an advantage of the patch.
>>
>>
>> Now, let's talk about the code - the series has 7 patches, with 6
>> non-trivial parts making changes in focused and (I hope)
>> easy-to-understand pieces.
>
> The following comments are generally based on the descriptions; I
> haven't really looked much at the patches yet except to validate some
> assumptions.
>
OK
>> 1) v20240502-0001-Allow-parallel-create-for-GIN-indexes.patch
>>
>> This is the initial feature, adding the "basic" version, implemented as
>> a pretty much 1:1 copy of the BRIN parallel build, with minimal changes
>> to make it work for GIN (mostly about how to store intermediate results).
>>
>> The basic idea is that the workers do the regular build, but instead of
>> flushing the data into the index after hitting the memory limit, the
>> data gets written into a shared tuplesort and sorted by the index key.
>> The leader then reads this sorted data, accumulates the TIDs for a
>> given key and inserts them into the index in one go.
>
> In the code, GIN insertions are still basically single btree
> insertions, all starting from the top (but with many same-valued
> tuples at once). Now that we have a tuplesort with the full table's
> data, couldn't the code be adapted to do more efficient btree loading,
> such as that seen in the nbtree code, where the rightmost pages are
> cached and filled sequentially without requiring repeated searches
> down the tree? I suspect we can gain a lot of time there.
>
> I don't need you to do that, but what's your opinion on this?
>
I have no idea. I started working on this with only a very basic idea
of how GIN works / is structured, so I simply leveraged the existing
callback and massaged it to work in the parallel case too.
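
To give an idea of the accumulation the leader does (per the 0001
description above), here's a minimal standalone sketch - plain C types
instead of the actual GIN/tuplesort structures, with printf() standing
in for the real index insertion:

#include <stdio.h>
#include <string.h>

/* Standalone model of the sorted stream from the shared tuplesort:
 * (key, TID) entries ordered by key.  In the patch each entry carries
 * a whole TID list, not a single TID, but the logic is the same. */
typedef struct { const char *key; unsigned long tid; } Entry;

int main(void)
{
    Entry sorted[] = {
        {"apple", 10}, {"apple", 42}, {"pear", 7}, {"pear", 13}
    };
    int n = sizeof(sorted) / sizeof(sorted[0]);
    unsigned long tids[8];
    int ntids = 0;

    for (int i = 0; i < n; i++)
    {
        tids[ntids++] = sorted[i].tid;

        /* key changes (or stream ends): insert the accumulated
         * TID list for this key into the index in one go */
        if (i == n - 1 || strcmp(sorted[i].key, sorted[i + 1].key) != 0)
        {
            printf("insert key=%s, %d TIDs\n", sorted[i].key, ntids);
            ntids = 0;
        }
    }
    return 0;
}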
>> 2) v20240502-0002-Use-mergesort-in-the-leader-process.patch
>>
>> The approach implemented by 0001 works, but there's a bit of an issue -
>> if there are many distinct keys (which can happen very easily with e.g.
>> trigrams), the workers will hit the memory limit with only very short
>> TID lists for most keys. For a serial build that means merging the data
>> into a lot of random places, and in a parallel build it means the
>> leader will have to merge a lot of tiny lists from many sorted rows.
>>
>> Which can be quite annoying and expensive, because the leader does so
>> using qsort() in the serial part. It'd be better to ensure most of the
>> sorting happens in the workers, and the leader can do a mergesort. But
>> the mergesort must not happen too often - merging many small lists is
>> not cheaper than a single qsort (especially when the lists overlap).
>>
>> So this patch changes the workers to process the data in two phases. The
>> first works as before, but the data is flushed into a local tuplesort.
>> Then each worker sorts the results it produced, combines them into
>> results with much larger TID lists, and writes those results to the
>> shared tuplesort. So the leader only gets very few lists to combine for
>> a given key - usually just one list per worker.
>
> Hmm, I was hoping we could implement the merging inside the tuplesort
> itself during its own flush phase, as it could save significantly on
> IO, and could help other users of tuplesort with deduplication, too.
>
Would that happen in the worker or in the leader process? My goal was
to do the expensive part in the workers, because that's what helps
with the parallelization.
>> 3) v20240502-0003-Remove-the-explicit-pg_qsort-in-workers.patch
>>
>> In 0002 the workers still do an explicit qsort() on the TID list before
>> writing the data into the shared tuplesort. But we can do better - the
>> workers can do a merge sort too. To help with this, we add the first TID
>> to the tuplesort tuple, and sort by that too - it helps the workers to
>> process the data in an order that allows simple concatenation instead of
>> the full mergesort.
>>
>> Note: There's a non-obvious issue due to parallel scans always being
>> "sync scans", which may lead to very "wide" TID ranges when the scan
>> wraps around. More about that later.
>
> As this note seems to imply, this seems to have a strong assumption
> that data received in parallel workers is always in TID order, with
> one optional wraparound. Non-HEAP TAMs may break with this assumption,
> so what's the plan on that?
>
Well, that would break the serial build too, right? Anyway, the way this
patch works can be extended to deal with that by actually sorting the
TIDs when serializing the tuplesort tuple. The consequence is that the
combining gets more expensive, because it requires a proper mergesort
instead of just appending the lists.
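
To illustrate the shortcut 0003 is after, here's a minimal standalone
sketch of combining two per-worker TID lists for one key - thanks to
the extra sort key, the first list is guaranteed to start no later
than the second, so when the ranges don't overlap we can simply append:

#include <stdio.h>

/* Combine two sorted TID lists for the same key.  The tuplesort orders
 * lists by their first TID, so "a" starts no later than "b".  If the
 * ranges don't overlap we can simply concatenate; otherwise fall back
 * to a real merge. */
static int combine(const unsigned long *a, int na,
                   const unsigned long *b, int nb,
                   unsigned long *out)
{
    int i = 0, j = 0, k = 0;

    if (a[na - 1] < b[0])
    {
        /* fast path: plain concatenation */
        while (i < na) out[k++] = a[i++];
        while (j < nb) out[k++] = b[j++];
        return k;
    }

    /* overlapping ranges: full merge */
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return k;
}

int main(void)
{
    unsigned long a[] = {1, 5, 9}, b[] = {12, 20}, out[8];
    int n = combine(a, 3, b, 2, out);

    for (int i = 0; i < n; i++) printf("%lu ", out[i]);
    printf("\n");
    return 0;
}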
>> 4) v20240502-0004-Compress-TID-lists-before-writing-tuples-t.patch
>>
>> The parallel build passes data between processes using temporary files,
>> which means it may need a significant amount of disk space. For BRIN
>> this was not a major concern, because the summaries tend to be pretty
>> small.
>>
>> But for GIN that's not the case, and the two-phase processing introduced
>> by 0002 makes it worse, because the worker essentially creates another
>> copy of the intermediate data. It does not need to copy the key, so
>> maybe it's not exactly 2x the space requirement, but in the worst case
>> it's not far from that.
>>
>> But there's a simple way to improve this - the TID lists tend to be
>> very compressible, and GIN already implements a very light-weight TID
>> compression, so this patch does just that - when building the tuple to
>> be written into the tuplesort, we simply compress the TIDs.
>
> See note on 0002: Could we do this in the tuplesort writeback, rather
> than by moving the data around multiple times?
>
No idea, I've never done that ...
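
For reference, though - the light-weight compression mentioned above is
varbyte encoding of the deltas between consecutive TIDs. A minimal
standalone sketch of that idea (simplified; not the actual posting-list
code or on-disk format, with TIDs modeled as plain integers):

#include <stdio.h>

/* Varbyte-encode deltas between consecutive (ascending) TIDs, 7 bits
 * per byte with the high bit as a continuation flag.  This is the idea
 * behind GIN's posting-list compression, not the exact format. */
static int encode(const unsigned long *tids, int n, unsigned char *out)
{
    unsigned long prev = 0;
    int len = 0;

    for (int i = 0; i < n; i++)
    {
        unsigned long delta = tids[i] - prev;

        prev = tids[i];
        while (delta >= 0x80)
        {
            out[len++] = (unsigned char) (delta & 0x7F) | 0x80;
            delta >>= 7;
        }
        out[len++] = (unsigned char) delta;
    }
    return len;
}

int main(void)
{
    unsigned long tids[] = {1000, 1003, 1004, 1500, 90000};
    unsigned char buf[64];
    int len = encode(tids, 5, buf);

    printf("%zu bytes raw -> %d bytes encoded\n", sizeof(tids), len);
    return 0;
}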
> [...]
>> So 0007 does something similar - it tracks whether the TID value goes
>> backward in the callback, and if it does, it dumps the state into the
>> tuplesort before processing the first tuple from the beginning of the
>> table. Which means we end up with two separate "narrow" TID lists, not
>> one very wide one.
>
> See note above: We may still need a merge phase, just to make sure we
> handle all TAM parallel scans correctly, even if that merge phase
> wouldn't get hit in vanilla PostgreSQL.
>
Well, yeah. But in fact the parallel code already does that, while the
existing serial code may fail with the "data don't fit" error.

The parallel code will do the mergesort correctly, and only emit TIDs
that we know are safe to write into the index (i.e. no future TID will
sort before the "TID horizon").

The serial build has nothing like that - it sorts the TIDs that fit
into the memory limit, but it relies on not processing data out of
order (and disables sync scans to avoid wraparound issues). If the TAM
does something funny, this may break.
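
For illustration, the wraparound detection in 0007 boils down to
something like this (a standalone sketch, with flush() standing in for
dumping the accumulated state into the tuplesort):

#include <stdio.h>

/* flush() stands in for dumping the accumulated build state into the
 * shared tuplesort, as 0007 does */
static void flush(int run)
{
    printf("flush run %d into tuplesort\n", run);
}

int main(void)
{
    /* TIDs as a parallel sync scan might produce them: the scan starts
     * mid-table (block 500) and wraps around to block 0.  Modeled as
     * plain integers here. */
    unsigned long tids[] = {500, 501, 900, 0, 1, 499};
    unsigned long last = 0;
    int run = 1;

    for (int i = 0; i < 6; i++)
    {
        /* TID went backwards: the scan wrapped around, so dump what we
         * accumulated so far and start a new "narrow" run */
        if (tids[i] < last)
            flush(run++);
        last = tids[i];
    }
    flush(run);
    return 0;
}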
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company