From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
Cc: | Andy Fan <zhihuifan1213(at)163(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Parallel CREATE INDEX for GIN indexes |
Date: | 2024-05-09 19:26:04 |
Message-ID: | e05c0edf-b75a-4f09-ae2d-a6d0cef81748@enterprisedb.com |
Lists: | pgsql-hackers |
On 5/9/24 17:51, Matthias van de Meent wrote:
> On Thu, 9 May 2024 at 15:13, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>> Let me explain the relevant part of the patch, and how I understand the
>> improvement suggested by Matthias. The patch does the work in three phases:
>>
>>
>> 1) Worker gets data from the table, splits it into index items and adds
>> those into a "private" tuplesort, and finally sorts that. So a worker
>> may see a key many times, with different TIDs, so the tuplesort may
>> contain many items for the same key, with distinct TID lists:
>>
>> key1: 1, 2, 3, 4
>> key1: 5, 6, 7
>> key1: 8, 9, 10
>> key2: 1, 2, 3
>> ...
>
> This step is actually split into several components/phases, too.
> As opposed to btree, which directly puts each tuple's data into the
> tuplesort, this GIN approach actually buffers the tuples in memory to
> generate these TID lists for data keys, and flushes these pairs of Key
> + TID list into the tuplesort when its own memory limit is exceeded.
> That means we essentially double the memory used for this data: One
> GIN deform buffer, and one in-memory sort buffer in the tuplesort.
> This is fine for now, but feels duplicative, hence my "let's allow
> tuplesort to merge the key+TID pairs into pairs of key+TID list"
> comment.
>
True, although the "GIN deform buffer" (flushed by the callback if using
too much memory) likely does most of the merging already. If it only
happened in the tuplesort merge, we'd likely have far more tuples and
overhead associated with that. So we certainly won't get rid of either
of these things.
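Just to make the merging we keep talking about concrete, here's a minimal,
hypothetical sketch (not code from the patch; GinSortTuple, keys_equal and
flush_tuple are made-up names) of collapsing consecutive sorted entries with
the same key into one entry with a combined TID list. This relies on the
per-key lists coming out of a single worker being disjoint and in TID order:

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct GinSortTuple
{
    Datum       key;            /* the GIN key value */
    int         ntids;          /* number of TIDs in the list */
    ItemPointerData *tids;      /* sorted, non-overlapping TID list */
} GinSortTuple;

/* hypothetical helpers, stand-ins for opclass comparison and output */
static bool keys_equal(Datum a, Datum b);
static void flush_tuple(GinSortTuple *tup);

/*
 * Collapse a run of sorted (key, TID list) entries: consecutive entries
 * with equal keys are merged into one entry whose TID list is the
 * concatenation of the inputs.
 */
static void
combine_and_flush(GinSortTuple *tuples, int ntuples)
{
    GinSortTuple combined;
    int         i;

    if (ntuples == 0)
        return;

    combined = tuples[0];

    for (i = 1; i < ntuples; i++)
    {
        if (keys_equal(combined.key, tuples[i].key))
        {
            /*
             * Same key: append the next TID list.  Within one worker the
             * lists are disjoint and in TID order, so appending keeps the
             * combined list sorted.
             */
            combined.tids = repalloc(combined.tids,
                                     (combined.ntids + tuples[i].ntids) *
                                     sizeof(ItemPointerData));
            memcpy(combined.tids + combined.ntids,
                   tuples[i].tids,
                   tuples[i].ntids * sizeof(ItemPointerData));
            combined.ntids += tuples[i].ntids;
        }
        else
        {
            /* key changed: write out the combined tuple, start a new one */
            flush_tuple(&combined);
            combined = tuples[i];
        }
    }

    flush_tuple(&combined);
}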
You're right that the memory limits are a bit unclear and need more thought.
I certainly have not thought very hard about staying within the specified
maintenance_work_mem amount. This includes the planner code determining the
number of workers to use - right now it simply does the same thing as for
btree/brin, i.e. it assumes each worker uses 32MB of memory and checks how
many workers fit into maintenance_work_mem.
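For illustration, a simplified sketch of that heuristic (function name and
signature are mine, not the actual planner code): keep reducing the worker
count until every participant, leader included, gets at least 32MB of the
budget.

/*
 * Hypothetical sketch of the 32MB-per-participant cap described above.
 * maintenance_work_mem is in kB, as in the GUC.
 */
static int
cap_workers_by_mem(int parallel_workers, int maintenance_work_mem)
{
    const int   min_share = 32 * 1024;  /* 32MB per participant, in kB */

    /* count the leader as a participant too */
    while (parallel_workers > 0 &&
           maintenance_work_mem / (parallel_workers + 1) < min_share)
        parallel_workers--;

    return parallel_workers;
}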
That heuristic was a bit bogus even for BRIN, because BRIN sorts only
summaries, which are typically tiny - perhaps a couple kB, much less than
32MB. But it's still just one sort, and some opclasses produce much larger
summaries (bloom, for example). So I just went with it.
But for GIN it's more complicated, because we have two tuplesorts (I'm not
sure whether both can use the memory at the same time) and the GIN deform
buffer, which probably means we need a per-worker allowance covering all of
these buffers.
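Purely to make that idea concrete, one entirely hypothetical shape such an
allowance could take - the 50/50 split and all names are made up, this is
not what the patch currently does:

typedef struct GinBuildWorkerMem
{
    int     deform_buffer_kb;   /* budget for the in-memory key/TID buffer */
    int     tuplesort_kb;       /* budget for that worker's tuplesort(s) */
} GinBuildWorkerMem;

/* Give each participant an equal slice and split it between the buffers. */
static GinBuildWorkerMem
gin_worker_mem_allowance(int maintenance_work_mem, int nparticipants)
{
    GinBuildWorkerMem   mem;
    int     per_worker = maintenance_work_mem / nparticipants;  /* in kB */

    mem.deform_buffer_kb = per_worker / 2;
    mem.tuplesort_kb = per_worker - mem.deform_buffer_kb;

    return mem;
}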
>> The trouble with (2) is that it "just copies" data from one tuplesort
>> into another, increasing the disk space requirements. In an extreme
>> case, when nothing can be combined, it pretty much doubles the amount of
>> disk space, and makes the build longer.
>>
>> What I think Matthias is suggesting is that this "TID list merging"
>> could be done directly as part of the tuplesort in step (1). So instead
>> of just moving the "sort tuples" from the appropriate runs, it could
>> also do an optional step of combining the tuples and writing this
>> combined tuple into the tuplesort result (for that worker).
>
> Yes, but with a slightly more extensive approach than that even, see above.
>
>> Matthias also mentioned this might be useful when building btree indexes
>> with key deduplication.
>>
>> AFAICS this might work, although it probably requires the "combined"
>> tuple to be smaller than the sum of the tuples being combined (in order
>> to fit into the space).
>
> *must not be larger than the sum; not "must be smaller than the sum" [^0].
Yeah, I wrote that wrong.
> For btree tuples with posting lists this is guaranteed to be true: The
> added size of a btree tuple with a posting list (containing at least 2
> values) vs one without is the maxaligned size of 2 TIDs, or 16 bytes
> (12 on 32-bit systems). The smallest btree tuple with data is also 16
> bytes (or 12 bytes on 32-bit systems), so this works out nicely.
>
>> But at least in the GIN build in the workers this is
>> likely true, because the TID lists do not overlap (and thus do not hurt
>> compressibility).
>>
>> That being said, I still see this more as an optimization than something
>> required for the patch,
>
> Agreed.
>
OK
>> and I don't think I'll have time to work on this
>> anytime soon. The patch is not extremely complex, but it's not trivial
>> either. But if someone wants to take a stab at extending tuplesort to
>> allow this, I won't object ...
>
> Same here: While I do have some ideas on where and how to implement
> this, I'm not planning on working on that soon.
>
Understood. I don't have a very good intuition on how significant the
benefit could be, which is one of the reasons why I have not prioritized
this very much.
I did a quick experiment to measure how expensive it is to build the second
worker tuplesort - for the pg_trgm index build with 2 workers, it takes
~30 seconds. The index build takes ~550s in total, so that's ~5%. If we
eliminated all of this work, that's what we'd save, but in reality some of
it will still be necessary.
Perhaps it's more significant for other indexes / slower storage, but it
does not seem like a *must have* for v1.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company