From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Kirill Reshke <reshkekirill(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date: 2025-02-17 14:56:51
Message-ID: 2896ac0f-a3af-4321-8f94-380b9906c215@vondra.me
Lists: pgsql-hackers
On 2/17/25 14:01, Matthias van de Meent wrote:
> On Sun, 16 Feb 2025 at 04:47, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> Hi,
>>
>> Attached is a cleaned up version of the patch series, squashed into
>> fewer patches as discussed. I also went through all the comments, and
>> removed/updated some obsolete ones. I also updated the commit messages,
>> it'd be nice if someone could read through those, to make sure it's
>> clear enough.
>>
>> While cleaning the comments, I realized there's a couple remaining XXX
>> and FIXME comments, with some valid open questions.
>>
>> 1) There are two places that explicitly zero memory, suggesting it's
>> because of padding causing issues in valgrind (in tuplesort). I need to
>> check if that's still true, but I wonder how the other tuplesort
>> variants write stuff without tripping valgrind. Maybe the GinTuple is
>> too unique.
>
>
>
>> 2) ginBuildCallbackParallel says this about memory limits:
>>
>> * XXX It might seem this should set the memory limit to 32MB, same as
>> * what plan_create_index_workers() uses to calculate the number of
>> * parallel workers, but that's the limit for tuplesort. So it seems
>> * better to keep using work_mem here.
>> *
>> * XXX But maybe we should calculate this as a per-worker fraction of
>> * maintenance_work_mem. It's weird to use work_mem here, in a clearly
>> * maintenance command.
>>
>> The function uses work_mem to limit the amount of memory used by each
>> worker, which seems a bit strange - it's a maintenance operation, so it
>> would be more appropriate to use maintenance_work_mem I guess.
>>
>> I see the btree code also uses work_mem in some cases when building the
>> index, although that uses it to size the tuplesort. And here we have
>> both the tuplesorts (sized just like in nbtree code), but also the
>> buffer used to accumulate entries.
>
> I think that's a bug in btree code.
>
Not sure I'd call it a bug, but it's certainly a bit confusing. A
maintenance operation using a mix of maintenance_work_mem and work_mem
seems a bit unexpected.
I kinda understand the logic (which predates parallel builds), based on
the assumption that [m_w_m >> w_m]. That's generally true, but with
parallel builds it becomes [(m_w_m/k) >> w_m], which is less likely to
hold. The code actually uses Min(sortmem, work_mem), so it'll likely use
sortmem anyway, and the whole work_mem discussion seems a bit moot.
Anyway, it's not my ambition to rethink this part of nbtree builds. It's
long-standing behavior, and no one has ever complained about it (AFAIK).
For GIN I propose to do roughly what the attached 0002 does - split the
sortmem and use half for the tuplesort, half for accumulating entries.
That's the best idea I have, and by default this will use ~16MB for each
(because of how plan_create_index_workers picks worker count), which
seems reasonable. I can imagine being smarter and capping the tuplesort
memory (to give preference to accumulating more entries before having to
flush it into the buffer).
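To make that a bit more concrete, the split is meant to work roughly
like this (a sketch only, not the actual 0002 diff; accum_mem and
worker_sort are made-up names for illustration):

    /* per-worker allowance */
    int     sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;

    /* half of the allowance limits the entry accumulation ... */
    buildstate->accum_mem = (sortmem / 2);

    /* ... and the other half sizes the worker's tuplesort */
    worker_sort = tuplesort_begin_index_gin(heap, index,
                                            (sortmem / 2), coordinate,
                                            TUPLESORT_NONE);
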
There are ways to force more workers than the usual logic in
plan_create_index_workers would allow (e.g. by setting it for the
relation), but I'd say that's the user's choice.
>> I wonder if maybe the right solution would be to use half the allowance
>> for tuplesort and half for the buffer. In the workers the allowance is
>>
>> maintenance_work_mem / ginleader->nparticipanttuplesorts
>>
>> while in the leader it's maintenance_work_mem. Opinions?
>
> Why is the allowance in the leader not affected by memory usage of
> parallel workers? Shouldn't that also be m_w_m / nparticipants?
>
> IIRC, in nbtree, the leader will use (n_planned_part -
> n_launched_part) * (m_w_m / n_planned_part), which in practice is 1 *
> (m_w_m / n_planned_part).
>
Sorry, I didn't write it very clearly. For the parallel part (when
acting as a worker), the leader will use the same amount of memory as
any other worker.
What I meant is that the shared tuplesort is allocated like this:

    state->bs_sortstate =
        tuplesort_begin_index_gin(heap, index,
                                  maintenance_work_mem, coordinate,
                                  TUPLESORT_NONE);

so the final sort will use m_w_m. But that's fine, the leader does not
need to accumulate a lot of entries anyway - it'll keep a single key and
then flush it when the TID list gets too long.
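To illustrate what I mean, the merge in the leader is shaped roughly
like this (a simplified sketch, the helper names are illustrative
rather than the actual patch code):

    /* read the sorted GinTuples, accumulate TIDs for one key at a time */
    while ((tup = tuplesort_getgintuple(sortstate, &len, true)) != NULL)
    {
        if (!GinBufferIsEmpty(buffer) && !GinBufferCanAddKey(buffer, tup))
        {
            /* key changed - insert the accumulated TIDs into the index */
            ginEntryInsert(ginstate, buffer->attnum, buffer->key,
                           buffer->category, buffer->items, buffer->nitems,
                           buildStats);
            GinBufferReset(buffer);
        }

        GinBufferStoreTuple(buffer, tup);
    }

(The final flush after the loop and the "TID list too long" case are
left out - both flush the same way.)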
>> 3) There's a XXX comment suggesting to use a separate memory context for
>> the GinBuffer, but I decided it doesn't seem really necessary. We're not
>> running any complex function or anything like that in this code, so I
>> don't see a huge benefit of a separate context.
>>
>> I know the patch reworking this to use a single tuplesort actually adds
>> the memory context, maybe it's helpful for that patch. But for now I
>> don't see the point.
>
> I think I added that for the reduced cost of memory cleanup using mctx
> resets vs repeated pfree(), when we're in the tuplesort merge phase.
>
Hmm ... I don't think I've ever seen this as a very expensive part of
the build. It's not like we have a huge number of pointers to free,
pretty much just the "items" array of TIDs, right? But that'll be either
small (and we'll just shove it into the cache), or large (and then we'll
do a "full" free, swamping any other costs).
I plan to do nothing about this for now. We may rethink later, if it
ever happens to be an issue.
>> 4) The patch saving 12B in the GinTuple also added this comment:
>>
>> * XXX: Update description with new architecture
>>
>> but I'm a bit unsure what exactly is meant by "architecture" or what
>> I should add a description for.
>
> I worked on both 'smaller GinTuple' and the 'single tuplesort' patch
> in one go, without any intermediate commits to delineate work on
> either patch, and picked the changes for 'smaller GinTuple' from that
> large pile of changes. That XXX was supposed to go into the second
> patch, and was there to signal me to update the overarching
> architecture documentation for Parallel GIN index builds (which I
> subsequently forgot to do with the 'single tuplesort' patch). So, it's
> not relevant to the 12B patch, but could be relevant to what is now
> patch 0004.
>
OK, thanks. I've removed the comment from the 0001 patch.
Also, while stress-testing the patches, I ran into a bug in this patch:

    WIP: parallel inserts into GIN index

The patch adds a barrier into _gin_begin_parallel(), and it initializes
it like this:

    BarrierInit(&ginshared->build_barrier, scantuplesortstates);

The trouble is that this happens before we launch the workers, so the
count is just what we asked for - but we may not actually get that many
workers, in which case the BarrierArriveAndWait() at the end hangs
forever.
Why does this even need the barrier? Doesn't _gin_parallel_heapscan() do
exactly the wait this is meant to do?
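Assuming _gin_parallel_heapscan() mirrors nbtree's
_bt_parallel_heapscan(), that wait compares a "participants done"
counter against a participant count that's only set after
LaunchParallelWorkers(), so it doesn't care how many workers we merely
asked for. Roughly (sketch; the shared-state field names are assumed):

    nparticipants = ginleader->nparticipanttuplesorts;  /* set after launch */

    for (;;)
    {
        SpinLockAcquire(&ginshared->mutex);
        if (ginshared->nparticipantsdone == nparticipants)
        {
            SpinLockRelease(&ginshared->mutex);
            break;
        }
        SpinLockRelease(&ginshared->mutex);

        ConditionVariableSleep(&ginshared->workersdonecv,
                               WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
    }

    ConditionVariableCancelSleep();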
regards
--
Tomas Vondra
Attachment | Content-Type | Size |
---|---|---|
v20250217-0001-Allow-parallel-CREATE-INDEX-for-GIN-indexe.patch | text/x-patch | 68.7 KB |
v20250217-0002-WIP-don-t-use-work_mem-to-size-buffers.patch | text/x-patch | 1.9 KB |
v20250217-0003-Compress-TID-lists-when-writing-GIN-tuples.patch | text/x-patch | 8.2 KB |
v20250217-0004-Enforce-memory-limit-during-parallel-GIN-b.patch | text/x-patch | 12.3 KB |
v20250217-0005-Use-a-single-GIN-tuplesort.patch | text/x-patch | 31.6 KB |
v20250217-0006-WIP-parallel-inserts-into-GIN-index.patch | text/x-patch | 17.5 KB |