From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Date: 2023-07-14 15:01:12
Message-ID: CAEze2Wgb+V1AbYueaJ6zMEEBwCMMRsbnQDn-iXV5fXN1JTFQpA@mail.gmail.com
Lists: pgsql-hackers
On Fri, 14 Jul 2023 at 15:57, Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> On 7/11/23 23:11, Matthias van de Meent wrote:
>> On Thu, 6 Jul 2023 at 16:13, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>
>>> One thing I was wondering about is whether it might be better to allow
>>> the workers to process overlapping ranges, and then let the leader to
>>> merge the summaries. That would mean we might not need the tableam.c
>>> changes at all, but the leader would need to do more work (although the
>>> BRIN indexes tend to be fairly small). The main reason that got me
>>> thinking about this is that we have pretty much no tests for the union
>>> procedures, because triggering that is really difficult. But for
>>> parallel index builds that'd be much more common.
>>
>> Hmm, that's a good point. I don't mind either way, but it would add
>> overhead in the leader to do all of that merging - especially when you
>> configure pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE as we'd
>> need to merge up to parallel_workers tuples. That could be a
>> significant overhead.
>>
>> ... thinks a bit.
>>
>> Hmm, but with the current P_S_M_C_S of 8192 blocks that's quite
>> unlikely to be a serious problem - the per-backend IO saved of such
>> large ranges on a single backend has presumably much more impact than
>> the merging of n_parallel_tasks max-sized brin tuples. So, seems fine
>> with me.
>>
>
> As for PARALLEL_SEQSCAN_MAX_CHUNK_SIZE, the last patch actually
> considers the chunk_factor (i.e. pages_per_range) *after* doing
>
> pbscanwork->phsw_chunk_size = Min(pbscanwork->phsw_chunk_size,
>                                   PARALLEL_SEQSCAN_MAX_CHUNK_SIZE);
>
> so even with (pages_per_range > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE) it
> would not need to merge anything.
>
> Now, that might have been a bad idea and PARALLEL_SEQSCAN_MAX_CHUNK_SIZE
> should be considered. In which case this *has* to do the union, even if
> only for the rare corner case.
>
> But I don't think that's a major issue - it's pretty sure summarizing
> the tuples is way more expensive than merging the summaries. Which is
> what matters for Amdahl's law ...
Agreed.
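For the archives, here's roughly how I picture that ordering - a
simplified, standalone sketch (effective_chunk_size and its callers
are made up by me; this is not the actual tableam.c logic):

#include <stdio.h>
#include <stdint.h>

/* value quoted upthread; the real definition lives in tableam.c */
#define PARALLEL_SEQSCAN_MAX_CHUNK_SIZE 8192

/*
 * Clamp first, then round the chunk up to a whole number of page
 * ranges, so a single BRIN range is never split across two workers'
 * chunks and the leader never has to union partial summaries.
 */
static uint32_t
effective_chunk_size(uint32_t chunk_size, uint32_t chunk_factor)
{
    if (chunk_size > PARALLEL_SEQSCAN_MAX_CHUNK_SIZE)
        chunk_size = PARALLEL_SEQSCAN_MAX_CHUNK_SIZE;

    if (chunk_factor > 0)
        chunk_size = ((chunk_size + chunk_factor - 1) / chunk_factor)
            * chunk_factor;

    return chunk_size;
}

int
main(void)
{
    /* pages_per_range above the clamp: exactly one range per chunk */
    printf("%u\n", effective_chunk_size(131072, 32768));   /* 32768 */
    /* small pages_per_range: the clamp is already a multiple of 128 */
    printf("%u\n", effective_chunk_size(131072, 128));     /* 8192 */
    return 0;
}
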
>>> + BrinSpool *bs_spool;
>>> + BrinSpool *bs_spool_out;
>>
>> Are both used? If so, could you add comments why we have two spools
>> here, instead of only one?
>>
>
> OK, I admit I'm not sure both are actually necessary. I was struggling
> getting it working with just one spool, because when the leader
> participates as a worker, it needs to both summarize some of the chunks
> (and put the tuples somewhere). And then it also needs to consume the
> final output.
>
> Maybe it's just a case of cargo cult programming - I was mostly copying
> stuff from the btree index build, trying to make it work, and then with
> two spools it started working.
Two spools seem to be necessary in a participating leader, but the
two spools have non-overlapping lifetimes. In the btree code, two
pairs of spools are actually used (for unique indexes): you can see
the pairs being allocated both in _bt_leader_participate_as_worker
(called from _bt_begin_parallel, which is itself called from
_bt_spools_heapscan) and in _bt_spools_heapscan itself.
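To illustrate what I mean by non-overlapping lifetimes, here is a
deliberately trivial, PostgreSQL-free toy of the control flow only
(made-up names, nothing from the patch): the leader first fills and
releases its own "worker" spool, and only afterwards touches the
output spool it consumes from.

#include <stdio.h>
#include <stdlib.h>

/* stand-in for a spool: just an array of "summaries" */
typedef struct ToySpool
{
    int    *tuples;
    int     ntuples;
} ToySpool;

static int
cmp_int(const void *a, const void *b)
{
    return *(const int *) a - *(const int *) b;
}

/* phase 1: the leader acts like any other worker, using spool_in */
static void
leader_as_worker(ToySpool *spool_in, int *shared, int *nshared)
{
    int     chunk[] = {42, 7, 19};      /* the leader's chunk */

    spool_in->ntuples = 3;
    spool_in->tuples = malloc(sizeof(chunk));
    for (int i = 0; i < 3; i++)
        spool_in->tuples[i] = chunk[i];
    qsort(spool_in->tuples, 3, sizeof(int), cmp_int);

    /* hand the sorted chunk over to the shared area */
    for (int i = 0; i < 3; i++)
        shared[(*nshared)++] = spool_in->tuples[i];

    free(spool_in->tuples);             /* spool_in's life ends here */
    spool_in->tuples = NULL;
}

/* phase 2: only now does the leader use spool_out */
static void
leader_consumes(ToySpool *spool_out, int *shared, int nshared)
{
    spool_out->tuples = shared;
    spool_out->ntuples = nshared;
    qsort(spool_out->tuples, nshared, sizeof(int), cmp_int);
    for (int i = 0; i < nshared; i++)
        printf("tuple %d\n", spool_out->tuples[i]);
}

int
main(void)
{
    ToySpool    spool_in = {0};
    ToySpool    spool_out = {0};
    int         shared[16];
    int         nshared = 0;

    leader_as_worker(&spool_in, shared, &nshared);
    leader_consumes(&spool_out, shared, nshared);
    return 0;
}
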
> >> diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
> >> [...]
> >> + * Computing BrinTuple size with only the tuple is difficult, so we want to track
> >> + * the length for the tuple referenced by SortTuple. That's what BrinSortTuple is meant
> >> + * to do - it's essentially a BrinTuple prefixed by length. We only write the
> >> + * BrinTuple to the logtapes, though.
> >
> > Why don't we write the full BrinSortTuple to disk? Doesn't that make more sense?
> >
>
> Not sure I understand. We ultimately do, because we write
>
> (length + BrinTuple)
>
> and BrinSortTuple is exactly that. But if we write BrinSortTuple, we
> would end up writing length for that too, no?
>
> Or maybe I just don't understand how would that make things simpler.
I don't quite understand the intricacies of the tape storage format
yet (specifically, I keep getting confused by the len -= sizeof(int)),
so you might well be correct.
My comment was written based on just the comment's contents, which
claim "we can't easily recompute the length, so we store it with the
tuple in memory; however, we don't store the length when we write it
to the tape", which seems self-contradictory.
> >> + tuplesort_puttuple_common(state, &stup,
> >> + base->sortKeys &&
> >> + base->sortKeys->abbrev_converter &&
> >> + !stup.isnull1);
> >
> > Can't this last argument just be inlined, based on knowledge that we
> > don't use sortKeys in brin?
> >
>
> What does "inlined" mean for an argument? But yeah, I guess it might be
> just set to false. And we should probably have an assert that there are
> no sortKeys.
"inlined", "precomputed", "const-ified"? I'm not super good at
vocabulary. But, indeed, thanks.
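Something like this shape is what I had in mind, mirroring the hunk
quoted above (hypothetical and untested, with the Assert being the
one you suggested):

    /* per the discussion: the BRIN variant isn't expected to set sortKeys */
    Assert(base->sortKeys == NULL);
    tuplesort_puttuple_common(state, &stup, false);
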
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)