Re: Parallel CREATE INDEX for BRIN indexes

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for BRIN indexes
Date: 2023-07-05 11:11:06
Message-ID: 254c4ca8-0644-f32c-d4ed-c4eed304054e@enterprisedb.com
Lists: pgsql-hackers

On 7/5/23 10:44, Matthias van de Meent wrote:
> On Wed, 5 Jul 2023 at 00:08, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>>
>>
>> On 7/4/23 23:53, Matthias van de Meent wrote:
>>> On Thu, 8 Jun 2023 at 14:55, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
>>>> infrastructure (starting workers etc.) is "inspired" by the BTREE code
>>>> (i.e. copied from that and massaged a bit to call brin stuff).
>>>
>>> Nice work.
>>>
>>>> In both cases _brin_end_parallel then reads the summaries from worker
>>>> files, and adds them into the index. In 0001 this is fairly simple,
>>>> although we could do one more improvement and sort the ranges by range
>>>> start to make the index nicer (and possibly a bit more efficient). This
>>>> should be simple, because the per-worker results are already sorted like
>>>> that (so a merge sort in _brin_end_parallel would be enough).
>>>
>>> I see that you manually built the passing and sorting of tuples
>>> between workers, but can't we use the parallel tuplesort
>>> infrastructure for that? It already has similar features in place and
>>> improves code commonality.
>>>
>>
>> Maybe. I wasn't that familiar with what parallel tuplesort can and can't
>> do, and the little I knew I managed to forget since I wrote this patch.
>> Which similar features do you have in mind?
>>
>> The workers are producing the results in "start_block" order, so if they
>> pass those results to the leader, it can probably do the usual merge sort.
>>
>>>> For 0002 it's a bit more complicated, because with a single parallel
>>>> scan brinbuildCallbackParallel can't decide if a range is assigned to a
>>>> different worker or empty. And we want to generate summaries for empty
>>>> ranges in the index. We could either skip such ranges during index build,
>>>> and then add empty summaries in _brin_end_parallel (if needed), or add
>>>> them and then merge them using "union".
>>>>
>>>>
>>>> I just realized there's a third option to do this - we could just do
>>>> a regular parallel scan (with no particular regard to pagesPerRange), and
>>>> then do "union" when merging results from workers. It doesn't require
>>>> the sequence of TID scans, and the union would also handle the empty
>>>> ranges. The per-worker results might be much larger, though, because
>>>> each worker might produce up to the "full" BRIN index.
>>>
>>> Would it be too much effort to add a 'min_chunk_size' argument to
>>> table_beginscan_parallel (or ParallelTableScanDesc) that defines the
>>> minimum granularity of block ranges to be assigned to each process? I
>>> think that would be the most elegant solution that would require
>>> relatively little effort: table_block_parallelscan_nextpage already
>>> does parallel management of multiple chunk sizes, and I think this
>>> modification would fit quite well in that code.
>>>
>>
>> I'm confused. Isn't that pretty much exactly what 0002 does? I mean,
>> that passes pagesPerRange to table_parallelscan_initialize(), so that
>> each pagesPerRange chunk is assigned to a single worker.
>
> Huh, I overlooked that one... Sorry for that.
>
>> The trouble I described above is that the scan returns tuples, and the
>> consumer has no idea what the chunk size was or how many other workers
>> there are. Imagine you get a tuple from block 1, and then a tuple from
>> block 1000. Does that mean that the blocks in between are empty or that
>> they were processed by some other worker?
>
> If the unit of work for parallel table scans is the index's
> pages_per_range, then I think we can just fill in expected-but-missing
> ranges as 'empty' in the parallel leader during index loading, like
> the first of the two solutions you proposed.
>

Right, I think that's the right solution.

Or rather the only solution, because the other idea (generating the
empty ranges in workers) relies on the workers knowing when to generate
them. But I don't think the workers have the necessary information.
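
To sketch what that might look like in the leader (hypothetical helper
names, not the actual patch code): walk the merged per-worker results in
start_block order and insert an empty summary for every range that no
worker produced, roughly like:

    /*
     * Rough sketch only -- next_worker_summary() and insert_range_summary()
     * are made-up placeholders for reading the merged worker results and
     * for inserting a summary into the index.
     */
    BlockNumber next_range = 0;     /* start of the next expected range */
    WorkerSummary *ws;

    while ((ws = next_worker_summary()) != NULL)   /* start_block order */
    {
        /* ranges nobody summarized get an empty summary */
        while (next_range < ws->start_block)
        {
            insert_range_summary(index, next_range, empty_summary());
            next_range += pagesPerRange;
        }

        insert_range_summary(index, ws->start_block, ws->summary);
        next_range = ws->start_block + pagesPerRange;
    }

    /* and the trailing ranges at the end of the table, if any */
    while (next_range < nblocks)
    {
        insert_range_summary(index, next_range, empty_summary());
        next_range += pagesPerRange;
    }

The leader has everything it needs for this (pagesPerRange and the table
size), which is exactly what the workers are missing.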

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
