Quick Links

Re: Parallel CREATE INDEX for GIN indexes

From:	Andy Fan <zhihuifan1213(at)163(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Parallel CREATE INDEX for GIN indexes
Date:	2024-05-09 10:14:40
Message-ID:	87pltvmgdm.fsf@163.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> writes:

> 3) v20240502-0003-Remove-the-explicit-pg_qsort-in-workers.patch
>
> In 0002 the workers still do an explicit qsort() on the TID list before
> writing the data into the shared tuplesort. But we can do better - the
> workers can do a merge sort too. To help with this, we add the first TID
> to the tuplesort tuple, and sort by that too - it helps the workers to
> process the data in an order that allows simple concatenation instead of
> the full mergesort.
>
> Note: There's a non-obvious issue due to parallel scans always being
> "sync scans", which may lead to very "wide" TID ranges when the scan
> wraps around. More about that later.

This is really amazing.

> 7) v20240502-0007-Detect-wrap-around-in-parallel-callback.patch
>
> There's one more efficiency problem - the parallel scans are required to
> be synchronized, i.e. the scan may start half-way through the table, and
> then wrap around. Which however means the TID list will have a very wide
> range of TID values, essentially the min and max of for the key.
>
> Without 0006 this would cause frequent failures of the index build, with
> the error I already mentioned:
>
> ERROR: could not split GIN page; all old items didn't fit

I have two questions here and both of them are generall gin index questions
rather than the patch here.

1. What does the "wrap around" mean in the "the scan may start half-way
through the table, and then wrap around". Searching "wrap" in
gin/README gets nothing.

2. I can't understand the below error.

> ERROR: could not split GIN page; all old items didn't fit

When the posting list is too long, we have posting tree strategy. so in
which sistuation we could get this ERROR.

> issue with efficiency - having such a wide TID list forces the mergesort
> to actually walk the lists, because this wide list overlaps with every
> other list produced by the worker.

If we split the blocks among worker 1-block by 1-block, we will have a
serious issue like here. If we can have N-block by N-block, and N-block
is somehow fill the work_mem which makes the dedicated temp file, we
can make things much better, can we?

--
Best Regards
Andy Fan

In response to

Parallel CREATE INDEX for GIN indexes at 2024-05-02 15:19:03 from Tomas Vondra

Responses

Re: Parallel CREATE INDEX for GIN indexes at 2024-05-09 12:19:23 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Dagfinn Ilmari Mannsåker	2024-05-09 10:22:06	Re: First draft of PG 17 release notes
Previous Message	jian he	2024-05-09 10:00:24	Re: First draft of PG 17 release notes