Re: WIP: parallel GiST index builds

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP: parallel GiST index builds
Date: 2024-07-21 20:42:22
Message-ID: 9165347f-825a-4885-b9bd-f2fff34f9b3f@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7/21/24 21:31, Andrey M. Borodin wrote:
> Hi Tomas!
>
>> On 7 Jun 2024, at 20:41, Tomas Vondra
>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> After looking into parallel builds for BRIN and GIN indexes, I was
>> wondering if there's a way to do parallel builds for GiST too. I
>> knew next to nothing about how GiST works, but I gave it a shot and
>> here's what I have - the attached patch allows parallel GiST builds
>> for the "unsorted" case (i.e. when the opclass does not include
>> sortsupport), and does not support buffered builds.
>
> I think this totally makes sense. I've took a look into tuples
> partitioning (for sorted build) in your Github and I see that it's
> complicated feature. So, probably, we can do it later. I'm trying to
> review the patch as it is now. Currently I have some questions about
> code.

OK. I'm not even sure partitioning is the right approach for sorted
builds. Or how to do it, exactly.

>
> 1. Do I get it right that is_parallel argument for gistGetFakeLSN()
> is only needed for assertion? And this assertion can be ensured just
> by inspecting code. Is it really necessary?

Yes, in the patch it's only used for an assert. But it's actually
incorrect - just as I speculated in my initial message (in the section
about gistGetFakeLSN), it sometimes fails to detect a page split. I
noticed that while testing the patch adding GiST to amcheck, and I
reported that in that thread [1]. But I apparently forgot to post an
updated version of this patch :-(

I'll post a new version tomorrow, but in short it needs to update the
fake LSN even if (lastlsn != currlsn) if is_parallel=true. It's a bit
annoying this means we generate a new fake LSN on every call, and I'm
not sure that's actually necessary. But I've been unable to come up with
a better condition when to generate a new LSN.

[1]
https://www.postgresql.org/message-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf%40enterprisedb.com

> 2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be
> not used anywhere. AFAIK it's only needed to buffered build. 3. I
> think we need a test that reliably triggers parallel and serial
> builds.
>

Yeah, it's possible the variable is unused. Agreed on the testing.

> As far as I know, there's a well known trick to build better GiST
> over PostGIS data: randomize input. I think parallel scan is just
> what is needed, it will shuffle tuples enough...
>

I'm not sure I understand this comment. What do you mean by "better
GiST" or what does that mean for this patch?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-07-21 21:04:42 Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
Previous Message Joe Conway 2024-07-21 20:34:01 Re: CI, macports, darwin version problems