From: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
---|---|
To: | "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: WIP: parallel GiST index builds |
Date: | 2024-07-21 20:42:22 |
Message-ID: | 9165347f-825a-4885-b9bd-f2fff34f9b3f@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 7/21/24 21:31, Andrey M. Borodin wrote:
> Hi Tomas!
>
>> On 7 Jun 2024, at 20:41, Tomas Vondra
>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> After looking into parallel builds for BRIN and GIN indexes, I was
>> wondering if there's a way to do parallel builds for GiST too. I
>> knew next to nothing about how GiST works, but I gave it a shot and
>> here's what I have - the attached patch allows parallel GiST builds
>> for the "unsorted" case (i.e. when the opclass does not include
>> sortsupport), and does not support buffered builds.
>
> I think this totally makes sense. I've took a look into tuples
> partitioning (for sorted build) in your Github and I see that it's
> complicated feature. So, probably, we can do it later. I'm trying to
> review the patch as it is now. Currently I have some questions about
> code.
OK. I'm not even sure partitioning is the right approach for sorted
builds. Or how to do it, exactly.
>
> 1. Do I get it right that is_parallel argument for gistGetFakeLSN()
> is only needed for assertion? And this assertion can be ensured just
> by inspecting code. Is it really necessary?
Yes, in the patch it's only used for an assert. But it's actually
incorrect - just as I speculated in my initial message (in the section
about gistGetFakeLSN), it sometimes fails to detect a page split. I
noticed that while testing the patch adding GiST to amcheck, and I
reported that in that thread [1]. But I apparently forgot to post an
updated version of this patch :-(
I'll post a new version tomorrow, but in short it needs to update the
fake LSN even if (lastlsn != currlsn) if is_parallel=true. It's a bit
annoying this means we generate a new fake LSN on every call, and I'm
not sure that's actually necessary. But I've been unable to come up with
a better condition when to generate a new LSN.
[1]
https://www.postgresql.org/message-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf%40enterprisedb.com
> 2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be
> not used anywhere. AFAIK it's only needed to buffered build. 3. I
> think we need a test that reliably triggers parallel and serial
> builds.
>
Yeah, it's possible the variable is unused. Agreed on the testing.
> As far as I know, there's a well known trick to build better GiST
> over PostGIS data: randomize input. I think parallel scan is just
> what is needed, it will shuffle tuples enough...
>
I'm not sure I understand this comment. What do you mean by "better
GiST" or what does that mean for this patch?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2024-07-21 21:04:42 | Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin |
Previous Message | Joe Conway | 2024-07-21 20:34:01 | Re: CI, macports, darwin version problems |