Re: [PATCH] reduce page overlap of GiST indexes built using sorted method

From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Aliaksandr Kalenik <akalenik(at)kontur(dot)io>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: [PATCH] reduce page overlap of GiST indexes built using sorted method
Date: 2021-12-25 13:35:30
Message-ID: 23731640439330@vla1-4ea76ba32639.qloud-c.yandex.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hi Aliaksandr!

Thanks for working on this!

> Benchmark summary:
>
> create index roads_rdr_idx on roads_rdr using gist (geom);
>
> with sort support before patch / CREATE INDEX 76.709 ms
>
> with sort support after patch / CREATE INDEX 225.238 ms
>
> without sort support / CREATE INDEX 446.071 ms
>
> select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
>
> with sort support before patch / SELECT 5766.526 ms
>
> with sort support after patch / SELECT 2646.554 ms
>
> without sort support / SELECT 2721.718 ms
>
> index size
>
> with sort support before patch / IDXSIZE 2940928 bytes
>
> with sort support after patch / IDXSIZE 4956160 bytes
>
> without sort support / IDXSIZE 5447680 bytes

The numbers are impressive, newly build index is actually performing better!
I've conducted some tests over points, not PostGIS geometry. For points build is 2x slower now, but this is the cost of faster scans.

Some strong points of this index building technology.
The proposed algorithm is not randomly chosen as anything that performs better than the original sorting build. We actually researched every idea we knew from literature and intuition. Although we consciously limited the search area by existing GiST API.
Stuff like penalty-based choose-subtree and split in equal halves seriously limit possible solutions. If anyone knows an any better way to build GiST faster or with better scan performance - please let us know.
The proposed algorithm contains the current algorithm as a special case: there is a parameter - the number of buffers accumulated before calling Split. If this parameter is 1 proposed algorithm will produce exactly the same output.

At this stage, we would like to hear some feedback from Postgres and PostGIS community. What other performance aspects should we test?

Current patch implementation has some known deficiencies:
1. We need a GUC instead of the hard-coded buffer of 8 pages.
2. Is GiST sorting build still deterministic? If not - we should add a fixed random seed into pageinspect tests.
3. It would make sense to check the resulting indexes with amcheck [0], although it's not committed.
4. We cannot make an exact fillfactor due to the behavior of picksplit. But can we improve anything here? I think if not - it's still OK.
5. GistSortedBuildPageState is no more page state. It's Level state or something like that.
6. The patch desperately needs comments.

Thanks!

Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/59D0DA6B-1652-4D44-B0EF-A582D5824F83%40yandex-team.ru

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2021-12-25 17:20:36 Re: Schema variables - new implementation for Postgres 15
Previous Message Joel Jacobson 2021-12-25 11:02:29 Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[]