From: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Aliaksandr Kalenik <akalenik(at)kontur(dot)io>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: [PATCH] reduce page overlap of GiST indexes built using sorted method |
Date: | 2021-12-25 13:35:30 |
Message-ID: | 23731640439330@vla1-4ea76ba32639.qloud-c.yandex.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi Aliaksandr!
Thanks for working on this!
> Benchmark summary:
>
> create index roads_rdr_idx on roads_rdr using gist (geom);
>
> with sort support before patch / CREATE INDEX 76.709 ms
>
> with sort support after patch / CREATE INDEX 225.238 ms
>
> without sort support / CREATE INDEX 446.071 ms
>
> select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
>
> with sort support before patch / SELECT 5766.526 ms
>
> with sort support after patch / SELECT 2646.554 ms
>
> without sort support / SELECT 2721.718 ms
>
> index size
>
> with sort support before patch / IDXSIZE 2940928 bytes
>
> with sort support after patch / IDXSIZE 4956160 bytes
>
> without sort support / IDXSIZE 5447680 bytes
The numbers are impressive, newly build index is actually performing better!
I've conducted some tests over points, not PostGIS geometry. For points build is 2x slower now, but this is the cost of faster scans.
Some strong points of this index building technology.
The proposed algorithm is not randomly chosen as anything that performs better than the original sorting build. We actually researched every idea we knew from literature and intuition. Although we consciously limited the search area by existing GiST API.
Stuff like penalty-based choose-subtree and split in equal halves seriously limit possible solutions. If anyone knows an any better way to build GiST faster or with better scan performance - please let us know.
The proposed algorithm contains the current algorithm as a special case: there is a parameter - the number of buffers accumulated before calling Split. If this parameter is 1 proposed algorithm will produce exactly the same output.
At this stage, we would like to hear some feedback from Postgres and PostGIS community. What other performance aspects should we test?
Current patch implementation has some known deficiencies:
1. We need a GUC instead of the hard-coded buffer of 8 pages.
2. Is GiST sorting build still deterministic? If not - we should add a fixed random seed into pageinspect tests.
3. It would make sense to check the resulting indexes with amcheck [0], although it's not committed.
4. We cannot make an exact fillfactor due to the behavior of picksplit. But can we improve anything here? I think if not - it's still OK.
5. GistSortedBuildPageState is no more page state. It's Level state or something like that.
6. The patch desperately needs comments.
Thanks!
Best regards, Andrey Borodin.
[0] https://www.postgresql.org/message-id/flat/59D0DA6B-1652-4D44-B0EF-A582D5824F83%40yandex-team.ru
From | Date | Subject | |
---|---|---|---|
Next Message | Pavel Stehule | 2021-12-25 17:20:36 | Re: Schema variables - new implementation for Postgres 15 |
Previous Message | Joel Jacobson | 2021-12-25 11:02:29 | Re: [PATCH] regexp_positions ( string text, pattern text, flags text ) → setof int4range[] |