Re: WIP: parallel GiST index builds

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP: parallel GiST index builds
Date: 2024-07-01 20:20:41
Message-ID: c083882d-b406-4993-8121-f79ddb04d93a@enterprisedb.com
Lists: pgsql-hackers

Hi,

I've done a number of experiments with the GiST parallel builds, both
with the sorted and unsorted cases, so let me share some of the results
and conclusions from that.

In the first post I did some benchmarks using btree_gist, but that
seemed not very realistic - there certainly are much more widely used
GiST indexes in the GIS world. So this time I used OpenStreetMap, loaded
using osm2pgsql, with two dataset sizes:

- small - "north america" (121GB without indexes)
- large - "planet" (688GB without indexes)

And then I created indexes using either gist_geometry_ops_2d (with sort)
or gist_geometry_ops_nd (no sorting).
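
For reference, the index definitions look roughly like this (just a
sketch, assuming the usual osm2pgsql table/column names, e.g. the
planet_osm_point table with a "way" geometry column):

    -- sorted build (gist_geometry_ops_2d provides sortsupport)
    CREATE INDEX ON planet_osm_point USING gist (way gist_geometry_ops_2d);

    -- unsorted build (gist_geometry_ops_nd has no sortsupport)
    CREATE INDEX ON planet_osm_point USING gist (way gist_geometry_ops_nd);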

On 6/7/24 19:41, Tomas Vondra wrote:
> Hi,
>
> After looking into parallel builds for BRIN and GIN indexes, I was
> wondering if there's a way to do parallel builds for GiST too. I knew
> next to nothing about how GiST works, but I gave it a shot and here's
> what I have - the attached patch allows parallel GiST builds for the
> "unsorted" case (i.e. when the opclass does not include sortsupport),
> and does not support buffered builds.
>
>
> unsorted builds only
> --------------------
>
> Addressing only the unsorted case may seem a bit weird, but I did it
> this way for two reasons - parallel sort is a solved problem, and adding
> this to the patch seems quite straightforward. It's what btree does, for
> example. But I also was not very sure how common this is - we do have
> sort for points, but I have no idea if the PostGIS indexes define
> sorting etc. My guess was "no" but I've been told that's no longer true,
> so I guess sorted builds are more widely applicable than I thought.

For sorted builds, I made the claim that parallelizing them is a
"solved problem" because we can use a parallel tuplesort. I was thinking
it might be better to do that in the initial patch, and only then
introduce the more complex stuff for the unsorted case, so I gave this a
try, and it turned out to be rather pointless.

Yes, parallel tuplesort does improve the duration, but it's not a very
significant improvement - maybe 10% or so. Most of the build time is
spent in gist_indexsortbuild(), so this is the part that would need to
be parallelized for any substantial improvement. Only then is it useful
to improve the tuplesort, I think.

And parallelizing gist_indexsortbuild() is not trivial - most of the
time is spent in gist_indexsortbuild_levelstate_flush() / gistSplit(),
so ISTM a successful parallel implementation would need to divide this
work between multiple workers. I don't have a clear idea how, though.

I do have a PoC/WIP patch doing the parallel tuplesort in my github
branch at [1] (along with some ugly experiments on top of that), but
I'm not going to attach it here, for the reasons I just explained. It'd
be just a pointless distraction.

> In any case, I'm not in a rush to parallelize sorted builds. It can be
> added later, as an improvement, IMHO. In fact, it's a well isolated part
> of the patch, which might make it a good choice for someone looking for
> an idea for their first patch ...
>

I still think this assessment is correct - it's fine not to parallelize
sorted builds. It can be improved in the future, or perhaps not at all.

>
> buffered builds
> ---------------
>
> The lack of support for buffered builds is a very different thing. The
> basic idea is that we don't push the index entries all the way to the
> leaf pages right away, but accumulate them in buffers half-way through.
> This combines writes and reduces random I/O, which is nice.
>
> Unfortunately, the way it's implemented does not work with parallel
> builds at all - all the state is in private memory, and it assumes the
> worker is the only possible backend that can split the page (at which
> point the buffers need to be split too, etc.). But for parallel builds
> this is obviously not true.
>
> I'm not saying parallel builds can't do similar buffering, but it
> requires moving the buffers into shared memory, and introducing locking
> to coordinate accesses to the buffers. (Or perhaps it might be enough to
> only "notify" the workers about page splits, with buffers still in
> private memory?). Anyway, it seems far too complicated for v1.
>
> In fact, I'm not sure the buffering is entirely necessary - maybe the
> increase in amount of RAM makes this less of an issue? If the index can
> fit into shared buffers (or at least page cache), maybe the amount of
> extra I/O is not that bad? I'm sure there may be cases really affected
> by this, but maybe it's OK to tell people to disable parallel builds in
> those cases?
>

For unsorted builds, here are the results from one of the machines - the
duration of CREATE INDEX with the requested number of workers (0 means a
serial build), for the different tables in the OSM databases:

db      type     size (MB) |      0      1      2      3      4
---------------------------|-----------------------------------
small   line          4889 |    811    429    294    223    186
        point         2625 |    485    262    179    141    125
        polygon       7644 |   1230    623    418    318    261
        roads          273 |     40     22     16     14     12
---------------------------|-----------------------------------
large   line         20592 |   3916   2157   1479   1137    948
        point        13080 |   2636   1442    981    770    667
        polygon      50598 |  10990   5648   3860   2991   2504
        roads         1322 |    228    123     85     67     56

The table also shows the size of each index. If we calculate the speedup
relative to the serial build, we get this:

db      type    |     1     2     3     4
----------------|------------------------
small   line    |   1.9   2.8   3.6   4.4
        point   |   1.9   2.7   3.4   3.9
        polygon |   2.0   2.9   3.9   4.7
        roads   |   1.8   2.5   2.9   3.3
----------------|------------------------
large   line    |   1.8   2.6   3.4   4.1
        point   |   1.8   2.7   3.4   4.0
        polygon |   1.9   2.8   3.7   4.4
        roads   |   1.9   2.7   3.4   4.1

Remember, the leader participates in the build, so K workers means K+1
processes are doing the work. And the speedup is pretty close to the
ideal speedup.
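
(For reference, requesting a given number of workers for a build looks
roughly like this - a sketch rather than the exact benchmark script:)

    -- cap on parallel workers for maintenance commands (CREATE INDEX etc.)
    SET max_parallel_maintenance_workers = 4;

    -- optionally pin the worker count for the table, instead of relying
    -- on the size-based heuristic
    ALTER TABLE planet_osm_polygon SET (parallel_workers = 4);

    CREATE INDEX ON planet_osm_polygon USING gist (way gist_geometry_ops_nd);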

There's still the question of buffering, though - as mentioned in the
initial message, parallel builds do not support buffering, so the
question is how bad the impact of that is. Clearly the duration improves
a lot, which is good, but maybe the build wrote out far more buffers and
the NVMe drive simply handled it well?

So I used pg_stat_statements to track the amount of buffer writes
(shared_blks_written) for the CREATE INDEX statements, and for the large
data set it looks like this (in MB written):

type     size (MB) |       0       1       2       3       4
-------------------|----------------------------------------
line         20592 |   43577   47580   49574   50388   50734
point        13080 |   23331   25721   26399   26745   26889
polygon      50598 |  113108  125095  129599  130170  131249
roads         1322 |    1322    1310    1305    1300    1295
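
(The numbers come from a query roughly like this, with the block counts
converted to MB assuming the default 8kB block size - a sketch, not the
exact query:)

    SELECT query,
           shared_blks_written * 8192 / (1024 * 1024) AS written_mb
      FROM pg_stat_statements
     WHERE query LIKE 'CREATE INDEX%';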

The serial builds (0 workers) are buffered, but the buffering only kicks
in for indexes that exceed effective_cache_size (4GB here). That means
the "roads" index is too small to activate buffering, so there should be
very little difference - which is the case (although that index also
fits into shared buffers anyway).

The other indexes do activate buffering, so the question is how many
more buffers get written out compared to serial builds (with buffering).
And the comparison looks like this:

               1      2      3      4
-------------------------------------
line        109%   114%   116%   116%
point       110%   113%   115%   115%
polygon     111%   115%   115%   116%
roads        99%    99%    98%    98%

So the parallel builds write about 15-20% more buffers during the index
build, which is not that much, IMHO. I was wondering if this might
change with smaller shared buffers, so I tried building the indexes on
the smaller data set with 128MB shared buffers, but the difference
stayed at ~15-20%.

My conclusion from this is that it's OK to have parallel builds without
buffering, and maybe improve that later. The thing I'm not sure about is
how this should interact with the "buffering" option. Right now we just
ignore it entirely if we decide to do a parallel build. But maybe it'd
be better to disable parallel builds when the user explicitly specifies
"buffering=on" (and only allow parallel builds with off/auto)?

I did check how parallelism affects the amount of WAL produced, but that
behaves pretty much exactly as I described in the initial message,
including the "strange" decrease with more workers due to reusing the
fake LSN etc.

regards

[1] https://github.com/tvondra/postgres/tree/parallel-gist-20240625

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
v20240701-0001-WIP-parallel-GiST-build.patch text/x-patch 36.7 KB
gist-i5.tgz application/x-compressed-tar 19.9 KB
gist-xeon.tgz application/x-compressed-tar 13.3 KB
