From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: WIP: parallel GiST index builds
Date: 2024-06-07 17:41:10
Message-ID: 0e4f4af8-1088-4995-b2fb-8f92b5c6cef9@enterprisedb.com
Lists: pgsql-hackers

Hi,

After looking into parallel builds for BRIN and GIN indexes, I was
wondering if there's a way to do parallel builds for GiST too. I knew
next to nothing about how GiST works, but I gave it a shot and here's
what I have - the attached patch allows parallel GiST builds for the
"unsorted" case (i.e. when the opclass does not include sortsupport),
and it does not support buffered builds.

unsorted builds only
--------------------

Addressing only the unsorted case may seem a bit weird, but I did it
this way for two reasons. First, parallel sort is a solved problem -
it's what btree does, for example - so adding support for sorted builds
to the patch later seems quite straightforward. Second, I wasn't sure
how common sorted builds actually are - we do have sortsupport for
points, but I had no idea whether e.g. the PostGIS opclasses define it.
My guess was "no", but I've been told that's no longer true, so sorted
builds are probably more widely applicable than I thought.

In any case, I'm not in a rush to parallelize sorted builds. IMHO it
can be added later, as an improvement. In fact, it's a well-isolated
piece of work, which might make it a good choice for someone looking
for an idea for their first patch ...

buffered builds
---------------

The lack of support for buffered builds is a very different thing. The
basic idea of buffered builds is that we don't push the index entries
all the way down to the leaf pages right away, but accumulate them in
buffers attached to pages part-way down the tree. This batches the
writes and reduces random I/O, which is nice.

Unfortunately, the way buffering is implemented does not work with
parallel builds at all - all the state is kept in private memory, and
the code assumes the backend doing the build is the only one that can
split a page (at which point the associated buffers have to be split
too, etc.). For parallel builds that's obviously not true.

I'm not saying parallel builds can't do similar buffering, but it would
require moving the buffers into shared memory and introducing locking
to coordinate access to them. (Or perhaps it might be enough to only
"notify" the other workers about page splits, with the buffers still in
private memory?) Anyway, it seems far too complicated for v1.
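
Just to illustrate the "notify" idea, here's a purely hypothetical
sketch (none of this is in the attached patch): the buffers stay in each
worker's private memory, and only a split counter lives in the shared
build state, so a worker can cheaply check whether any splits happened
since it last looked.

/*
 * Hypothetical sketch only, not part of the attached patch: keep the
 * buffers in worker-private memory, but publish page splits through a
 * counter in the shared build state.
 */
typedef struct GistBuildSharedCounters
{
    slock_t     mutex;      /* protects nsplits */
    uint64      nsplits;    /* bumped by whichever worker splits a page */
} GistBuildSharedCounters;

/* called by the worker that performs a page split */
static void
gist_parallel_note_split(GistBuildSharedCounters *shared)
{
    SpinLockAcquire(&shared->mutex);
    shared->nsplits++;
    SpinLockRelease(&shared->mutex);
}

/* other workers would compare nsplits to the last value they saw */

(The hard part would of course be deciding what a worker does once it
notices a split happened - which is exactly the coordination I'd like
to avoid in v1.)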

In fact, I'm not sure the buffering is entirely necessary - maybe the
amount of RAM in current machines makes this less of an issue? If the
index fits into shared buffers (or at least the page cache), maybe the
amount of extra I/O is not that bad? I'm sure there are cases really
affected by this, but maybe it's OK to tell people to disable parallel
builds in those cases?

gistGetFakeLSN
--------------

One more thing - GiST disables WAL-logging during the build, and logs
the whole index just once at the end. For serial builds this is fine,
because there are no concurrent splits, so we don't need to rely on
page LSNs to detect them (in fact, it uses a bogus value).

But for parallel builds this would not work - we need page LSNs that
actually change, otherwise we'd miss page splits and the index build
would either fail or produce a broken index. The existing is_build flag
affects both things, though, so I had to introduce a new "is_parallel"
flag which only affects the page LSN part, by using the gistGetFakeLSN()
function, previously used only for unlogged indexes.

This means we'll produce WAL during the index build (because
gistGetFakeLSN() writes a trivial message into WAL). Compared to serial
builds this produces maybe 25-75% more WAL, but it's an order of
magnitude less than with "full" WAL-logging (is_build=false).

For example, a serial build of a 5GB index needs ~5GB of WAL. A
parallel build may need ~7GB, while a parallel build with "full"
logging would use 50GB. I think this is a reasonable trade-off.

There's one "strange" thing, though - the amount of WAL decreases with
the number of parallel workers. Consider for example an index on a
numeric field, where the index is ~9GB, but the amount of WAL changes
like this (0 workers means serial builds):

parallel workers     0     1     3     5     7
WAL (GB)           5.7   9.2   7.6   7.0   6.8

The explanation for this is fairly simple (AFAIK) - gistGetFakeLSN
determines whether it actually needs to assign a new LSN (and write a
record to WAL) by comparing the last LSN it assigned (which is tracked
per backend, i.e. per worker) to the current insert LSN. But the current
insert LSN may have been advanced by some other worker, in which case we
simply reuse it. This means multiple workers may use the same fake LSN,
and the likelihood increases with the number of workers - which is
consistent with the observed behavior of the WAL volume decreasing as
the number of workers increases (because more workers share the same
LSN).
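
For reference, the relevant part of gistGetFakeLSN() (the branch for
permanent relations in gistutil.c) looks roughly like this, slightly
simplified - note that lastlsn is a backend-local static, while
GetXLogInsertRecPtr() sees WAL inserted by all backends:

    static XLogRecPtr lastlsn = InvalidXLogRecPtr;  /* per backend/worker */
    XLogRecPtr  currlsn = GetXLogInsertRecPtr();    /* global insert position */

    /*
     * Only write a trivial WAL record if we'd otherwise hand out the same
     * LSN twice in this backend. If another worker has advanced the insert
     * LSN in the meantime, we simply reuse that - which is why multiple
     * workers may end up with the same fake LSN.
     */
    if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
        currlsn = gistXLogAssignLSN();

    lastlsn = currlsn;
    return currlsn;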

I'm not entirely sure if this is OK or a problem. I was worried two
workers might end up using the same LSN for the same page, leading to
other workers not noticing the split. But after a week of pretty
intensive stress testing, I haven't seen a single such failure ...

If this turns out to be a problem, the fix is IMHO quite simple - it
should be enough to force gistGetFakeLSN() to produce a new fake LSN
every time when is_parallel=true.
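
Something like this, assuming the flag gets passed down to
gistGetFakeLSN() one way or another (untested sketch):

    /*
     * Untested sketch: in parallel builds, always emit a new assign-LSN
     * record instead of possibly reusing the current insert LSN, so that
     * no two calls can return the same value.
     */
    if (is_parallel)
        return gistXLogAssignLSN();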

performance
-----------

Obviously, the primary goal of the patch is to speed up the builds, so
does it actually do that? For indexes of different sizes I got these
timings (in seconds):

scale    type            0        1        3        5        7
------------------------------------------------------------------
small    inet           13        7        4        4        2
         numeric       239      122       67       46       36
         oid            15        8        5        3        2
         text           71       35       19       13       10
medium   inet          207      111       59       42       32
         numeric      3409     1714      885      618      490
         oid           214      114       60       43       33
         text          940      479      247      180      134
large    inet         2167     1459      865      632      468
         numeric     38125    20256    10454     7487     5846
         oid          2647     1490      808      594      475
         text        10987     6298     3376     2462     1961

Here "small" means a ~100-200MB index, "medium" 1-2GB, and "large"
10-20GB, depending on the data type.

The raw duration is not particularly easy to interpret, so let's look at
the "actual speedup" which is calculated as

(serial duration) / (parallel duration)

and the table looks like this:

scale    type           1      3      5      7
--------------------------------------------------------------
small    inet         1.9    3.3    3.3    6.5
         numeric      2.0    3.6    5.2    6.6
         oid          1.9    3.0    5.0    7.5
         text         2.0    3.7    5.5    7.1
medium   inet         1.9    3.5    4.9    6.5
         numeric      2.0    3.9    5.5    7.0
         oid          1.9    3.6    5.0    6.5
         text         2.0    3.8    5.2    7.0
large    inet         1.5    2.5    3.4    4.6
         numeric      1.9    3.6    5.1    6.5
         oid          1.8    3.3    4.5    5.6
         text         1.7    3.3    4.5    5.6

Ideally (if the build scaled linearly with the number of workers), we'd
get a speedup equal to the number of workers + 1 (because the leader
participates too). Obviously it's not quite that good - for example, for
the large text index we get 3.3 instead of 4.0 with 3 workers, and 5.6
instead of 8.0 with 7 workers.

But I think those numbers are actually pretty good - I'd definitely not
complain if my index builds got 5x faster.

But those are synthetic tests on random data, using the btree_gist
opclasses. It'd be interesting if people could do their own testing on
real-world data sets.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
v20240607-0001-WIP-parallel-GiST-build.patch text/x-patch 36.7 KB
