Re: scalability bottlenecks with (many) partitions (and more)

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Date: 2024-08-05 11:38:31
Message-ID: 52afd566-d110-43ea-ae72-69349d4414f6@vondra.me
Lists: pgsql-hackers

Hi,

On 6/25/24 12:04, Tomas Vondra wrote:
>
>
> On 6/24/24 17:05, Robert Haas wrote:
>> On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
>> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
>>> LWLock table has 16 partitions by default - it's quite possible that on
>>> machine with many cores and/or many partitions, we can easily hit this.
>>> So I bumped this 4x to 64 partitions.
>>
>> I think this probably makes sense. I'm a little worried that we're
>> just kicking the can down the road here where maybe we should be
>> solving the problem in some more fundamental way, and I'm also a
>> little worried that we might be reducing single-core performance. But
>> it's probably fine.
>>
>
> Yeah, I haven't seen this causing any regressions - the sensitive paths
> typically lock only one partition, so having more of them does not
> affect that. Or if it does, it's likely a reasonable trade off as it
> reduces the risk of lock contention.
>
> That being said, I don't recall benchmarking this patch in isolation,
> only with the other patches. Maybe I should do that ...
>
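
(For reference, NUM_LOCK_PARTITIONS is defined in lwlock.h roughly like
this; bumping it 4x means going from 2^4 = 16 to 2^6 = 64 partitions:)

  /* Number of partitions of the shared lock tables */
  #define LOG2_NUM_LOCK_PARTITIONS  4
  #define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)
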
>>> What I ended up doing is having a hash table of 16-element arrays. There
>>> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
>>> have now. Each OID is mapped to exactly one of these parts as if in a
>>> hash table, and in each of those 16-element parts we do exactly the same
>>> thing we do now (linear search, removal, etc.). This works great, the
>>> locality is great, etc. The one disadvantage is this makes PGPROC
>>> larger, but I did a lot of benchmarks and I haven't seen any regression
>>> that I could attribute to this. (More about this later.)
>>
>> I think this is a reasonable approach. Some comments:
>>
>> - FastPathLocalUseInitialized seems unnecessary to me; the contents of
>> an uninitialized local variable are undefined, but an uninitialized
>> global variable always starts out zeroed.
>>
>
> OK. I didn't realize global variables start out zeroed.
>

I haven't fixed this yet, but it's pretty clear the "init" is not really
needed, because the memset() resetting the array was wrong:

memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseInitialized));

This only resets the first byte of the array, yet everything still worked
correctly - which suggests the explicit reset was never needed in the
first place.
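
For illustration, here's the shape of the bug (a sketch - the declarations
are assumed, only the memset line is from the actual patch):

  /* assumed declarations, matching the names used above */
  static bool FastPathLocalUseInitialized = false;
  static int  FastPathLocalUseCounts[FP_LOCK_GROUPS_PER_BACKEND];

  /* sizeof(FastPathLocalUseInitialized) is 1, so only 1 byte gets zeroed */
  memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseInitialized));

  /* the intended reset would zero the whole array */
  memset(FastPathLocalUseCounts, 0, sizeof(FastPathLocalUseCounts));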

>> - You need comments in various places, including here, where someone
>> is certain to have questions about the algorithm and choice of
>> constants:
>>
>> +#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
>> % FP_LOCK_GROUPS_PER_BACKEND)
>>
>
> Yeah, definitely needs comment explaining this.
>
> I admit those numbers are pretty arbitrary primes, to implement a
> trivial hash function. That was good enough for a PoC patch, but maybe
> for a "proper" version this should use a better hash function. It needs
> to be fast, and maybe it doesn't matter that much if it's not perfect.
>
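
Just to spell out what that macro does (a sketch, using the constants from
the quoted patch; a "proper" version might instead use something like
murmurhash32() to get a better distribution):

  /*
   * Map a relation OID to one of the fast-path groups, using a trivial
   * multiplicative hash with two arbitrary primes.
   */
  static inline uint32
  fast_path_rel_group(Oid relid)
  {
      return ((uint64) relid * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND;
  }
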
>> When I originally coded up the fast-path locking stuff, I supposed
>> that we couldn't make the number of slots too big because the
>> algorithm requires a linear search of the whole array. But with this
>> one trick (a partially-associative cache), there's no real reason that
>> I can think of why you can't make the number of slots almost
>> arbitrarily large. At some point you're going to use too much memory,
>> and probably before that point you're going to make the cache big
>> enough that it doesn't fit in the CPU cache of an individual core, at
>> which point possibly it will stop working as well. But honestly ... I
>> don't quite see why this approach couldn't be scaled quite far.
>>
>
> I don't think I've heard the term "partially-associative cache" before,
> but now that I look at the approach again, it very much reminds me how
> set-associative caches work (e.g. with cachelines in CPU caches). It's a
> 16-way associative cache, assigning each entry into one of 16 slots.
>
> I must have been reading some papers in this area shortly before the PoC
> patch, and the idea came from there, probably. Which is good, because it
> means it's a well-understood and widely-used approach.
>
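
For anyone skimming the thread, the structure being discussed looks roughly
like this (field and type names made up for illustration, not taken from
the actual patch):

  /*
   * Each group is exactly like the original 16-slot fast-path cache: a
   * uint64 with the lock bits plus 16 relation OIDs, searched linearly.
   * A relation OID is mapped to a single group (see the macro above), so
   * a lookup only ever scans 16 slots, regardless of the group count.
   *
   * Size per group: 8B bitmap + 16 x 4B OIDs = 72B, so 64 groups is
   * ~4.5kB, and 64 x 16 = 1024 fast-path slots per backend.
   */
  #define FP_LOCK_SLOTS_PER_GROUP     16
  #define FP_LOCK_GROUPS_PER_BACKEND  64

  typedef struct FastPathGroup
  {
      uint64  fpLockBits;                        /* lock modes of slots */
      Oid     fpRelId[FP_LOCK_SLOTS_PER_GROUP];  /* slots for rel OIDs */
  } FastPathGroup;

  /* in PGPROC: FastPathGroup fpGroups[FP_LOCK_GROUPS_PER_BACKEND]; */
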
>> Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
>> of 64 to say 65536, would that still perform well? I'm not saying we
>> should do that, because that's probably a silly amount of memory to
>> use for this, but the point is that when you start to have enough
>> partitions that you run out of lock slots, performance is going to
>> degrade, so you can imagine wanting to try to have enough lock groups
>> to make that unlikely. Which leads me to wonder if there's any
>> particular number of lock groups that is clearly "too many" or whether
>> it's just about how much memory we want to use.
>>
>
> That's an excellent question. I don't know.
>
> I agree 64 groups is pretty arbitrary, and having 1024 may not be enough
> even with a modest number of partitions. When I was thinking about using
> a higher value, my main concern was that it'd make the PGPROC entry
> larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that
> felt like quite a bit.
>
> But maybe it's fine and we could make it much larger - L3 caches tend to
> be many MBs these days, although AFAIK it's shared by threads running on
> the CPU.
>
> I'll see if I can do some more testing of this, and see if there's a
> value where it stops improving / starts degrading, etc.
>

I finally got to do those experiments. The scripts and results (both raw
and summarized) are too big to attach here; everything is available at

https://github.com/tvondra/scalability-tests

The initial patch used 64 groups (which means 1024 fast-path slots); I ran
the tests with 0, 1, 8, 32, 128, 512 and 1024 groups (so up to 16k locks).
I thought about testing with ~64k groups, but I didn't go to that extreme
because I don't quite see the point.

It would only matter for cases with a truly extreme number of partitions
(64k groups is ~1M fast-path slots), and just creating that many partitions
would take a lot of time. Moreover, with that many partitions we seem to
hit various other bottlenecks, so improving this alone would not make such
setups really practical. And everything is so slow that the benchmark
results would be somewhat bogus too.

Because if we achieve 50 tps with 1000 partitions, does it really matter
whether a patch changes that to 25 or 100 tps? I doubt it, especially if
going down to 100 partitions gives you 2000 tps. Now imagine you have 10k
or 100k partitions - how fast is that going to be?

So I think stopping at 1024 groups is sensible, and if there are some
inefficiencies / costs, I'd expect those to gradually show up even at
those lower sizes.

But if you look at the results, for example from the "join" test:

https://github.com/tvondra/scalability-tests/blob/main/join.pdf

there's no such negative effect. The table shows results for different
combinations of parameters, with the first group of columns for regular
glibc and the second group for glibc with the malloc tuning (see [1] for
details). The values are for different numbers of fast-path groups (0
means the patch was not applied).

And the color scale shows the impact of increasing the number of groups.
So for example when a column for "32 groups" says 150%, it means going
from 8 to 32 groups improved throughput to 1.5x. As usual, green is "good"
and red is "bad".

But if you look at the tables, there's very little change - most of the
values are close to 100%. This might seem a bit strange, considering the
promise of these patches is to improve throughput, and "no change" is the
absence of that. But that's because the tables only illustrate the effect
of changing the group count with the other parameters fixed. They never
compare runs with and without the glibc tuning, and that's an important
part of the improvement. Pivoting the data a bit differently would still
show a substantial 2-3x improvement.

There's a fair amount of noise - especially for the rpi5 machines (not
the right hardware for sensitive benchmarks), but also on some i5/xeon
runs. I attribute this to doing only one short run (10s) for each
combination of parameters. I'll do more runs next time.

Anyway, I think these results show a couple things:

1) There's no systemic negative effect of increasing the number of
groups. We could go with 32k or 64k groups, and it doesn't seem like
there would be a problem.

2) But there's not much point in doing that, because we run into various
other bottlenecks well before having that many locks. Based on the
results, it doesn't seem that going beyond 32 or 64 groups would give us
much.

3) The memory allocation caching (be it the mempool patch, or the glibc
tuning like in this test round) is a crucial piece for this. Not doing
that means some tests get no improvement at all, or a much smaller one.

4) The increase of NUM_LOCK_PARTITIONS has very limited effect, or
perhaps even no effect at all.

Based on this, my plan is to polish the patch adding fast-path groups,
with either 32 or 64 groups, which seem to be reasonable values. Then in
the future, if/when the other bottlenecks get addressed, we can rethink
and increase this.

This however reminds me that all those machines are pretty small. That's
good for showing the patch doesn't break existing/smaller systems, but the
initial goal was to improve behavior on big boxes. I don't have access to
the original box at the moment, so if someone could provide access to one
of those big epyc/xeon machines with 100+ cores for a couple of days, that
would be very helpful.

That being said, I think it's pretty clear how serious the memory
allocation overhead can be, especially in cases where the existing memory
context caching is ineffective (like for the btree palloc). I'm not sure
what to do about that. The mempool patch shared in this thread does the
trick, but it's fairly complex/invasive. I still like it, but maybe doing
something like the glibc tuning would be enough - it's not as effective,
but 80% of the improvement is better than no improvement.
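
For context, the glibc tuning used in these runs is the one described in
[1]; roughly speaking it just tells malloc to keep freed memory cached
instead of aggressively returning it to the kernel. A sketch of that idea
(the specific values here are illustrative, not the exact settings from
[1]):

  #include <malloc.h>

  static void
  tune_glibc_malloc(void)
  {
      /* keep extra headroom when extending the heap */
      mallopt(M_TOP_PAD, 64 * 1024 * 1024);
      /* don't trim freed memory back to the kernel so eagerly */
      mallopt(M_TRIM_THRESHOLD, 128 * 1024 * 1024);
      /* serve larger allocations from the heap instead of mmap() */
      mallopt(M_MMAP_THRESHOLD, 16 * 1024 * 1024);
  }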

regards

[1]
https://www.postgresql.org/message-id/0da51f67-c88b-497e-bb89-d5139309eb9c@enterprisedb.com

--
Tomas Vondra
