Re: scalability bottlenecks with (many) partitions (and more)

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Date: 2024-06-25 10:04:14
Message-ID: 9266c8df-0df7-4605-ba5a-b204c5e52586@enterprisedb.com
Lists: pgsql-hackers

On 6/24/24 17:05, Robert Haas wrote:
> On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
>> LWLock table has 16 partitions by default - it's quite possible that on
>> a machine with many cores and/or many partitions, we can easily hit this.
>> So I bumped this 4x to 64 partitions.
>
> I think this probably makes sense. I'm a little worried that we're
> just kicking the can down the road here where maybe we should be
> solving the problem in some more fundamental way, and I'm also a
> little worried that we might be reducing single-core performance. But
> it's probably fine.
>

Yeah, I haven't seen this causing any regressions - the sensitive paths
typically lock only one partition, so having more of them does not
affect that. And if it does, it's likely a reasonable trade-off, as it
reduces the risk of lock contention.

That being said, I don't recall benchmarking this patch in isolation,
only with the other patches. Maybe I should do that ...
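
FWIW the 0001 change itself is trivial - roughly the following (going
from memory, so the exact comments/placement in lwlock.h and lock.h may
differ a bit):

/* Number of partitions the shared lock tables are divided into */
#define LOG2_NUM_LOCK_PARTITIONS  6     /* was 4, i.e. 16 partitions */
#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)

/* each locktag hash still maps to one partition, same as before */
#define LockHashPartition(hashcode) \
    ((hashcode) % NUM_LOCK_PARTITIONS)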

>> What I ended up doing is having a hash table of 16-element arrays. There
>> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
>> have now. Each OID is mapped to exactly one of these parts as if in a
>> hash table, and in each of those 16-element parts we do exactly the same
>> thing we do now (linear search, removal, etc.). This works great, the
>> locality is great, etc. The one disadvantage is this makes PGPROC
>> larger, but I did a lot of benchmarks and I haven't seen any regression
>> that I could attribute to this. (More about this later.)
>
> I think this is a reasonable approach. Some comments:
>
> - FastPathLocalUseInitialized seems unnecessary to me; the contents of
> an uninitialized local variable are undefined, but an uninitialized
> global variable always starts out zeroed.
>

OK. I didn't realize global variables start out zeroed.
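
Right - a trivial standalone example of that rule, just so it's written
down somewhere (not patch code, obviously):

#include <stdbool.h>
#include <stdio.h>

/* file-scope (static storage duration) => implicitly zero-initialized */
static bool FastPathLocalUseInitialized;

int
main(void)
{
    /*
     * A local with automatic storage would start with an indeterminate
     * value instead, so it would need an explicit "= false".
     */
    printf("global starts as: %d\n", (int) FastPathLocalUseInitialized);
    return 0;
}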

> - You need comments in various places, including here, where someone
> is certain to have questions about the algorithm and choice of
> constants:
>
> +#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
> % FP_LOCK_GROUPS_PER_BACKEND)
>

Yeah, this definitely needs a comment explaining it.

I admit those numbers are fairly arbitrary primes, chosen to implement
a trivial hash function. That was good enough for a PoC patch, but a
"proper" version should probably use a better hash function. It needs
to be fast, and maybe it doesn't matter that much if it's not perfect.
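
Something like this, perhaps (just a sketch - the first macro is the
current one with a comment added, and the alternative assumes the group
count stays a power of two and borrows murmurhash32() from
common/hashfn.h):

/*
 * Map a relation OID to one of the fast-path groups. The constants are
 * arbitrary primes, implementing a cheap multiplicative hash - all we
 * need is a roughly uniform spread of OIDs across the groups.
 */
#define FAST_PATH_LOCK_REL_GROUP(rel) \
    (((uint64) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)

/* possible alternative, stronger mix and a mask instead of modulo */
#define FAST_PATH_LOCK_REL_GROUP_ALT(rel) \
    (murmurhash32((uint32) (rel)) & (FP_LOCK_GROUPS_PER_BACKEND - 1))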

> When I originally coded up the fast-path locking stuff, I supposed
> that we couldn't make the number of slots too big because the
> algorithm requires a linear search of the whole array. But with this
> one trick (a partially-associative cache), there's no real reason that
> I can think of why you can't make the number of slots almost
> arbitrarily large. At some point you're going to use too much memory,
> and probably before that point you're going to make the cache big
> enough that it doesn't fit in the CPU cache of an individual core, at
> which point possibly it will stop working as well. But honestly ... I
> don't quite see why this approach couldn't be scaled quite far.
>

I don't think I've heard the term "partially-associative cache" before,
but now that I look at the approach again, it very much reminds me of
how set-associative caches work (e.g. with cachelines in CPU caches).
It's essentially a 16-way set-associative cache, where each entry can
go into any of the 16 slots within its group.

I must have been reading some papers in this area shortly before writing
the PoC patch, and the idea probably came from there. Which is good,
because it means it's a well-understood and widely-used approach.
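
To spell it out, the lookup then becomes something like this (a sketch
only - FastPathFindSlot is a made-up helper name, the real logic lives
in the existing FastPathGrantRelationLock / FastPathUnGrantRelationLock
paths, and fpRelId is the per-backend array in PGPROC):

#define FP_LOCK_SLOTS_PER_GROUP  16

/* find the fast-path slot caching "relid", or -1 if it's not cached */
static int
FastPathFindSlot(PGPROC *proc, Oid relid)
{
    uint32      group = FAST_PATH_LOCK_REL_GROUP(relid);
    uint32      base = group * FP_LOCK_SLOTS_PER_GROUP;

    /* linear search, but only over the 16 slots of this one group */
    for (int i = 0; i < FP_LOCK_SLOTS_PER_GROUP; i++)
    {
        if (proc->fpRelId[base + i] == relid)
            return (int) (base + i);
    }

    return -1;
}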

> Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
> of 64 to say 65536, would that still perform well? I'm not saying we
> should do that, because that's probably a silly amount of memory to
> use for this, but the point is that when you start to have enough
> partitions that you run out of lock slots, performance is going to
> degrade, so you can imagine wanting to try to have enough lock groups
> to make that unlikely. Which leads me to wonder if there's any
> particular number of lock groups that is clearly "too many" or whether
> it's just about how much memory we want to use.
>

That's an excellent question. I don't know.

I agree 64 groups is pretty arbitrary, and the resulting 1024 slots may
not be enough even with a modest number of partitions. When I was
thinking about using a higher value, my main concern was that it'd make
the PGPROC entry larger. Each "fast-path" group is ~72B, so 64 groups is
~4.5kB, and that felt like quite a bit.
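
For the record, the arithmetic behind those numbers (assuming 4-byte
OIDs and roughly the layout described above, not the exact struct from
the patch):

#include <assert.h>
#include <stdint.h>

/* hypothetical layout of one fast-path group */
typedef struct FastPathGroup
{
    uint64_t    fpLockBits;     /* 3 bits of lock-mode info per slot */
    uint32_t    fpRelId[16];    /* OIDs of the cached relations */
} FastPathGroup;

static_assert(sizeof(FastPathGroup) == 72, "8 + 16 * 4 = 72 B per group");

/* 64 groups   =>  64 * 72 = 4608 B ~ 4.5 kB added to each PGPROC
 * 1024 groups =>  ~72 kB per backend, which starts to be real memory */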

But maybe it's fine and we could make it much larger - L3 caches tend
to be many MBs these days, although AFAIK the L3 is shared by all the
cores on the CPU.

I'll see if I can do some more testing of this, and see if there's a
value where it stops improving / starts degrading, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
