Re: scalability bottlenecks with (many) partitions (and more)

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Date: 2024-09-22 12:14:21
Message-ID: 9aa1009a-43a7-4652-9e85-51ba54de9b01@vondra.me
Lists: pgsql-hackers

On 9/22/24 10:50, Ants Aasma wrote:
> On Sat, 21 Sept 2024 at 21:33, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> I've finally pushed this, after many rounds of careful testing to ensure
>> no regressions, and polishing. All changes since the version shared on
>> September 13 are only cosmetic - renaming a macro to keep it consistent
>> with the other ones, clarifying a couple comments etc. Nothing major.
>
> Great work on this, I have seen multiple customers hitting fast path
> capacity related LockManager contention. They will certainly be glad
> to have a fix available when they eventually upgrade. Regretfully I
> did not find the time to participate in this discussion during
> development. But I did have some thoughts that I wanted to unload to
> the list, not as a criticism, but in case it turns out follow up work
> is needed.
>
> Driving the array sizing from max_locks_per_transaction seems like a
> good idea. The one major difference from the lock table is that while
> the lock table is partitioned dynamically between backends, the fast
> path array has a static size per backend. One case where that
> distinction matters is when only a fraction of backends try to lock
> large numbers of relations. This fraction will still fall back to main
> lock tables, but at least the contention should be limited by virtue
> of not having too many of those backends. The other case is when
> max_connections is much higher than the number of backends actually
> used. Then backends may be consuming well over
> max_locks_per_transaction without running into lock table capacity
> issues.
>

I agree. I don't think the case with a couple of lock-hungry backends
matters too much: as you say, there can't be too many of them, so the
contention should not be too bad. At least that was my reasoning.

Regarding the case with very high max_connections values - I doubt we
want to optimize for that very much. Extremely high max_connections
values are a clear anti-pattern (IMO), and if you choose to do that
anyway, you simply have to accept that connections have costs. The
memory for fast-path locking is one of those costs.

I'm not against improving that, of course, but I think we should only do
that if it doesn't hurt the "reasonable" setups.

> In both cases users will have the simple workaround of just increasing
> the max_locks_per_transaction setting. Still, I'm sure they would be
> happier if things just worked without any tuning. So I tried to figure
> out some scheme to get dynamic allocation of fast path locks.
>

I agree with the premise that less tuning is better. Which is why we
tied this to max_locks_per_transaction.
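To illustrate what "tied to max_locks_per_transaction" means in practice,
here's a simplified sketch of the sizing logic (the names and the upper
bound are made up for the example, and the committed code differs in the
details): each backend gets some number of 16-slot fast-path groups,
rounded up to a power of two, large enough to cover the configured value.

/*
 * Simplified sketch of the fast-path sizing, with made-up names.
 * Grow the number of 16-slot groups (a power of two, with an upper
 * bound) until the per-backend capacity covers the configured
 * max_locks_per_transaction.
 */
#define FP_SLOTS_PER_GROUP  16      /* slots in one fast-path group */
#define FP_MAX_GROUPS       1024    /* arbitrary cap for the example */

static int
fp_groups_for(int max_locks_per_transaction)
{
    int     groups = 1;

    while (groups < FP_MAX_GROUPS &&
           groups * FP_SLOTS_PER_GROUP < max_locks_per_transaction)
        groups *= 2;

    return groups;
}

So with the default max_locks_per_transaction = 64 this gives 4 groups
(64 slots) per backend, and because the array is still per-backend, the
memory cost scales with max_connections.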

> The best data structure I came up with was to have a shared fast path
> lock array. Still partitioned as a 16-way associative cache, but
> indexed by hash(BackendId, RelationId). fpLockBits can be stuffed into
> the high byte of BackendId thanks to MAX_BACKENDS. Locking could be
> handled by one lock per way, or at least on cursory glance it
> shouldn't be too difficult to convert the whole fast path acquisition
> to be lock free.
>
> Either way, it feels like structuring the array this way could result
> in a large amount of false sharing of cache lines. Current static
> allocation means that each process needs to touch only a small set of
> cache lines only referenced by itself - quite probable to keep those
> lines in CPU local L2 in exclusive mode. In a shared array a larger
> number of cache lines are needed and they will be concurrently written
> to by other backends - lots of invalidation messages and cache line
> bouncing. I don't know how large this effect will be without doing a
> prototype and running it on a large machine with high core-to-core
> latencies.
>

I don't have a very good intuition regarding cache lines. Ideally, the
backends would access disjoint parts of the array, so there should not
be a lot of false sharing. But maybe I'm wrong - hard to say without an
experimental patch.
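
To make the discussion a bit more concrete, here is a rough sketch of how
I read your proposal (all names invented, locking and atomics omitted
entirely): a shared array of 16-entry sets, indexed by a hash of
(backend id, relation oid), with the lock-mode bits packed into the high
byte of the backend word, which works because MAX_BACKENDS fits well
below 2^24.

/*
 * Rough sketch of the shared fast-path table described above. Purely
 * illustrative: all names invented, no locking shown. Each set holds 16
 * entries; an acquisition hashes (backend id, relation oid) to pick a
 * set and scans its entries.
 */
#include <stdint.h>

#define FP_SET_SIZE     16          /* entries per associative set */

typedef struct SharedFastPathEntry
{
    uint32_t    owner;      /* backend id in low 24 bits, lock modes in high byte */
    uint32_t    relid;      /* OID of the locked relation */
} SharedFastPathEntry;

typedef struct SharedFastPathTable
{
    uint32_t    nsets;      /* number of sets, a power of two */
    SharedFastPathEntry entries[];  /* nsets * FP_SET_SIZE entries */
} SharedFastPathTable;

/* pick the set for a (backend, relation) pair; the hash is arbitrary here */
static inline uint32_t
fp_set_for(const SharedFastPathTable *tab, uint32_t backend_id, uint32_t relid)
{
    uint32_t    h = backend_id * 0x9E3779B1u ^ relid * 0x85EBCA77u;

    return h & (tab->nsets - 1);
}

With 8-byte entries, each 16-entry set spans two 64-byte cache lines that
every backend hashing into it writes to, which is exactly where the
false-sharing worry comes from. How much that costs in practice I can't
say without trying it on a big machine.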

> It would be possible to create a hybrid approach of a small local FP
> array servicing the majority of acquisitions with a larger shared
> victim cache for exceptional cases. But it doesn't feel like it is
> worth the complexity. At least not without seeing some example
> workloads where it would help. And even then, maybe using hierarchical
> locking to do less work is the better approach.
>

Not sure. My intuition would be to keep this as simple as possible.
Having a shared lock table and also a separate fast-path cache is
already sufficiently complex, and adding a cache for a cache seems a bit
too much to me.

> Being optimistic, perhaps the current patch was enough to resolve the issue.
>

It's an improvement. But if you want to give the shared fast-path cache
a try, go ahead - if you write a patch, I promise to review it.

regards

--
Tomas Vondra
