Re: scalability bottlenecks with (many) partitions (and more)

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Date: 2024-09-03 16:19:45
Message-ID: 52f026c4-6c4e-4de2-bd78-609091d7c772@vondra.me

On 9/3/24 17:06, Robert Haas wrote:
> On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> The one argument to not tie this to max_locks_per_transaction is the
>> vastly different "per element" memory requirements. If you add one entry
>> to max_locks_per_transaction, that adds a LOCK entry, which is a whopping
>> 152B. OTOH one fast-path entry is ~5B, give or take. That's a pretty big
>> difference, and if the locks fit into the shared lock table but you'd
>> like to allow more fast-path locks, having to increase
>> max_locks_per_transaction is not great - pretty wasteful.
>>
>> OTOH I'd really hate to just add another GUC and hope the users will
>> magically know how to set it correctly. That's pretty unlikely, IMO. I
>> myself wouldn't know what a good value is, I think.
>>
>> But say we add a GUC and set it to -1 by default, in which case it just
>> inherits the max_locks_per_transaction value. And then also provide some
>> basic metric about this fast-path cache, so that people can tune this?
>
> All things being equal, I would prefer not to add another GUC for
> this, but we might need it.
>

Agreed.

> Doing some worst case math, suppose somebody has max_connections=1000
> (which is near the upper limit of what I'd consider a sane setting)
> and max_locks_per_transaction=10000 (ditto). The product is 10
> million, so every 10 bytes of storage eats a gigabyte of RAM. Chewing
> up 15GB of RAM when you could have chewed up only 0.5GB certainly
> isn't too great. On the other hand, those values are kind of pushing
> the limits of what is actually sane. If you imagine
> max_locks_per_transaction=2000 rather than
> max_locks_per_transaction=10000, then it's only 3GB and that's
> hopefully not a lot on the hopefully-giant machine where you're
> running this.
>

Yeah, although I don't quite follow the math. With 1000/10000 settings,
why would that eat 15GB of RAM? I mean, that's 1.5GB, right?
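
Just to spell the arithmetic out (using the approximate per-entry sizes
from upthread, so strictly back-of-the-envelope):

#include <stdio.h>

int
main(void)
{
    long    entries = 1000L * 10000L;   /* max_connections * max_locks_per_transaction */

    /* ~152B per shared-table LOCK vs. ~5B per fast-path slot */
    printf("shared lock table: %.2f GB\n", entries * 152.0 / 1e9);  /* ~1.5 GB */
    printf("fast-path array:   %.2f GB\n", entries * 5.0 / 1e9);    /* ~0.05 GB */

    return 0;
}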

FWIW the actual cost is somewhat higher, because we seem to need ~400B
for every lock (not just the ~152B for the LOCK struct itself), at least
based on a quick experiment. (Seems a bit high, right?)
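
If I had to guess where the rest goes, an accounting like this gets into
roughly the right ballpark - but note these are assumed 64-bit struct
sizes and a guessed hash-table overhead, not measurements:

#include <stdio.h>

int
main(void)
{
    /*
     * Assumptions: sizeof(LOCK) ~152B, sizeof(PROCLOCK) ~64B, roughly 24B
     * of hash-table overhead per entry, a PROCLOCK table sized at about
     * twice the lock table, and some safety margin on top.  Illustration
     * only.
     */
    double  lock_entry = 152 + 24;
    double  proclock_entry = 64 + 24;
    double  per_slot = (lock_entry + 2 * proclock_entry) * 1.10;

    printf("approx. shared memory per lock slot: %.0f bytes\n", per_slot);  /* ~390 */

    return 0;
}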

Anyway, I agree this might be acceptable. If your transactions regularly
use this many locks, you probably need to raise max_locks_per_transaction
anyway. If you only need this many locks occasionally (so you can keep
the locks/xact value low), it probably does not matter that much.

And if you're running a massively-partitioned table on a tiny box, well,
I don't really think that's a particularly sane idea.

So I think I'm OK with just tying this to max_locks_per_transaction.
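
To make that concrete, the sizing rule I have in mind is roughly the
following - the 16-slots-per-group constant, the cap and the function
name are placeholders for the sketch, not literally what the patch does:

#define FP_SLOTS_PER_GROUP  16      /* one fast-path "group" holds 16 locks */
#define FP_MAX_GROUPS       1024    /* arbitrary upper bound for the sketch */

static int
fastpath_groups_from_guc(int max_locks_per_transaction)
{
    int     groups = 1;

    /* round up to the next power of two that covers the GUC value */
    while (groups * FP_SLOTS_PER_GROUP < max_locks_per_transaction &&
           groups < FP_MAX_GROUPS)
        groups *= 2;

    return groups;
}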

>> I think just knowing the "hit ratio" would be enough, i.e. counters for
>> how often the lock fits into the fast-path array and how often we had
>> to promote it to the shared lock table, no?
>
> Yeah, probably. I mean, that won't tell you how big it needs to be,
> but it will tell you whether it's big enough.
>

True, but that applies to all "cache hit ratio" metrics (like the one for
shared buffers). It'd be great to have something better - something that
tells you how large the cache actually needs to be. But we don't have
that :-(
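
For illustration, the kind of thing I mean - two counters and a ratio
(the names are made up, nothing like this exists in pg_stat today, and
the real hooks would presumably live somewhere around LockAcquire()):

#include <stdbool.h>
#include <stdint.h>

typedef struct FastPathLockStats
{
    uint64_t    fastpath_acquired;      /* lock fit into the per-backend array */
    uint64_t    promoted_to_shared;     /* had to fall back to the shared lock table */
} FastPathLockStats;

static FastPathLockStats fplock_stats;

static inline void
fplock_count(bool used_fastpath)
{
    if (used_fastpath)
        fplock_stats.fastpath_acquired++;
    else
        fplock_stats.promoted_to_shared++;
}

/* a ratio close to 1.0 means the fast-path array is "big enough" */
static inline double
fplock_hit_ratio(void)
{
    uint64_t    total = fplock_stats.fastpath_acquired + fplock_stats.promoted_to_shared;

    return (total == 0) ? 1.0 : (double) fplock_stats.fastpath_acquired / total;
}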

> I wonder if we should be looking at further improvements in the lock
> manager of some kind. For instance, imagine if we allocated storage
> via DSM or DSA for cases where we need a really large number of Lock
> entries. The downside of that is that we might run out of memory for
> locks at runtime, which would perhaps suck, but you'd probably use
> significantly less memory on average. Or, maybe we need an even bigger
> rethink where we reconsider the idea that we take a separate lock for
> every single partition instead of having some kind of hierarchy-aware
> lock manager. I don't know. But this feels like very old, crufty tech.
> There's probably something more state of the art that we could or
> should be doing.
>

Perhaps. I agree we'll probably need something more radical soon, not
just changes that aim to fix some rare exceptional case (which may be
annoying, but not particularly harmful to the workload as a whole).

For example, doing what you propose might help when only a few
transactions need a lot of locks. I don't mind saving memory in that
case, of course, but is it a problem if those rare cases get a bit
slower? Shouldn't we focus more on the cases where taking many locks is
common, because people are simply going to use partitioning, lots of
indexes, etc.?
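
To be clear about what we'd be comparing, this is roughly how I read the
DSA idea - invented names, and with the shared index (a dshash keyed by
LOCKTAG, say) and all locking omitted, just to show the shape:

#include "postgres.h"

#include "storage/lock.h"
#include "utils/dsa.h"

/* hypothetical overflow entry, handed out when the fixed hash table is full */
typedef struct OverflowLock
{
    LOCKTAG     tag;        /* object being locked */
    LOCKMASK    holdMask;   /* modes currently held */
} OverflowLock;

static OverflowLock *
overflow_lock_alloc(dsa_area *lock_overflow_area, const LOCKTAG *tag)
{
    dsa_pointer dp = dsa_allocate(lock_overflow_area, sizeof(OverflowLock));
    OverflowLock *entry = (OverflowLock *) dsa_get_address(lock_overflow_area, dp);

    entry->tag = *tag;
    entry->holdMask = 0;

    return entry;
}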

So yeah, I agree we probably need a more fundamental rethink. I don't
think we can just keep optimizing the current approach, there's a limit
to how fast it can be. Whether that means not locking individual
partitions, or not locking some indexes, ... I don't know.

regards

--
Tomas Vondra
