Re: scalability bottlenecks with (many) partitions (and more)

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Date: 2024-09-02 17:46:21
Message-ID: b8c43eda-0c3f-4cb4-809b-841fa5c40ada@vondra.me
Lists: pgsql-hackers

On 9/2/24 01:53, Robert Haas wrote:
> On Sun, Sep 1, 2024 at 3:30 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> I don't think that's possible with hard-coded size of the array - that
>> allocates the memory for everyone. We'd need to make it variable-length,
>> and while doing those benchmarks I think we actually already have a GUC
>> for that - max_locks_per_transaction tells us exactly what we need to
>> know, right? I mean, if I know I'll need ~1000 locks, why not to make
>> the fast-path array large enough for that?
>
> I really like this idea. I'm not sure about exactly how many fast path
> slots you should get for what value of max_locks_per_transaction, but
> coupling the two things together in some way sounds smart.
>

I think we should keep that simple and make the cache large enough for
max_locks_per_transaction locks. That's the best information about
expected number of locks we have. If the GUC is left at the default
value, that probably means the backends need that many locks on
average. Yes, maybe there's an occasional spike in one of the backends,
but then that means other backends need fewer locks, and so there's less
contention for the shared lock table.

Of course, it's possible to construct counter-examples to this. Say a
single backend that needs a lot of these locks. But how's that different
from every other fixed-size cache with eviction?

The one argument against tying this to max_locks_per_transaction is the
vastly different "per element" memory requirements. If you add one entry
to max_locks_per_transaction, that adds a LOCK, which is a whopping 152B.
OTOH one fast-path entry is ~5B, give or take. That's a pretty big
difference, and if the locks fit into the shared lock table but you'd
like to allow more fast-path locks, having to increase
max_locks_per_transaction is not great - pretty wasteful.
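
Just as a back-of-the-envelope illustration (ignoring PROCLOCK and other
per-lock overhead): with 100 backends, raising max_locks_per_transaction
from 64 to 1024 grows the shared lock table by roughly
(1024 - 64) * 100 * 152B ~= 14MB, while the fast-path arrays would only
grow by about (1024 - 64) * 100 * 5B ~= 0.5MB.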

OTOH I'd really hate to just add another GUC and hope the users will
magically know how to set it correctly. That's pretty unlikely, IMO - I
wouldn't know what a good value is myself.

But say we add a GUC and set it to -1 by default, in which case it just
inherits the max_locks_per_transaction value. And then also provide some
basic metric about this fast-path cache, so that people can tune this?
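
To be concrete, the -1 behavior I have in mind is something like this
(just a sketch, the new GUC name is made up):

extern int max_locks_per_transaction;   /* existing GUC */

int fast_path_lock_slots = -1;          /* hypothetical new GUC, -1 = inherit */

static int
effective_fast_path_slots(void)
{
    return (fast_path_lock_slots < 0) ? max_locks_per_transaction
                                      : fast_path_lock_slots;
}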

I think just knowing the "hit ratio" would be enough, i.e. counters for
how often the lock fits into the fast-path array and how often we had to
promote it to the shared lock table, no?
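
The counters could be as simple as this (hypothetical names, uint64_t
just for the sketch, not actual patch code):

#include <stdint.h>

typedef struct FastPathLockStats
{
    uint64_t    fp_hits;        /* lock fit into the fast-path array */
    uint64_t    fp_overflows;   /* had to use the shared lock table */
} FastPathLockStats;

static double
fastpath_hit_ratio(const FastPathLockStats *stats)
{
    uint64_t    total = stats->fp_hits + stats->fp_overflows;

    return (total == 0) ? 1.0 : (double) stats->fp_hits / total;
}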

>> Of course, the consequence of this would be making PGPROC variable
>> length, or having to point to a memory allocated separately (I prefer
>> the latter option, I think). I haven't done any experiments, but it
>> seems fairly doable - of course, not sure if it might be more expensive
>> compared to compile-time constants.
>
> I agree that this is a potential problem but it sounds like the idea
> works well enough that we'd probably still come out quite far ahead
> even with a bit more overhead.
>

OK, I did some quick tests on this, and I don't see any regressions.

Attached are 4 patches:

1) 0001 - original patch, with some minor fixes (remove init, which is
not necessary, that sort of thing)

2) 0002 - a bit of reworks, improving comments, structuring the macros a
little bit better, etc. But still compile-time constants.

3) 0003 - dynamic sizing, based on max_locks_per_transaction. It's a bit
ugly, because the size is calculated during shmem allocation - it should
happen earlier, but it's good enough for a PoC. (See the sketch after
this list for the rough idea.)

4) 0004 - introduce a separate GUC, this is mostly to allow testing of
different values without changing max_locks_per_transaction
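
FWIW the rough idea behind 0003 (mentioned in the item above) is that
PGPROC keeps only pointers, and the fast-path arrays live in a separate
shmem chunk sized from max_locks_per_transaction. A sketch with made-up
names, not the actual patch code:

#include <stdint.h>

/* sketch: the fast-path arrays live outside PGPROC, which just points here */
typedef struct FastPathArrays
{
    uint64_t   *fpLockBits;     /* lock-mode bits for the fast-path slots */
    uint32_t   *fpRelId;        /* relation OIDs, one per slot */
} FastPathArrays;

/* hypothetical sizing, done once while allocating shared memory */
static int
FastPathSlotsPerBackend(int max_locks_per_xact)
{
    /* never go below the current compile-time constant of 16 slots */
    return (max_locks_per_xact > 16) ? max_locks_per_xact : 16;
}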

I've only done that on my smaller 32-core machine, but for three simple
tests it looks like this (throughput using 16 clients):

mode      test        master       1       2       3       4
------------------------------------------------------------
prepared  count         1460    1477    1488    1490    1491
          join         15556   24451   26044   25026   24237
          pgbench     148187  151192  151688  150389  152681
------------------------------------------------------------
simple    count         1341    1351    1373    1374    1370
          join          4643    5439    5459    5393    5345
          pgbench     139763  141267  142796  141207  142600

Those are some simple benchmarks on 100 partitions, where the regular
pgbench and count(*) are not expected to improve, and the join is the
partitioned join this thread started with. Columns 1-4 correspond to the
attached patches, to show the impact of each of them.

Translated to results relative to master:

mode      test             1       2       3       4
-----------------------------------------------------
prepared  count         101%    102%    102%    102%
          join          157%    167%    161%    156%
          pgbench       102%    102%    101%    103%
-----------------------------------------------------
simple    count         101%    102%    102%    102%
          join          117%    118%    116%    115%
          pgbench       101%    102%    101%    102%

So pretty much no difference between the patches. A bit of noise, but
that's what I'd expect on this machine.

I'll do more testing on the big EPYC machine once it becomes available,
but from these results it seems pretty promising.

regards

--
Tomas Vondra

Attachment Content-Type Size
v20240902-0001-v1.patch text/x-patch 10.5 KB
v20240902-0002-rework.patch text/x-patch 11.1 KB
v20240902-0003-drive-this-by-max_locks_per_transaction.patch text/x-patch 9.7 KB
v20240902-0004-separate-guc-to-allow-benchmarking.patch text/x-patch 2.1 KB
