Re: scalability bottlenecks with (many) partitions (and more)

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Date: 2024-09-23 10:02:24
Message-ID: CAKZiRmw2F-qMdaZO-jWe6GLNa5ijquuRbNfvMR44Z_HsqtWhew@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 16, 2024 at 4:19 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> On 9/16/24 15:11, Jakub Wartak wrote:
> > On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >
> >> [..]
> >
> >> Anyway, at this point I'm quite happy with this improvement. I didn't
> >> have any clear plan when to commit this, but I'm considering doing so
> >> sometime next week, unless someone objects or asks for some additional
> >> benchmarks etc.
> >
> > Thank you very much for working on this :)
> >
> > The only thing that comes to my mind is that we could blow up the
> > L2 caches. If we are growing PGPROC by 6.3x, that's one or two extra
> > 2MB huge pages at a common max_connections=1000 on x86_64 (830kB ->
> > ~5.1MB), and indeed:
[..]
> > then maybe(?) one could observe further degradation in the
> > dTLB-miss perf-stat counters under some microbenchmark, though
> > measuring that requires isolated physical hardware, and it might be
> > lost in the noise from the context switches themselves. I'm just
> > thinking out loud about what a big PGPROC could cause here. But that
> > is already an unhealthy, non-steady state for the system, so IMHO we
> > are good, unless someone comes up with a better (more evil) idea.
> >
>
> I've been thinking about such cases too, but I don't think it can really
> happen in practice, because:
>
> - How likely is it that the sessions will need a lot of OIDs, but not
> the same ones? And why would it matter that the OIDs are not the same?
> I don't think it does, unless one of the sessions needs an exclusive
> lock, at which point the optimization doesn't really matter.
>
> - If having more fast-path slots means it doesn't fit into L2 cache,
> would we fit into L2 without it? I don't think so - if there really are
> that many locks, we'd have to add those into the shared lock table, and
> there's a lot of extra stuff to keep in memory (relcaches, ...).
>
> This is pretty much one of the cases I focused on in my benchmarking,
> and I have yet to see any regression.

Sorry for answering this so late. Just for context: I was imagining a
scenario with a high max_connections and schema-based multi-tenancy,
but no partitioning (so everything would be fine without this
$thread/commit, as fewer than 16 fast-path locks would be taken). The
OIDs need to be different to avoid lock contention, so that futex()
never really ends up in the syscall (only the user-space part runs).
My theory was that the much smaller PGPROC without the patch should
incur far fewer (data) cache-line fetches than with the patch: the
hash() % prime lookup hits various parts of a larger array, so without
the patch it should be quicker, as it wouldn't be randomly hitting
that larger array[]. But it might be noise, as you state. It was a
theoretical attempt at crafting the worst possible conditions for the
patch, so feel free to disregard it, as it already assumes an
anti-pattern (big and all-active max_connections).
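
To make the cache-line theory a bit more concrete, here is a minimal
sketch of what I mean (this is not the actual lock-manager code; the
struct layout, the hash, and the constants are illustrative
assumptions only):

#include <stdint.h>

/* Hypothetical layout: 16 fast-path slots per group, with the group
 * count scaled up by the patch.  More groups means the same OID set
 * scatters over a larger array in PGPROC. */
#define FP_LOCK_SLOTS_PER_GROUP 16

typedef struct FastPathGroup
{
    uint32_t relids[FP_LOCK_SLOTS_PER_GROUP]; /* relation OIDs */
    uint64_t bits;                            /* per-slot lock-mode bits */
} FastPathGroup;

/*
 * Map a relation OID to one of nGroups groups.  With nGroups == 1
 * (roughly the pre-patch situation) every lookup touches the same few
 * cache lines; with a larger nGroups, consecutive lookups for distinct
 * OIDs can each pull in a cache line the previous lookup didn't touch.
 */
static inline uint32_t
fp_group_for_rel(uint32_t reloid, uint32_t nGroups)
{
    uint32_t h = reloid * 0x9E3779B9u; /* cheap stand-in for a real hash */

    return h % nGroups;
}

That's the sketch; whether those extra cache-line fetches are even
measurable outside a microbenchmark is exactly what I couldn't say.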

> > Well, the only thing I could think of was to add to the
> > "max_locks_per_transaction" GUC in doc/src/sgml/config.sgml that "it
> > is also used as an advisory value for the number of groups in the
> > lock manager's fast-path implementation" (without going into further
> > detail; even the pg_locks discussion in
> > doc/src/sgml/system-views.sgml simply uses that term).
> >
>
> Thanks, I'll consider mentioning this in max_locks_per_transaction.
> Also, I think there's a place calculating the amount of per-connection
> memory, so maybe that needs to be updated too.
>
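
If the docs do get such a note, the sizing rule as I understand it
(this is my reading of the patch, so the exact rounding and cap may
differ) is roughly:

/* Sketch only: round max_locks_per_transaction up to a power-of-two
 * number of 16-slot fast-path groups, with a floor of one group and
 * some hard cap. */
static uint32_t
fp_lock_groups(uint32_t max_locks_per_xact)
{
    uint32_t groups = 1;

    while (groups * 16 < max_locks_per_xact && groups < 1024)
        groups *= 2;
    return groups;
}

So with the default max_locks_per_transaction=64 that would give 4
groups (64 fast-path slots) instead of the single pre-patch array of
16.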

I couldn't find it in the current versions, but maybe this is
helpful/reaffirming:
- up to 9.2 the docs gave an exact formula, see "(1800 + 270 *
max_locks_per_transaction) * max_connections" [1] (quick arithmetic
below); that's long gone now.
- if anything, Andres might want to update his blog entry [2] a little
(my take is that it seems to be the most accurate and authoritative
technical information we have online).
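
Just as an illustration of how big those numbers get (the 9.2 formula
is obviously stale, so this is only ballpark):

    (1800 + 270 * 64) * 1000 = 19,080,000 bytes ~= 19 MB

i.e. roughly 19 MB of shared memory for the lock table alone at
max_connections=1000 with the default max_locks_per_transaction=64.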

-J.

[1] - https://www.postgresql.org/docs/9.2/kernel-resources.html
[2] - https://blog.anarazel.de/2020/10/07/measuring-the-memory-overhead-of-a-postgres-connection/
