From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving connection scalability (src/backend/storage/ipc/procarray.c)
Date: 2022-05-28 12:35:00
Message-ID: 3c1eb4e2-18ec-f0cf-c1dc-6eb8a6e70e31@enterprisedb.com
Lists: pgsql-hackers
On 5/28/22 02:36, Ranier Vilela wrote:
> Em sex., 27 de mai. de 2022 às 18:22, Andres Freund <andres(at)anarazel(dot)de
> <mailto:andres(at)anarazel(dot)de>> escreveu:
>
> Hi,
>
> On 2022-05-27 10:35:08 -0300, Ranier Vilela wrote:
> > Em qui., 26 de mai. de 2022 às 22:30, Tomas Vondra <
> > tomas(dot)vondra(at)enterprisedb(dot)com
> <mailto:tomas(dot)vondra(at)enterprisedb(dot)com>> escreveu:
> >
> > > On 5/27/22 02:11, Ranier Vilela wrote:
> > > >
> > > > ...
> > > >
> > > > Here the results with -T 60:
> > >
> > > Might be a good idea to share your analysis / interpretation of the
> > > results, not just the raw data. After all, the change is being
> proposed
> > > by you, so do you think this shows the change is beneficial?
> > >
> > I think so, but my expectations have diminished.
> > I expected that more connections would mean better performance, but for
> > both patch and head that doesn't happen in the tests: performance
> > degrades with a greater number of connections.
>
> Your system has four CPUs. Once they're all busy, adding more
> connections
> won't improve performance. It'll just add more and more context
> switching,
> cache misses, and make the OS scheduler do more work.
>
> conns tps head
> 10 82365.634750
> 50 74593.714180
> 80 69219.756038
> 90 67419.574189
> 100 66613.771701
> Yes, it is quite disappointing that with 100 connections the tps is lower
> than with 10 connections.
>
IMO that's entirely expected on a system with only 4 cores. Increasing
the number of connections inevitably means more overhead (you have to
track/manage more stuff). And at some point the backends start competing
for L2/L3 caches, context switches are not free either, etc. So once you
cross ~2-3x the number of cores, you should expect this.
This behavior is natural/inherent, it's unlikely to go away, and it's one of
the reasons why we recommend not using too many connections. If you're trying
to maximize throughput, just don't do that. Or use a machine with more cores.
>
> > GetSnapshotData() isn't a bottleneck?
>
> I'd be surprised if it showed up in a profile on your machine with that
> workload in any sort of meaningful way. The snapshot reuse logic
> will always
> work - because there are no writes - and thus the only work that
> needs to be
> done is to acquire the ProcArrayLock briefly. And because there is
> only a
> small number of cores, contention on the cacheline for that isn't a
> problem.
>
> Thanks for sharing this.
>
>
>
>
> > > These results look much saner, but IMHO it also does not show
> any clear
> > > benefit of the patch. Or are you still claiming there is a benefit?
> > >
> > We agree that they are micro-optimizations. However, I think they should
> > be considered micro-optimizations in inner loops, because everything in
> > procarray.c is a hot path.
>
> As explained earlier, I don't agree that they optimize anything - you're
> making some of the scalability behaviour *worse*, if it's changed at
> all.
>
>
> > The first objective, I believe, was achieved: no performance regression.
> > I agree the gains are small, based on the tests done.
>
> There are no gains.
>
> IMHO, I must disagree.
>
You don't have to, really. What you should do is show results demonstrating
the claimed gains, and so far you have not done that.
I don't want to be rude, but so far you've shown results from a benchmark
that effectively tested fork(), because it ran only 10 transactions per
client, and then results from a single run per client count (which doesn't
show any gains either, and is far too noisy).
As mentioned, GetSnapshotData() does not even show up in the perf profile,
so why would the patch make any difference?
You've also claimed it helps generate better code on older compilers, but
you've never supported that with any evidence.
Maybe there is an improvement - show us. Do a benchmark with more runs, to
average out the noise. Calculate the variance/standard deviation to show how
variable the results are. Use that to compare the results and decide whether
there is an improvement. Also, keep in mind that binary layout matters [1].
[1] https://www.youtube.com/watch?v=r-TLSBdHe1A
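To be concrete, this is roughly the kind of summary I have in mind - just a
sketch in C, and the five tps values are hypothetical placeholders, not real
measurements; you'd plug in your actual per-run numbers for one client count,
once for head and once for the patched build, and then compare the means in
light of the stddev:

/*
 * Sketch: mean and sample standard deviation of repeated pgbench runs at a
 * single client count. The tps values below are hypothetical placeholders,
 * to be replaced with the real per-run numbers (head vs. patched).
 */
#include <math.h>
#include <stdio.h>

int
main(void)
{
	double		tps[] = {66613.77, 66201.35, 67050.12, 66482.90, 66790.44};
	int			n = sizeof(tps) / sizeof(tps[0]);
	double		sum = 0.0,
				mean,
				var = 0.0;

	for (int i = 0; i < n; i++)
		sum += tps[i];
	mean = sum / n;

	for (int i = 0; i < n; i++)
		var += (tps[i] - mean) * (tps[i] - mean);
	var /= (n - 1);				/* sample variance */

	printf("mean = %.2f tps, stddev = %.2f tps\n", mean, sqrt(var));
	return 0;
}

If the difference between the head and patched means is within a stddev or
two, there's no demonstrated gain.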
>
>
>
> > But, IMHO, this is a good way, small gains turn into big gains in
> the end,
> > when applied to all code.
> >
> > Consider GetSnapshotData()
> > 1. Most of the time the snapshot is not null, so the test
> > if (snapshot == NULL) will be false most of the time.
> >
> > With the patch:
> > if (snapshot->xip != NULL)
> > {
> > if (GetSnapshotDataReuse(snapshot))
> > return snapshot;
> > }
> >
> > Most of the time the test is true, and GetSnapshotDataReuse is not called
> > for new snapshots. count, subcount and suboverflowed are not initialized
> > for every snapshot.
>
> But that's irrelevant. There are only a few "new" snapshots in the life of
> a connection. You're optimizing something irrelevant.
>
> IMHO, when GetSnapshotData() is the bottleneck, everything is relevant.
>
Maybe. Show us the difference.
>
>
>
> > 2. If the snapshot is taken during recovery,
> > the pgprocnos and ProcGlobal->subxidStates are not touched
> > unnecessarily.
>
> That code isn't reached when in recovery?
>
> Currently it is reached *even* when in recovery.
> With the patch, it is *only* reached when not in recovery.
>
>
>
> > 3. Calling GetSnapshotDataReuse() without first acquiring
> ProcArrayLock.
> > There's an agreement that this would be fine, for now.
>
> There's no such agreement at all. It's not correct.
>
> Ok, but there is a chance it will work correctly.
>
Either it's correct or it's not. A chance of being correct does not count.
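To illustrate why "a chance it will work" doesn't count - this is a generic
sketch, not procarray.c: the writer maintains an invariant and always updates
both fields under the lock, yet a reader that skips the lock can still
observe a combination of values that never existed together.

/*
 * Generic sketch (not procarray.c). The writer keeps xmax == xmin + 1 and
 * updates both fields only while holding the lock. The reader skips the
 * lock, which is a data race, so it can see xmin from one update and xmax
 * from another. "volatile" is only there to stop the compiler from hoisting
 * the unlocked reads out of the loop - it does not make the race safe.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile unsigned long xmin = 0;
static volatile unsigned long xmax = 1;

static void *
writer(void *arg)
{
	for (unsigned long i = 1; i <= 20000000; i++)
	{
		pthread_mutex_lock(&lock);
		xmin = i;
		xmax = i + 1;
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

int
main(void)
{
	pthread_t	t;
	unsigned long torn = 0;

	pthread_create(&t, NULL, writer, NULL);

	/* unlocked reads: occasionally observe an inconsistent pair */
	for (long i = 0; i < 20000000; i++)
	{
		if (xmax != xmin + 1)
			torn++;
	}

	pthread_join(t, NULL);
	printf("inconsistent pairs observed: %lu\n", torn);
	return 0;
}

Whether the unlocked variant happens to survive a particular benchmark tells
us nothing about its correctness.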
>
>
> > Consider ComputeXidHorizons()
> > 1. ProcGlobal->statusFlags is touched before the lock.
>
> Hard to believe that'd have a measurable effect.
>
> IMHO, anything you take out of the lock is a benefit.
>
Maybe. Show us the difference.
>
>
>
> > 2. allStatusFlags[index] is not read for every one of the numProcs entries.
>
> I'd be surprised if the compiler couldn't defer that load on its own.
>
> Better be sure of that, no?
>
We rely on compilers doing this in about a million other places.
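For illustration, a sketch of the pattern we're talking about (hypothetical
code, not the actual ComputeXidHorizons()): the flags load only feeds the
second half of the condition, so the compiler is free to defer (sink) it
behind the cheap check on its own, without restructuring the source.

/*
 * Hypothetical sketch, not ComputeXidHorizons(): the load of flags[i] is
 * only needed when ids[i] == target, so the compiler may sink that load
 * into the branch by itself.
 */
#include <stddef.h>
#include <stdint.h>

int
count_matching(const int *ids, const uint8_t *flags, size_t nprocs, int target)
{
	int			count = 0;

	for (size_t i = 0; i < nprocs; i++)
	{
		uint8_t		f = flags[i];	/* load the compiler can defer */

		if (ids[i] == target && (f & 0x01))
			count++;
	}

	return count;
}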
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company