Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Matt Smiley <msmiley(at)gitlab(dot)com>
Cc: Nikolay Samokhvalov <nik(at)postgres(dot)ai>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
Date: 2023-08-06 14:44:50
Message-ID: 52dc1972-e3cc-ec26-8b88-c7a7bbce9044@enterprisedb.com
Lists: pgsql-hackers

On 8/3/23 22:39, Tomas Vondra wrote:
> On 8/3/23 01:51, Matt Smiley wrote:
>> I thought it might be helpful to share some more details from one of the
>> case studies behind Nik's suggestion.
>>
>> Bursty contention on lock_manager lwlocks recently became a recurring
>> cause of query throughput drops for GitLab.com, and we got to study the
>> behavior via USDT and uprobe instrumentation along with more
>> conventional observations (see
>> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301).
>> This turned up some interesting finds, and I thought sharing some of
>> that research might be helpful.
>>
>
> The analysis in the linked gitlab issue is pretty amazing. I wasn't
> planning to argue against the findings anyway, but plenty of data
> supporting the conclusions is good.
>
> I'm not an expert on locking, so some of the stuff I say may be
> trivially obvious - it's just me thinking about ...
>
> I wonder what's the rough configuration of those systems, though. Both
> the hardware and PostgreSQL side. How many cores / connections, etc.?
>
>
>> Results so far suggest that increasing FP_LOCK_SLOTS_PER_BACKEND would
>> have a much larger positive impact than any other mitigation strategy we
>> have evaluated.  Rather than reducing hold duration or collision rate,
>> adding fastpath slots reduces the frequency of even having to acquire
>> those lock_manager lwlocks.  I suspect this would be helpful for many
>> other workloads, particularly those having high frequency queries whose
>> tables and indexes collectively number more than about 16.
>>
>
> Yes, I agree with that. Partitioning makes this issue worse, I guess.
> Schemas with indexes on every column are disturbingly common these days
> too, and they hit the issue as well ...
>
>> Lowering the lock_manager lwlock acquisition rate means lowering its
>> contention rate (and probably also its contention duration, since
>> exclusive mode forces concurrent lockers to queue).
>>
>> I'm confident this would help our workload, and I strongly suspect it
>> would be generally helpful by letting queries use fastpath locking more
>> often.
>>
>
> OK
>
>>> However, the lmgr/README says this is meant to alleviate contention on
>>> the lmgr partition locks. Wouldn't it be better to increase the number
>>> of those locks, without touching the PGPROC stuff?
>>
>> That was my first thought too, but growing the lock_manager lwlock
>> tranche isn't nearly as helpful.
>>
>> On the slowpath, each relation's lock tag deterministically hashes onto
>> a specific lock_manager lwlock, so growing the number of lock_manager
>> lwlocks just makes it less likely for two or more frequently locked
>> relations to hash onto the same lock_manager.
>>
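
(For illustration, a minimal standalone sketch of that deterministic
mapping, using a toy hash in place of LockTagHashCode(); the real code in
lock.c/lock.h takes the lock tag's hash modulo NUM_LOCK_PARTITIONS = 16
to pick the partition lwlock.)

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_LOCK_PARTITIONS 16   /* 1 << LOG2_NUM_LOCK_PARTITIONS */

    /* Toy stand-in; PostgreSQL hashes the full LOCKTAG via LockTagHashCode(). */
    static uint32_t
    toy_locktag_hash(uint32_t dboid, uint32_t reloid)
    {
        return dboid * 0x9E3779B1u ^ reloid * 0x85EBCA77u;
    }

    int
    main(void)
    {
        uint32_t reloids[] = {16384, 16385, 16401, 24576};

        /* Every backend locking the same relation hits the same partition. */
        for (int i = 0; i < 4; i++)
        {
            uint32_t part = toy_locktag_hash(5, reloids[i]) % NUM_LOCK_PARTITIONS;
            printf("relation %u -> lock_manager partition %u\n",
                   (unsigned) reloids[i], (unsigned) part);
        }
        return 0;
    }
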
>
> Hmmm, so if we have a query that joins 16 tables, or a couple tables
> with indexes, all backends running this will acquire exactly the same
> partition locks. And we're likely acquiring them in exactly the same
> order (because we lock objects in a consistent order to avoid deadlocks),
> making the issue worse.
>
>> In contrast, growing the number of fastpath slots completely avoids
>> calls to the slowpath (i.e. no need to acquire a lock_manager lwlock).
>>
>> The saturation condition we'd like to solve is heavy contention on one
>> or more of the lock_manager lwlocks.  Since that is driven by the
>> slowpath acquisition rate of heavyweight locks, avoiding the slowpath is
>> better than just moderately reducing the contention on the slowpath.
>>
>> To be fair, increasing the number of lock_manager locks definitely can
>> help to a certain extent, but it doesn't cover an important general
>> case.  As a thought experiment, suppose we increase the lock_manager
>> tranche to some arbitrarily large size, larger than the number of
>> relations in the db.  This unrealistically large size means we have the
>> best case for avoiding collisions -- each relation maps uniquely onto
>> its own lock_manager lwlock.  That helps a lot in the case where the
>> workload is spread among many non-overlapping sets of relations.  But it
>> doesn't help a workload where any one table is accessed frequently via
>> slowpath locking.
>>
>
> Understood.
>
>> Continuing the thought experiment, if that frequently queried table has
>> 16 or more indexes, or if it is joined to other tables that collectively
>> add up to over 16 relations, then each of those queries is guaranteed to
>> have to use the slowpath and acquire the deterministically associated
>> lock_manager lwlocks.
>>
>> So growing the tranche of lock_manager lwlocks would help for some
>> workloads, while other workloads would not be helped much at all.  (As a
>> concrete example, a workload at GitLab has several frequently queried
>> tables with over 16 indexes that consequently always use at least some
>> slowpath locks.)
>>
>> For additional context:
>>
>> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#what-influences-lock_manager-lwlock-acquisition-rate
>> Summarizes the pathology and its current mitigations.
>>
>> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678
>> Documents the supporting research methodology.
>>
>> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365370510
>> What code paths require an exclusive mode lwlock for lock_manager?
>>
>> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142
>> Comparison of fastpath vs. slowpath locking, including quantifying the
>> rate difference.
>>
>> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365630726
>> Confirms the acquisition rate of lock_manager locks is not uniform.  The
>> sampled workload has a 3x difference in the most vs. least frequently
>> acquired lock_manager lock, corresponding to the workload's most
>> frequently accessed relations.
>>
>
> Those are pretty great pieces of information. I wonder if some of the
> measurements may be affecting the observation (by consuming too much
> CPU, making the contention worse), but overall it seems convincing.
>
> Would it be difficult to sample just a small fraction of the calls? Say,
> 1%, to get good histograms/estimates with acceptable CPU usage.
>
> In any case, it's a great source of information to reproduce the issue
> and evaluate possible fixes.
>
>>> Well, that has a cost too, as it makes PGPROC larger, right? At the
>>> moment that struct is already ~880B / 14 cachelines, adding 48 OIDs
>>> would make it +192B / +3 cachelines. I doubt that would leave other
>>> common workloads unaffected ...
>>
>> That's true; growing the data structure may affect L2/L3 cache hit rates
>> when touching PGPROC.  Is that cost worth the benefit of using fastpath
>> for a higher percentage of table locks?  The answer may be workload- and
>> platform-specific.  Exposing this as a GUC gives the admin a way to make
>> a different choice if our default (currently 16) is bad for them.
>>
>
> After looking at the code etc. I think the main trade-off here is going
> to be the cost of searching the fpRelId array. At the moment it's
> searched linearly, which is cheap for 16 locks. But at some point it'll
> become as expensive as updating the slowpath, and the question is when.
>
> I wonder if we could switch to a more elaborate strategy if the number
> of locks is high enough. Say, a hash table, or some hybrid approach.
>
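
For reference, roughly the shape of that linear scan (a simplified
stand-in for FastPathGrantRelationLock() in lock.c, not the actual code;
here a zero OID marks a free slot, whereas the real code checks the
slot's lock-mode bits):

    #include <stdint.h>

    #define FP_LOCK_SLOTS_PER_BACKEND 16
    typedef uint32_t Oid;

    /* Return the slot already holding relid, else the first free slot,
     * else -1 when all slots are taken and the lock must use the slowpath. */
    static int
    fastpath_find_slot(const Oid *fpRelId, Oid relid)
    {
        int free_slot = -1;

        for (int f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
        {
            if (fpRelId[f] == relid)
                return f;            /* relation already has a slot */
            if (fpRelId[f] == 0 && free_slot < 0)
                free_slot = f;       /* remember the first free slot */
        }
        return free_slot;
    }
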
>> I share your reluctance to add another low-level tunable, but like many
>> other GUCs, having a generally reasonable default that can be adjusted
>> is better than forcing folks to fork postgres to adjust a compile-time
>> constant.  And unfortunately I don't see a better way to solve this
>> problem.  Growing the lock_manager lwlock tranche isn't as effective,
>> because it doesn't help workloads where one or more relations are locked
>> frequently enough to hit this saturation point.
>>
>
> I understand. I have two concerns:
>
> 1) How would users know they need to tune this, and how would they
> determine the right value for their system?
>
> 2) Having to deal with misconfigured systems as people tend to blindly
> tune everything to 100x the default, because more is better :-(
>
>
>> Handling a larger percentage of heavyweight lock acquisitions via
>> fastpath instead of slowpath seems likely to help many high-throughput
>> workloads, since it avoids having to exclusively acquire an lwlock.  It
>> seemed like the least intrusive general-purpose solution we've come up
>> with so far.  That's why we wanted to solicit feedback or new ideas from
>> the community.  Currently, the only options folks have to solve this
>> class of saturation are through some combination of schema changes,
>> application changes, vertical scaling, and spreading the query rate
>> among more postgres instances.  Those options are not always feasible or
>> efficient.  Lacking a better solution, exposing a GUC that rarely needs
>> tuning seems reasonable to me.
>>
>> Anyway, hopefully the extra context is helpful!  Please do share your
>> thoughts.
>>
>
> Absolutely! I think the next step for me is to go through the analysis
> again, and try to construct a couple of different workloads hitting this
> in some way.
>

FWIW I made some progress on this - I think I managed to reproduce the
issue on a synthetic workload (with a partitioned table, using a variable
number of partitions / indexes). It's hard to say for sure how serious
the reproduced cases are, but I do see spikes of lock_manager wait
events, and so on.

That however brings me to the second step - I was planning to increase
the FP_LOCK_SLOTS_PER_BACKEND value and see how much it helps, and also
measure some of the negative impact, to get a better idea what the
trade-offs are.

But I quickly realized it's far more complicated than just increasing
the define. The thing is, it's not enough to make fpRelId larger; there's
also fpLockBits, tracking additional info about the locks (lock mode
etc.). FP_LOCK_SLOTS_PER_BACKEND does not affect that field at all - it
just happens that a single 64-bit integer is enough to store the bits
for 16 fastpath locks.
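
For reference, the layout is roughly this (paraphrased from proc.h and
lock.c with simplified types, so treat it as a sketch rather than the
actual definitions):

    #include <stdint.h>

    #define FP_LOCK_SLOTS_PER_BACKEND 16
    typedef uint32_t Oid;

    /* Paraphrase of the fast-path fields in PGPROC: */
    typedef struct FastPathLockState
    {
        uint64_t fpLockBits;                          /* lock-mode bits, 3 per slot */
        Oid      fpRelId[FP_LOCK_SLOTS_PER_BACKEND];  /* relation OID per slot */
    } FastPathLockState;

    /* Paraphrase of the packing macros in lock.c: */
    #define FAST_PATH_BITS_PER_SLOT 3
    #define FAST_PATH_MASK          ((1 << FAST_PATH_BITS_PER_SLOT) - 1)
    #define FAST_PATH_GET_BITS(fp, n) \
        (((fp)->fpLockBits >> (FAST_PATH_BITS_PER_SLOT * (n))) & FAST_PATH_MASK)

    /*
     * 16 slots x 3 bits = 48 bits, so one 64-bit word is enough today, but
     * 22 slots x 3 bits = 66 bits no longer fit - which is why bumping
     * FP_LOCK_SLOTS_PER_BACKEND alone breaks unless fpLockBits grows too.
     */
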

Note: This means one of the "mitigations" in the analysis (just rebuild
postgres with custom FP_LOCK_SLOTS_PER_BACKEND value) won't work.

I tried to fix that in a naive way (replacing it with an int64 array,
with one value per 16 locks), but I must be missing something, as there
are locking failures.
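
Purely to illustrate the shape of such a generalization (a hypothetical
sketch with made-up or modified names, not the patch that is misbehaving
above): one 64-bit word per group of 16 slots, with the packing macros
indexing into the array. Every reader and writer of fpLockBits in lock.c
would need the same treatment, and keeping 16 slots per word means no
slot's bits ever straddle a word boundary.

    #include <stdint.h>

    #define FP_LOCK_SLOTS_PER_BACKEND  64   /* example value, not a proposed default */
    #define FP_LOCK_SLOTS_PER_GROUP    16
    #define FP_LOCK_GROUPS \
        ((FP_LOCK_SLOTS_PER_BACKEND + FP_LOCK_SLOTS_PER_GROUP - 1) / \
         FP_LOCK_SLOTS_PER_GROUP)

    /* fpLockBits becomes an array, one word per group of 16 slots: */
    typedef struct FastPathLockBits
    {
        uint64_t fpLockBits[FP_LOCK_GROUPS];
    } FastPathLockBits;

    #define FAST_PATH_BITS_PER_SLOT 3
    #define FAST_PATH_MASK          ((1 << FAST_PATH_BITS_PER_SLOT) - 1)
    #define FAST_PATH_SLOT_GROUP(n) ((n) / FP_LOCK_SLOTS_PER_GROUP)
    #define FAST_PATH_SLOT_SHIFT(n) \
        (FAST_PATH_BITS_PER_SLOT * ((n) % FP_LOCK_SLOTS_PER_GROUP))

    #define FAST_PATH_GET_BITS(fp, n) \
        (((fp)->fpLockBits[FAST_PATH_SLOT_GROUP(n)] >> FAST_PATH_SLOT_SHIFT(n)) & \
         FAST_PATH_MASK)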

I'm not sure I'll have time to hack on this soon, but if someone else
wants to take a stab at it and produce a minimal patch, I might be able
to run more tests on it.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
