Re: scalability bottlenecks with (many) partitions (and more)

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Date: 2024-09-06 11:56:58
Message-ID: CAKZiRmwGnYuJiugr26+F6=C2m60Gg=Ekb40E5FVaPpdoNuP7+A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 5, 2024 at 7:33 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:

>>> My $0.02: the originating case that triggered those patches actually
>>> started with LWLock/lock_manager waits being the top #1 wait event. The
>>> operator can cross-check (join) that with pg_locks grouped by fastpath
>>> (='f'), count(*). So, IMHO we have good observability in this case
>>> (a rare thing to say!)
>>>
>>
>> That's a good point. So if you had to give users some instructions on
>> what to measure / monitor, and how to adjust the GUC based on that, what
>> would your instructions be?
>>
>
> After thinking about this a bit more, I'm actually wondering if this
> source of information is sufficient. Firstly, it's just a snapshot of a
> single instance, and it's not quite trivial to get a summary for a
> longer time period (people would have to sample it often enough, etc.).
> Doable, but much less convenient than cumulative counters.

OK, so answering the previous question:

Probably just monitor pg_stat_activity (grouped by wait event, count(*))
together with pg_locks grouped by pid and fastpath. Even simple
observation with \watch 0.1 is good enough in practice to confirm or
rule out the root cause, even for short bursts while they are happening.
If you deploy monitoring for a longer time (with, say, a 1s sample
interval), sooner or later you get the __high water mark__ of
non-fast-path locks held by the affected PIDs, and as a starting point
you could allow that many fastpath slots (or double the amount).
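As a minimal sketch of that sampling (the exact GROUP BY columns are
just one reasonable choice; run both under \watch 0.1 in psql):

-- top wait events across active backends
SELECT wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
 WHERE state = 'active'
 GROUP BY 1, 2
 ORDER BY 3 DESC;

-- per-backend split of fast-path vs. regular relation locks
SELECT pid, fastpath, count(*)
  FROM pg_locks
 WHERE locktype = 'relation'
 GROUP BY 1, 2
 ORDER BY 3 DESC;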

> But for the sampling, doesn't this produce skewed data? Imagine you have
> a workload with very short queries (which is when fast-path matters), so
> you're likely to see the backend while it's obtaining the locks. If the
> fast-path locks are much faster to acquire (kinda the whole point), we're
> more likely to see the backend while it's obtaining the regular locks.
>
> Let's say the backend needs 1000 locks, and 500 of those fit into the
> fast-path array. We're likely to see the 500 fast-path locks already
> acquired, and a random fraction of the 500 non-fast-path locks. So in
> the end you'll see backends needing 500 fast-path locks and 250 regular
> locks. That doesn't seem terrible, but I guess the difference can be
> made even larger.

... it doesn't need to be perfect data to act on, right? We may just
need the information that it is happening (and we do have that). Maybe
it's too pragmatic a point of view, but wasting some bits of memory on
this, while still being able to control how much it allocates in the
end, is a much better situation than today, where we have no control
and waste crazy CPU time on all those futex() syscalls and context
switches.

Another angle is that if you see the SQL causing it, it is most often
attributable to partitioning, with people ending up accessing way too
many partitions (thousands) without proper partition pruning -
sometimes it even triggers re-partitioning of the said tables. So maybe
the realistic "fastpath sizing" should assume something that supports
the following (a quick way to measure the lock count for a
representative query is sketched after this list):
a) the usual number of tables in JOINs (just a few of them, as today) -> ok
b) 1-month interval partitions for, let's say, 5 years (12*5 = 60),
joined to another table partitioned the same way, which gives what, max
~120? -> so if you have users doing SELECT * FROM such_table, they will
probably already have set max_locks_per_xact to something higher.
c) HASH partitioning up to the vCPU counts seen in the wild (say 64 or
128? so roughly the same as above?)
d) we probably should not care here at all if somebody wants daily
partitioning across years with HASH (sub)partitions and no partition
pruning -> that has nothing to do with being "fast" anyway
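A rough sketch of that measurement, assuming you can run the
representative query in a throwaway transaction ("orders" below is just
a made-up example table):

BEGIN;
-- run a representative query against the partitioned table
SELECT count(*) FROM orders;
-- relation locks are held until commit/rollback, so count them now
SELECT fastpath, count(*)
  FROM pg_locks
 WHERE pid = pg_backend_pid()
   AND locktype = 'relation'
 GROUP BY 1;
ROLLBACK;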

Judging from the current reports, people have configured
max_locks_per_xact like this: ~75% have it at the default (64), 10%
have 1024, 5% have 128, and the rest (~10%) have values between 100 and
a few thousand, with extreme one-offs at 25k (wild misconfiguration
judging from the other specs).

BTW: you probably need to register this $thread in the CF for others to
see it too (it doesn't seem to be there?)

-J.
