From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: scalability bottlenecks with (many) partitions (and more)
Date: 2024-11-20 16:58:05
Message-ID: CAEze2Wgbr_UcMQsj2sJ9bbXtuDs0b0RF=RmpqLPRzGCD=Fn_Mg@mail.gmail.com
Lists: pgsql-hackers
On Wed, 4 Sept 2024 at 17:32, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 9/4/24 16:25, Matthias van de Meent wrote:
> > On Tue, 3 Sept 2024 at 18:20, Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> >> FWIW the actual cost is somewhat higher, because we seem to need ~400B
> >> for every lock (not just the 150B for the LOCK struct).
> >
> > We do indeed allocate two PROCLOCKs for every LOCK, and allocate those
> > inside dynahash tables. That amounts to (152+2*64+3*16=) 328 bytes in
> > dynahash elements, and (3 * 8-16) = 24-48 bytes for the dynahash
> > buckets/segments, resulting in 352-376 bytes * NLOCKENTS() being
> > used[^1]. Does that align with your usage numbers, or are they
> > significantly larger?
> >
>
> I see more like ~470B per lock. If I patch CalculateShmemSize to log the
> shmem allocated, I get this:
>
> max_connections=100 max_locks_per_transaction=1000 => 194264001
> max_connections=100 max_locks_per_transaction=2000 => 241756967
>
> and (((241756967-194264001)/100/1000)) = 474
>
> Could be alignment of structs or something, not sure.
NLOCKENTS is calculated based on MaxBackends, which is the sum of
MaxConnections + autovacuum_max_workers + 1 + max_worker_processes +
max_wal_senders; with default settings that adds 22 more slots.
After adjusting for that, we get ~388 bytes/lock, which is
approximately in line with the calculation.
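For reference, redoing the arithmetic above with those 22 extra slots
included (3 autovacuum workers + 1 launcher + 8 worker processes +
10 WAL senders, and max_prepared_transactions at its default of 0):

  (241756967 - 194264001) / ((100 + 22) * 1000) ~= 389 bytes/lock entry

i.e. in line with the figure above, and much closer to the 352-376
byte estimate than the raw 474.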
> >> At least based on a quick experiment. (Seems a bit high, right?).
> >
> > Yeah, that does seem high, thanks for nerd-sniping me.
[...]
> > Altogether that'd save 40 bytes/lock entry on size, and ~35
> > bytes/lock on "safety margin", for a saving of (up to) 19% of our
> > current allocation. I'm not sure whether these tricks would help or
> > hurt performance, apart from smaller structs usually fitting better
> > in CPU caches.
> >
>
> Not sure either, but it seems worth exploring. If you do an experimental
> patch for the LOCK size reduction, I can get some numbers.
It took me some time to get back to this, and a few hours to
experiment, but here is that experimental patch, or rather a set of
them. Attached are 4 patches, which together reduce the size of the
shared lock tables by about 34% on my 64-bit system.
1/4 implements the MAX_LOCKMODES changes to LOCK I mentioned before,
saving 16 bytes.
2/4 packs the LOCK struct more tightly, for another 8 bytes saved.
3/4 reduces the PROCLOCK struct size by 8 bytes through a PGPROC* ->
ProcNumber substitution, which lets it pack together with fields that
were already reduced in size in patch 2/4 (see the sketch below).
4/4 reduces the size of the PROCLOCK table by limiting the average
number of per-backend locks to max_locks_per_transaction (rather than
the current 2*max_locks_per_transaction, which allows for locks that
other backends have also requested), and makes the shared lock tables
fully pre-allocated.
1-3 together save 11% on the lock tables in 64-bit builds, and 4/4
saves another ~25%, for a total of ~34% on per-lockentry shared memory
usage; from ~360 bytes to ~240 bytes.
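To make the kind of packing in 2/4 and 3/4 concrete, here's a minimal
standalone sketch (toy structs, not the actual LOCK/PROCLOCK layouts):
replacing an 8-byte PGPROC pointer with a 4-byte ProcNumber only pays
off once a neighbouring 4-byte field can move into the freed slot:

#include <stdint.h>
#include <stdio.h>

typedef int ProcNumber;         /* 4-byte backend index, as in PostgreSQL */

struct wide                     /* hypothetical "before" layout */
{
    void       *owner_proc;     /* 8-byte PGPROC pointer */
    void       *lock;           /* 8-byte pointer to the lock object */
    uint32_t    holdMask;       /* 4 bytes + 4 bytes of tail padding */
};                              /* sizeof == 24 on common 64-bit ABIs */

struct narrow                   /* hypothetical "after" layout */
{
    void       *lock;           /* 8-byte pointer to the lock object */
    ProcNumber  owner_procno;   /* 4-byte index into the PGPROC array */
    uint32_t    holdMask;       /* now occupies what used to be padding */
};                              /* sizeof == 16 on common 64-bit ABIs */

int
main(void)
{
    printf("before: %zu bytes, after: %zu bytes\n",
           sizeof(struct wide), sizeof(struct narrow));
    return 0;
}

The real structs are larger, but the mechanism (and the 8-byte saving
per substitution) is the same.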
Note that this doesn't include the ~4.5 bytes added per PGPROC entry
per unit of max_locks_per_transaction for fast-path locking; I've
ignored those for now.
Not implemented, but technically possible: the PROCLOCK table _could_
be further reduced in size by exploiting the fact that each of these
structs is always stored after a dynahash HASHELEMENT, which has 4
bytes of padding on 64-bit systems. By changing PROCLOCKTAG's myProc
to a ProcNumber, that field could be packed into the padding of the
hash element header, reducing the effective size of the hash table's
entries by 8 bytes, and thus the total size of the tables by another
few %. I don't think that trade-off is worth it though, given the
complexity and trickery required to get that to work well.
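For context, the padding in question comes from dynahash's per-entry
header, which is just a pointer plus a uint32 (HASHELEMENT in
src/include/utils/hsearch.h); spelled with stdint types here so the
snippet compiles standalone:

#include <stdint.h>
#include <stdio.h>

typedef struct HASHELEMENT
{
    struct HASHELEMENT *link;   /* next element in the same hash bucket */
    uint32_t    hashvalue;      /* hash value cached for this entry's key */
} HASHELEMENT;

int
main(void)
{
    /* 12 bytes of fields rounded up to 16 by alignment, so there are
     * 4 bytes of tail padding ahead of every entry stored in the table. */
    printf("sizeof(HASHELEMENT) = %zu\n", sizeof(HASHELEMENT));
    return 0;
}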
> I'm not sure about the safety margins. 10% sure seems like quite a bit
> of memory (it might not have been in the past, but as instances are
> growing, that has probably changed).
I have not yet touched this safety margin.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachment | Content-Type | Size
v0-0002-Reduce-size-of-LOCK-by-8-more-bytes.patch | application/octet-stream | 13.4 KB
v0-0001-Reduce-size-of-LOCK-by-16-bytes.patch | application/octet-stream | 14.5 KB
v0-0003-Reduce-size-of-PROCLOCK-by-8-bytes-on-64-bit-syst.patch | application/octet-stream | 4.8 KB
v0-0004-Reduce-PROCLOCK-hash-table-size.patch | application/octet-stream | 3.1 KB