Re: Avoid stack frame setup in performance critical routines using tail calls

From: Andres Freund <andres(at)anarazel(dot)de>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Subject: Re: Avoid stack frame setup in performance critical routines using tail calls
Date: 2021-07-20 06:16:57
Message-ID: 20210720061657.bcueir3krgmkt6m5@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2021-07-20 16:50:09 +1200, David Rowley wrote:
> I've not taken the time to study the patch but I was running some
> other benchmarks today on a small scale pgbench readonly test and I
> took this patch for a spin to see if I could see the same performance
> gains.

Thanks!

> This is an AMD 3990x machine that seems to get the most throughput
> from pgbench with 132 processes
>
> I did: pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared
> --random-seed=12345 postgres
>
> master = dd498998a
>
> Master: 3816959.53 tps
> Patched: 3820723.252 tps
>
> I didn't quite get the same 2-3% as you did, but it did come out
> faster than on master.

It would not at all be suprising to me if AMD in recent microarchitectures did
a better job at removing stack management overview (e.g. by better register
renaming, or by resolving dependencies on %rsp in a smarter way) than Intel
has. This was on a Cascade Lake CPU (xeon 5215), which, despite being released
in 2019, effectively is a moderately polished (or maybe shoehorned)
microarchitecture from 2015 due to all the Intel troubles. Whereas Zen2 is
from 2019.

It's also possible that my attempts at avoiding the stack management just
didn't work on your compiler. Either due to vendor (I know that gcc is better
at it than clang), version, or compiler flags (e.g. -fno-omit-frame-pointer
could make it harder, -fno-optimize-sibling-calls would disable it).

A third plausible explanation for the difference is that at a client count of
132, the bottlenecks are sufficiently elsewhere to just not show a meaningful
gain from memory management efficiency improvements.

Any chance you could show a `perf annotate AllocSetAlloc` and `perf annotate
palloc` from a patched run? And perhaps how high their percentages of the
total work are. E.g. using something like
perf report -g none|grep -E 'AllocSetAlloc|palloc|MemoryContextAlloc|pfree'

It'd be interesting to know where the bottlenecks on a zen2 machine are.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2021-07-20 06:53:39 Re: Avoid stack frame setup in performance critical routines using tail calls
Previous Message Greg Nancarrow 2021-07-20 06:08:01 Re: row filtering for logical replication