Re: Some performance degradation in REL_16 vs REL_15

From: Andres Freund <andres(at)anarazel(dot)de>
To: "Anton A(dot) Melnikov" <a(dot)melnikov(at)postgrespro(dot)ru>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, iamqyh(at)gmail(dot)com, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Some performance degradation in REL_16 vs REL_15
Date: 2023-11-15 18:04:33
Message-ID: 20231115180433.p3eeaczbam5zxdz5@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-11-15 11:33:44 +0300, Anton A. Melnikov wrote:
> The configure options and test scripts on my pc and server were the same:
> export CFLAGS="-O2"
> ./configure --enable-debug --with-perl --with-icu --enable-depend --enable-tap-tests
> #reinstall
> #reinitdb
> #create database bench
> for ((i=0; i<100; i++)); do pgbench -U postgres -i -s8 bench> /dev/null 2>&1;
> psql -U postgres -d bench -c "checkpoint"; RES=$(pgbench -U postgres -c6 -T20 -j6 bench;

Even with scale 8 you're likely significantly impacted by contention. And
obviously WAL write latency. See below for why that matters.
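
If the goal is to compare the CPU costs of the two branches rather than lock or
I/O behaviour, it may help to first take the commit-time WAL flush and the
row-level contention out of the picture. A possible variant of your loop, using
only standard pgbench/PostgreSQL facilities (adjust options as needed):

# read-only run: no row-level lock contention, no WAL flushes
pgbench -U postgres -n -S -c6 -j6 -T20 bench

# or keep the read/write script, but don't wait for WAL flushes at commit
psql -U postgres -d bench -c "ALTER SYSTEM SET synchronous_commit = off"
psql -U postgres -d bench -c "SELECT pg_reload_conf()"
pgbench -U postgres -n -c6 -j6 -T20 bench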

> I can't understand why I get the opposite results on my PC and on the server. It is clear that the absolute
> TPS values will be different for various configurations. This is normal. But the differences?
> It is likely that some kind of reference configuration is needed to accurately
> measure the difference in performance. Probably something is wrong with my PC, but so far
> I cannot figure out what's wrong.

One very common reason for symptoms like this is power-saving behaviour by the
CPU. In workloads where the CPU is not meaningfully utilized, it will drop into
a power-saving mode - which can badly affect latency-sensitive workloads, both
because the CPU initially runs at a lower frequency and because it takes time
to shift back to a higher frequency.
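
To see whether that is what's happening, the frequency scaling setup and the
current per-core clock can be checked (this assumes the cpupower tool used
below and a cpufreq-capable kernel):

# show driver, governor and current frequency for one core
cpupower -c 11 frequency-info

# or watch the current frequency of all cores directly
grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq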

Here's an example:
I bound the server and psql to the same CPU core (nothing else was allowed to
use that core) and ran the following:

\o /dev/null
SELECT 1; SELECT 1; SELECT 1; SELECT pg_sleep(0.1); SELECT 1; SELECT 1; SELECT 1;
Time: 0.181 ms
Time: 0.085 ms
Time: 0.071 ms
Time: 100.474 ms
Time: 0.153 ms
Time: 0.077 ms
Time: 0.069 ms

You can see how the timing of the first query was slower, the next two were
faster, and then after the pg_sleep() it's slow again.
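
For reference, one way to reproduce this kind of measurement (the core number
and connection details are just placeholders):

# start the server pinned to core 11, so backends inherit the CPU mask
numactl --physcpubind=11 pg_ctl -D "$PGDATA" start

# run psql on the same core, timing each statement
taskset -c 11 psql postgres <<'EOF'
\timing on
\o /dev/null
SELECT 1; SELECT 1; SELECT 1; SELECT pg_sleep(0.1); SELECT 1; SELECT 1; SELECT 1;
EOF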

# tell the CPU to optimize for performance not power
cpupower frequency-set --governor performance

# disable going to lower power states
cpupower idle-set -D0

# disable turbo mode for consistent performance
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

Now the timings are:
Time: 0.038 ms
Time: 0.028 ms
Time: 0.025 ms
Time: 1000.262 ms (00:01.000)
Time: 0.027 ms
Time: 0.024 ms
Time: 0.023 ms

Look, fast and reasonably consistent timings.

Switching back:
Time: 0.155 ms
Time: 0.123 ms
Time: 0.074 ms
Time: 1001.235 ms (00:01.001)
Time: 0.120 ms
Time: 0.077 ms
Time: 0.068 ms
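
The "switching back" step is just the reverse of the three commands above; on
this kind of setup it would look roughly like this (the default governor may
differ depending on the cpufreq driver):

# back to the power-saving defaults
cpupower frequency-set --governor powersave
cpupower idle-set -E
echo 0 > /sys/devices/system/cpu/intel_pstate/no_turbo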

The perverse thing is that this often means that *reducing* the number of
instructions executed leads to *worse* behaviour under non-sustained load,
because from the CPU's point of view there is less need to increase the clock
speed.

To show how much of a difference that can make, I ran pgbench with a single
client on one core and the server on another (so each CPU is idle in between):
numactl --physcpubind 11 pgbench -n -M prepared -P1 -S -c 1 -T10

With power optimized configuration:
latency average = 0.035 ms
latency stddev = 0.002 ms
initial connection time = 5.255 ms
tps = 28434.334672 (without initial connection time)

With performance optimized configuration:
latency average = 0.025 ms
latency stddev = 0.001 ms
initial connection time = 3.544 ms
tps = 40079.995935 (without initial connection time)

That's a whopping 1.4x in throughput!

Now, the same thing, except that I used a custom workload where pgbench
transactions are executed in a pipelined fashion, 100 read-only transactions
per script execution (a rough sketch of such a script is included below):

With power optimized configuration:
latency average = 1.055 ms
latency stddev = 0.125 ms
initial connection time = 6.915 ms
tps = 947.985286 (without initial connection time)

(this means we actually executed 94798.5286 readonly pgbench transactions/s)

With performance optimized configuration:
latency average = 1.376 ms
latency stddev = 0.083 ms
initial connection time = 3.759 ms
tps = 726.849018 (without initial connection time)

Suddenly the super-duper performance-optimized settings are worse (but note
that stddev is down)! I suspect the problem is that, because idle states are
now disabled, the CPU ends up clocking *lower* due to the increased power
usage.
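
The custom pipelined script itself isn't included here; a rough sketch of how
such a workload could be generated (assuming pgbench's
\startpipeline/\endpipeline meta-commands, which require -M extended or
prepared) would be:

# generate a script that runs 100 pipelined read-only transactions per execution
{
  printf '%s\n' '\set aid random(1, 100000 * :scale)'
  printf '%s\n' '\startpipeline'
  for i in $(seq 100); do
    printf '%s\n' 'SELECT abalance FROM pgbench_accounts WHERE aid = :aid;'
  done
  printf '%s\n' '\endpipeline'
} > pipelined.sql

numactl --physcpubind=11 pgbench -n -M prepared -P1 -c 1 -T10 -f pipelined.sql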

If I just change the relevant *cores* to the performance optimized
configuration:

cpupower -c 10,11 idle-set -D0; cpupower -c 10,11 frequency-set --governor performance

latency average = 0.940 ms
latency stddev = 0.061 ms
initial connection time = 3.311 ms
tps = 1063.719116 (without initial connection time)

It wins again.

Now, realistically you'd never use -D0 (i.e. disabling all idle states, not
just the deeper ones) - the power differential is quite big, and as shown here
it can hurt performance as well.
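
To pick a sensible threshold, the available idle states and their exit
latencies can be listed first; cpupower idle-set -D <n> then disables only the
states whose latency is greater than <n> microseconds:

# list this machine's C-states and their wakeup latencies
cpupower idle-info

# keep the shallow states, disable only the deeper, slow-to-wake ones
cpupower idle-set -D5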

On an idle system, looking at the cpu power usage with:
powerstat -D -R 5 1000

Time User Nice Sys Idle IO Run Ctxt/s IRQ/s Fork Exec Exit Watts pkg-0 dram pkg-1
09:45:03 0.6 0.0 0.2 99.2 0.0 1 2861 2823 0 0 0 46.84 24.82 3.68 18.33
09:45:08 1.0 0.0 0.1 99.0 0.0 2 2565 1602 0 0 0 54.78 28.92 4.74 21.12
09:45:13 0.8 0.0 0.3 98.9 0.0 1 3769 2322 0 0 0 55.65 29.43 4.72 21.50
09:45:18 0.8 0.0 0.1 99.1 0.0 3 2513 1479 0 0 0 51.95 27.47 4.23 20.24
09:45:23 0.6 0.0 0.1 99.3 0.0 1 2282 1448 0 0 0 49.44 26.12 3.91 19.41
09:45:28 0.8 0.0 0.1 99.1 0.0 2 2422 1465 0 0 0 51.79 27.33 4.27 20.19
09:45:33 0.9 0.0 0.1 99.0 0.0 2 2358 1566 0 0 0 55.05 29.03 4.73 21.29
09:45:38 0.6 0.0 0.1 99.4 0.0 1 2976 4207 0 0 0 54.38 29.08 4.02 21.28
09:45:43 1.2 0.0 0.2 98.6 0.0 2 3988 37670 0 0 0 125.51 64.15 7.97 53.39
09:45:48 0.6 0.0 0.2 99.2 0.0 2 3263 40000 0 0 0 126.31 63.84 7.97 54.50
09:45:53 0.4 0.0 0.0 99.6 0.0 1 2333 39716 0 0 0 125.94 63.64 7.90 54.40
09:45:58 0.3 0.0 0.0 99.6 0.0 1 2783 39795 0 0 0 125.93 63.58 7.90 54.44
09:46:03 0.6 0.0 0.2 99.2 0.0 1 3081 24910 0 0 0 93.55 47.69 6.10 39.75
09:46:08 0.5 0.0 0.1 99.4 0.0 1 2623 1356 0 0 0 42.65 22.59 3.18 16.89

(interesting columns are Watts, pkg-0, dram, pkg-1)

Initially this was with cpupower idle-set -E, then I ran cpupower idle-set
-D0, and then cpupower idle-set -E again - you can easily see where in the
output.

If I use a more sensible -D5 system-wide and run the pgbench from earlier, I
get the performance *and* sensible power usage:

latency average = 0.946 ms
latency stddev = 0.063 ms
initial connection time = 3.482 ms
tps = 1057.413946 (without initial connection time)

09:50:36 0.4 0.0 0.1 99.5 0.0 1 2593 1206 0 0 0 42.63 22.47 3.18 16.97

To get back to your benchmark: because you're measuring a highly contended
system, where most of the time postgres will just be waiting for
a) row-level locks to be released
b) WAL writes to finish (you're clearly not using a very fast disk)
the CPU has plenty of time to clock down.
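
If you want to confirm that in your runs, one quick way is to sample the wait
events while pgbench is running, e.g.:

# see what the client backends are mostly waiting on during the benchmark
psql -U postgres -d bench -c "
  SELECT wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
  WHERE backend_type = 'client backend'
  GROUP BY 1, 2
  ORDER BY count(*) DESC"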

Benchmarking sucks. Modern hardware realities suck.

Greetings,

Andres Freund
