From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
---|---|
To: | Andy Fan <zhihuifan1213(at)163(dot)com> |
Cc: | PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Make tuple deformation faster |
Date: | 2024-07-01 10:07:09 |
Message-ID: | CAApHDvrGtsB0Ww_LuTUwVhjd1VJQXDrJwbiE=06MxvUf4HU9SQ@mail.gmail.com |
Lists: | pgsql-hackers |
On Mon, 1 Jul 2024 at 21:17, Andy Fan <zhihuifan1213(at)163(dot)com> wrote:
> Yet another wonderful optimization! I just want to know how you found
> this optimization (the CPU cache hit case) and decided it was worth
> some time, because before we invest our time in optimizing something,
> it is better to know that we can get some measurable improvement for
> the time spent. As for this case, 30% is a really huge number, even if
> it is an artificial case.
>
> Another case: Andrew introduced NullableDatum 5 years ago and said
> that using it in TupleTableSlot could be CPU-cache friendly. I can
> follow that, but how much could it improve things in an ideal case,
> and is it possible to forecast that somehow? I ask here because both
> cases are optimizing for the CPU cache.
Have a look at:
perf stat --pid=<backend pid>
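For anyone wanting to reproduce this, here's a minimal sketch. The PID is just the example value from the master run below; in practice, run "SELECT pg_backend_pid();" in the session executing the benchmark to get yours:

```shell
# Sketch: sample hardware counters of the backend running the benchmark
# for exactly 10 seconds. BACKEND_PID is the example PID from the master
# run; get the real one with "SELECT pg_backend_pid();" in that session.
BACKEND_PID=389510
# "echo" makes this safe to run anywhere; drop it to actually sample.
echo perf stat --pid="$BACKEND_PID" sleep 10
```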
On my AMD Zen4 machine running the 16 extra column test from the
script in my last email, I see:
$ echo master && perf stat --pid=389510 sleep 10
master

 Performance counter stats for process id '389510':

        9990.65 msec task-clock:u              #    0.999 CPUs utilized
              0      context-switches:u        #    0.000 /sec
              0      cpu-migrations:u          #    0.000 /sec
              0      page-faults:u             #    0.000 /sec
    49407204156      cycles:u                  #    4.945 GHz
       18529494      stalled-cycles-frontend:u #    0.04% frontend cycles idle
        8505168      stalled-cycles-backend:u  #    0.02% backend cycles idle
   165442142326      instructions:u            #    3.35  insn per cycle
                                               #    0.00  stalled cycles per insn
    39409877343      branches:u                #    3.945 G/sec
      146350275      branch-misses:u           #    0.37% of all branches

   10.001012132 seconds time elapsed
$ echo patched && perf stat --pid=380216 sleep 10
patched

 Performance counter stats for process id '380216':

        9989.14 msec task-clock:u              #    0.998 CPUs utilized
              0      context-switches:u        #    0.000 /sec
              0      cpu-migrations:u          #    0.000 /sec
              0      page-faults:u             #    0.000 /sec
    49781280456      cycles:u                  #    4.984 GHz
       22922276      stalled-cycles-frontend:u #    0.05% frontend cycles idle
       24259785      stalled-cycles-backend:u  #    0.05% backend cycles idle
   213688149862      instructions:u            #    4.29  insn per cycle
                                               #    0.00  stalled cycles per insn
    44147675129      branches:u                #    4.420 G/sec
       14282567      branch-misses:u           #    0.03% of all branches

   10.005034271 seconds time elapsed
You can see the branch predictor has done a *much* better job in the
patched code vs master, with about 10x fewer misses. This should have
helped contribute to the "insn per cycle" increase. 4.29 is quite good
for postgres; I often see that figure around 0.5. According to [1]
(relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
micro-op cache". I'm unsure how micro-ops translate to the "insn per
cycle" shown in perf stat; I thought 4-5 was about the maximum pipeline
width of today's CPUs. Maybe someone else can explain it better than I
can. In simpler terms: generally, the higher the "insn per cycle" the
better, and the lower the idle and branch-miss percentages the better.
However, you'll notice that the patched version has more frontend and
backend stalls. I assume this is because executing more instructions
per cycle, thanks to improved branch prediction, causes memory and
instruction-decode stalls to occur more frequently; effectively (I
think) it's just hitting the next bottleneck(s): memory and instruction
decoding. Modern CPUs can out-pace RAM in many workloads, so perhaps
it's not that surprising that "backend cycles idle" has gone up given
such a large increase in instructions per cycle from improved branch
prediction.
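As a sanity check, the headline figures can be recomputed from the raw counters above (numbers copied verbatim from the two perf runs; awk is used only for the floating-point division):

```shell
# Recompute insn-per-cycle and the branch-miss improvement from the
# counters in the two perf stat runs above.
awk 'BEGIN {
    printf "master  IPC: %.2f\n", 165442142326 / 49407204156
    printf "patched IPC: %.2f\n", 213688149862 / 49781280456
    printf "branch-miss reduction: %.1fx\n", 146350275 / 14282567
}'
```

which agrees with perf's own 3.35 and 4.29 insn-per-cycle figures and the roughly 10x drop in branch misses.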
It would be nice to see this tested on some modern Intel CPU. A 13th
or 14th generation part, for example, though even any Intel CPU from
the past 5 years would be better than nothing.
David
[1] https://chipsandcheese.com/2022/11/05/amds-zen-4-part-1-frontend-and-execution-engine/