Re: Make tuple deformation faster

From: John Naylor <johncnaylorls(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Andy Fan <zhihuifan1213(at)163(dot)com>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Make tuple deformation faster
Date: 2024-07-25 03:18:15
Message-ID: CANWCAZZe63DHpCEttKKf-sgj7726QtE0Vwm4jCX42a9x1oJ+=g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 1, 2024 at 5:07 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> cycles idle
> 8505168 stalled-cycles-backend:u # 0.02% backend cycles idle
> 165442142326 instructions:u # 3.35 insn per cycle
> # 0.00 stalled cycles per insn
> 39409877343 branches:u # 3.945 G/sec
> 146350275 branch-misses:u # 0.37% of all branches

> patched

> cycles idle
> 24259785 stalled-cycles-backend:u # 0.05% backend cycles idle
> 213688149862 instructions:u # 4.29 insn per cycle
> # 0.00 stalled cycles per insn
> 44147675129 branches:u # 4.420 G/sec
> 14282567 branch-misses:u # 0.03% of all branches

> You can see the branch predictor has done a *much* better job in the
> patched code vs master with about 10x fewer misses. This should have

Nice!
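
For anyone who wants to check the "about 10x" figure, here is a small Python sketch that recomputes the miss rates from the counters quoted above. It assumes the first (unlabeled) block of numbers is the master run, which the excerpt implies but does not state; the dictionary and function names are only illustrative.

```python
# Counter values copied from the quoted perf stat output.
# Assumption: the first, unlabeled block is the master run.
master = {"branches": 39_409_877_343, "branch_misses": 146_350_275}
patched = {"branches": 44_147_675_129, "branch_misses": 14_282_567}


def miss_rate(counters):
    """Fraction of executed branches that were mispredicted."""
    return counters["branch_misses"] / counters["branches"]


master_rate = miss_rate(master)
patched_rate = miss_rate(patched)

# How many times more mispredictions master took than patched.
miss_ratio = master["branch_misses"] / patched["branch_misses"]

print(f"master:  {master_rate:.2%} of branches mispredicted")
print(f"patched: {patched_rate:.2%} of branches mispredicted")
print(f"master took {miss_ratio:.1f}x as many misses as patched")
```

The rates come out matching the 0.37% and 0.03% that perf printed, and the ratio of raw miss counts is a little over 10, even though the patched run executed more branches overall.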

> helped contribute to the "insn per cycle" increase. 4.29 is quite
> good for postgres. I often see that around 0.5. According to [1]
> (relating to Zen4), "We get a ridiculous 12 NOPs per cycle out of the
> micro-op cache". I'm unsure how micro-ops translate to "insn per
> cycle" that's shown in perf stat. I thought 4-5 was about the maximum
> pipeline size from today's era of CPUs.

"insn per cycle" is micro-ops retired (i.e. excludes those executed
speculatively on a mispredicted branch).
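
Since perf derives that figure as instructions divided by cycles, the cycle counts that were cut off in the quote can be roughly estimated by inverting it. The Python sketch below does that; it assumes the first block is the master run, and the results are only approximate because the printed IPC is rounded to two decimals. The names are illustrative.

```python
# Instruction counts and "insn per cycle" copied from the quoted output.
# Assumption: the first, unlabeled block is the master run.
runs = {
    "master": {"instructions": 165_442_142_326, "ipc": 3.35},
    "patched": {"instructions": 213_688_149_862, "ipc": 4.29},
}


def implied_cycles(run):
    """Estimate cycles from IPC = instructions / cycles (IPC is rounded)."""
    return run["instructions"] / run["ipc"]


cycles = {name: implied_cycles(run) for name, run in runs.items()}

# Relative change in retired instructions and in estimated cycles.
insn_ratio = runs["patched"]["instructions"] / runs["master"]["instructions"]
cycle_ratio = cycles["patched"] / cycles["master"]

for name, value in cycles.items():
    print(f"{name}: ~{value / 1e9:.1f} billion cycles (estimated)")
print(f"instructions patched/master: {insn_ratio:.2f}")
print(f"cycles patched/master:       {cycle_ratio:.2f}")
```

By this estimate the two runs spent a similar number of cycles while the patched run retired noticeably more instructions, which is what a higher IPC means. The excerpt does not say how the workload was bounded, so this says nothing by itself about work done per run.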

That article mentions that 6 micro-ops per cycle can enter the backend
from the frontend, but that can happen only with internally cached
ops, since only 4 instructions per cycle can be decoded. In specific
cases, CPUs can fuse multiple front-end instructions into a single
macro-op, which I think means a pair of micro-ops that can "travel
together" as one. The authors concluded further down that "Zen 4’s
reorder buffer is also special, because each entry can hold up to 4
NOPs. Pairs of NOPs are likely fused by the decoders, and pairs of
fused NOPs are fused again at the rename stage."
