From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | John Naylor <john(dot)naylor(at)2ndquadrant(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: tweaking perfect hash multipliers |
Date: | 2020-03-30 18:31:46 |
Message-ID: | 20200330183146.nfvqclsy73tkxuwd@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2020-03-30 21:33:14 +0800, John Naylor wrote:
> Then I used the attached program to measure various combinations of
> compiled instructions using two constant multipliers iterating over
> bytes similar to a generated hash function.
It looks like you didn't attach the program?
> <cc> -O2 -Wall test-const-mult.c test-const-mult-2.c
> ./a.out
> Median of 3 with clang 10:
>
> lea, lea 0.181s
>
> lea, lea+add 0.248s
> lea, shift+add 0.251s
>
> lea+add, shift+add 0.273s
> shift+add, shift+add 0.276s
>
> 2 leas, 2 leas 0.290s
> shift+add, imul 0.329s
>
> Taking this with a grain of salt, it nonetheless seems plausible that
> a single lea could be faster than any two instructions here.
It's a bit complicated by the fact that there's more execution ports to
execute shift/add than there ports to compute some form of leas. And
some of that won't easily be measurable in a micro-benchmark, because
there'll be dependencies between the instruction preventing any
instruction level parallelism.
I think the form of lea generated here is among the ones that can only
be executed on port 1. Whereas e.g. an register+register/immediate add
can be executed on four different ports.
There's also a significant difference in latency that you might not see
in your benchmark. E.g. on coffee lake the relevant form of lea has a
latency of three cycles, but one independent lea can be "started" per
cycle (agner calls this "reciprocal throughput). Whereas a shift has a
latency of 1 cycle and a reciprocal throughput of 0.5 (lower is better),
add has a latency o 1 and a reciprocal throughput of 0.25.
See the tables in https://www.agner.org/optimize/instruction_tables.pdf
I'm not really sure my musings above matter terribly much, but I just
wanted to point out why I'd not take too much stock in the above timings
in isolation. Even a very high latency wouldn't necessarily be penalized
in a benchmark with one loop iteration independent from each other, but
would matter in the real world.
Cool work!
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Justin Pryzby | 2020-03-30 18:34:39 | Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly |
Previous Message | Tom Lane | 2020-03-30 18:28:12 | Re: Recognizing superuser in pg_hba.conf |