Re: tweaking perfect hash multipliers

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: tweaking perfect hash multipliers
Date:	2020-03-30 18:31:46
Message-ID:	20200330183146.nfvqclsy73tkxuwd@alap3.anarazel.de
Lists:	pgsql-hackers

Hi,

On 2020-03-30 21:33:14 +0800, John Naylor wrote:
> Then I used the attached program to measure various combinations of
> compiled instructions using two constant multipliers iterating over
> bytes similar to a generated hash function.

It looks like you didn't attach the program?

> <cc> -O2 -Wall test-const-mult.c test-const-mult-2.c
> ./a.out
> Median of 3 with clang 10:
>
> lea, lea 0.181s
>
> lea, lea+add 0.248s
> lea, shift+add 0.251s
>
> lea+add, shift+add 0.273s
> shift+add, shift+add 0.276s
>
> 2 leas, 2 leas 0.290s
> shift+add, imul 0.329s
>
> Taking this with a grain of salt, it nonetheless seems plausible that
> a single lea could be faster than any two instructions here.

It's a bit complicated by the fact that there are more execution ports
that can execute shift/add than there are ports that can compute some
forms of lea. And some of that won't easily be measurable in a
micro-benchmark, because there'll be dependencies between the
instructions, preventing any instruction-level parallelism.

I think the form of lea generated here is among the ones that can only
be executed on port 1. Whereas e.g. a register+register/immediate add
can be executed on four different ports.
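
For readers without the test program at hand, the instruction choices under discussion can be sketched as below. This is not the attached benchmark and not the generated hash function: the multiplier 9 and the function names are hypothetical, picked only because x*9 equals x + x*8, which fits lea's base + index*scale addressing form. Whether a compiler emits imul, lea, or shift+add for the first function depends on the compiler, target and flags, so check the generated assembly rather than assuming.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical multiplier, chosen only because x * 9 == x + x * 8. */
#define MULT 9u

/*
 * Plain constant multiply. The compiler is free to lower this to imul,
 * to a single lea, or to shift+add.
 */
static uint32_t
hash_mul(const unsigned char *key, size_t len)
{
    uint32_t    h = 0;

    for (size_t i = 0; i < len; i++)
        h = h * MULT + key[i];
    return h;
}

/*
 * The same computation with the multiply spelled out as shift+add.
 * Unsigned arithmetic wraps identically, so the results must match.
 */
static uint32_t
hash_shift_add(const unsigned char *key, size_t len)
{
    uint32_t    h = 0;

    for (size_t i = 0; i < len; i++)
        h = ((h << 3) + h) + key[i];
    return h;
}
```

Both functions return the same value for any input; only the instruction selection, and therefore the port usage and latency, can differ.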

There's also a significant difference in latency that you might not see
in your benchmark. E.g. on Coffee Lake the relevant form of lea has a
latency of three cycles, but one independent lea can be "started" per
cycle (Agner calls this "reciprocal throughput"). Whereas a shift has a
latency of 1 cycle and a reciprocal throughput of 0.5 (lower is better),
and add has a latency of 1 and a reciprocal throughput of 0.25.

See the tables in https://www.agner.org/optimize/instruction_tables.pdf

I'm not really sure my musings above matter terribly much, but I just
wanted to point out why I'd not put too much stock in the above timings
in isolation. Even a very high latency wouldn't necessarily be penalized
in a benchmark whose loop iterations are independent of each other, but
it would matter in the real world.
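
The latency versus throughput distinction can be made concrete with two loop shapes. The following is a minimal sketch under stated assumptions: the function names and the multiplier 9 are hypothetical, and an optimizing compiler may vectorize or otherwise transform either loop. In the first loop every multiply needs the previous value of h, so the multiply's latency sits on the critical path. In the second the multiplies do not depend on each other, so only the cheap add is carried from one iteration to the next.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Dependent chain: each step consumes the previous h, so the latency of
 * the multiply (however it is lowered) is paid on every iteration.
 */
static uint32_t
chain_dependent(const unsigned char *buf, size_t len)
{
    uint32_t    h = 0;

    for (size_t i = 0; i < len; i++)
        h = h * 9u + buf[i];
    return h;
}

/*
 * Independent multiplies: each product depends only on its own input
 * byte. Only the accumulating add forms a loop-carried dependency.
 */
static uint32_t
sum_independent(const unsigned char *buf, size_t len)
{
    uint32_t    s = 0;

    for (size_t i = 0; i < len; i++)
        s += buf[i] * 9u;
    return s;
}
```

Timing the second shape mostly measures how many multiplies can be started per cycle, while a hash function over key bytes has the first shape.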

Cool work!

Greetings,

Andres Freund

In response to

tweaking perfect hash multipliers at 2020-03-30 13:33:14 from John Naylor

Responses

Re: tweaking perfect hash multipliers at 2020-03-30 19:10:59 from John Naylor
Re: tweaking perfect hash multipliers at 2020-03-31 08:05:55 from John Naylor