From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
---|---|
To: | Ildar Musin <i(dot)musin(at)postgrespro(dot)ru> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: General purpose hashing func in pgbench |
Date: | 2017-12-20 07:36:27 |
Message-ID: | alpine.DEB.2.20.1712200807550.5165@lancre |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello Ildar,
> Following up the recent discussion on zipfian distribution I was trying
> to reproduce some YCSB-like workloads. As this paper [1] describes, YCSB
> uses zipfian distribution to generate keys in order simulate intensive
> load on small number of records as it happens in real world applications
> (e.g. blogs). One problem is that most popular records keys are
> clustered together. To scatter them across the keyspace authors use
> hashing, the FNV-1a hash function in particular [2].
Yes, clustering is an issue for realistic load tests and may influence the
resulting measured performance unduly.
> I've read Fabien Coelho's thread on additional operators and functions.
> Generally it could be possible to implement some basic hashing
> algorithms right in a pgbench script using just bitwise and arithmetic
> operators. But should we probably provide users with some general
> purpose hash function?
The problem I see with hash functions is that it generates collisions,
which is an undesirable effect when dealing with keys as it biases the
initial randomization.
Thus I intend to submit a parametric pseudo-random permutation function,
some day. As I foresaw that it might require some arguing I did not
include the function in the operators and functions collection I
submitted, so as not to slow down the inclusion process. Given that the
patch was submitted over 20 months ago and is still stuck in the CF, that
was a misjudgement:-)
This is no no way a point against including one/some hash functions,
especially if they actually appear in a benchmark.
> The attached patch introduces hash() function which implements FNV-1a as
> an example of such hashing algorithm. There are also couple of images in
> the attachement that I have got from visualizing original zipfian
> distribution and the hashed one. Usage example:
> In psql:
> create table abc as select generate_series(0, 999) as a, 0 as b;
>
> pgbench script:
> \set rnd random_zipfian(0, 1000000, 0.99)
> \set key abs(hash(:rnd)) % 1000
> begin;
> update abc set b = b + 1 where a = :key;
> end;
>
> Any thoughts or suggestions?
As there may be several hash functions included in the long run. I'd
suggest that the hash function should be named more precisely, eg
"hash_fnv1a".
The image looks like the distribution is more regularly scattered than
actually randomized... Maybe this is because the first highest 256 values
are really scattered by the process multiply/modulo process. Or maybe this
is an optical effect?
ISTM that there are undesired utf8 chars in a comment. Should be kept
ASCII.
I would put the actual hash computation in a separate function rather than
inlined in the evaluator.
Add the submission to the next CF?
--
Fabien.
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Langote | 2017-12-20 08:27:20 | Re: [HACKERS] path toward faster partition pruning |
Previous Message | Masahiko Sawada | 2017-12-20 07:06:46 | Re: User defined data types in Logical Replication |