Re: [WIP] Zipfian distribution in pgbench

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Alik Khilazhev <a(dot)khilazhev(at)postgrespro(dot)ru>
Cc: PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Zipfian distribution in pgbench
Date: 2017-08-24 02:39:03
Message-ID: alpine.DEB.2.20.1708240433560.24526@lancre
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello Alik,

> I am attaching patch v7.

Patch generates multiple warnings with "git apply", apparently because of
end-of-line spaces, and fails:

pgbench-zipf-07v.patch:52: trailing whitespace.
{
pgbench-zipf-07v.patch:53: trailing whitespace.
"random_zipfian", 3, PGBENCH_RANDOM_ZIPFIAN
pgbench-zipf-07v.patch:54: trailing whitespace.
},
pgbench-zipf-07v.patch:66: trailing whitespace.
#define ZIPF_CACHE_SIZE 15 /* cache cells number */
pgbench-zipf-07v.patch:67: trailing whitespace.

error: patch failed: doc/src/sgml/ref/pgbench.sgml:1013
error: doc/src/sgml/ref/pgbench.sgml: patch does not apply
error: patch failed: src/bin/pgbench/exprparse.y:191
error: src/bin/pgbench/exprparse.y: patch does not apply
error: patch failed: src/bin/pgbench/pgbench.c:93
error: src/bin/pgbench/pgbench.c: patch does not apply
error: patch failed: src/bin/pgbench/pgbench.h:75
error: src/bin/pgbench/pgbench.h: patch does not apply

It also complains with "patch", although it seems to work:

(Stripping trailing CRs from patch; use --binary to disable.)
patching file doc/src/sgml/ref/pgbench.sgml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/bin/pgbench/exprparse.y
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/bin/pgbench/pgbench.c
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/bin/pgbench/pgbench.h
patch unexpectedly ends in middle of line
patch unexpectedly ends in middle of line

Please do not put spaces at the end of lines, eg there is a lonely tab on
the second line of "computeHarmonicZipfian".

It seems that "CR" characters are used:

sh> sha1sum ~/pgbench-zipf-07v.patch
439ad1de0a960b3b415ff6de9412b3cc4d6922e7

sh> file ~/pgbench-zipf-07v.patch
pgbench-zipf-07v.patch: unified diff output, ASCII text, with CRLF line terminators

Compilation issues one warning:

pgbench.c: In function ‘computeHarmonicZipfian’:
pgbench.c:894:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
double uniform = pg_erand48(thread->random_state);

Please conform to pg standard for portability, even if it is a very old one:-)

About the documentation:

I would suggest to replace 0.5 in the first example by 1.5, as a zipfian
distribution with a lower-than-one parameter is not that well defined.

I'm fine with using references to papers or books for the algorithm.

The documentation lines in the sgml file are too long. At least
limit text paragraphs to maybe 80 columns as done elsewhere.

typo: "<entry> Zipfian" (remove space)

typo: "used(described" (missing space)

typo: "numbers(algorithm" (again)

The sentence "It's recommended to pick <replaceable>parameter</> in range
(0, 1) for better accuracy." does not make sense with the updated
generation algorithm.

The text would benefit from a review by a native English speaker, who I am
not. Anyway here is an attempt at improving/simplifying the text (I have
removed sgml formatting). If some native English speaker could review it,
that would be great.

"""
"random_zipfian" generates an approximated bounded zipfian distribution.
For parameter in (0,1), an approximated algorithm is taken from
"Quickly Generating Billion-Record Synthetic Databases", Jim Gray et al, SIGMOD 1994.
For parameter in (1, 1000), a rejection method is used, based on
"Non-Uniform Random Variate Generation", Luc Devroye, p. 550-551, Springer 1986.
The distribution is not defined when the parameter's value is 1.0.
The drawing performance is poor for parameter values close and above 1.0
and on a small range.

"parameter" defines how skewed the distribution is.
The larger the "parameter", the more frequently values close to the beginning
of the interval are drawn.
The closer to 0 "parameter" is, the flatter (more uniform) the access distribution.
"""

English in comments and code:

"inited" is not English, I think you want "initialized". Maybe "nb_cells"
would do.

"it is not exist" (multiple instances) -> "it does not exist".

I think that the references to the paper/book should appear as comments in
the code, for instance in front of the functions which implement the
algorithm.

"current_time", "last_touch_time" are counter based, they are not "time".
I suggest to update the name to "current" and "last_touch" or "last_used".

"time when cell was used last time" -> "last used logical time"

About the code:

I would prefer that you use "1.0" instead of "1." for double constants.

I think that getZipfianRand should be symmetric. Maybe a direct formula
could do:

return min - 1 + (s > 1) ? xxx() : yyy();

or at least use an explicit "else" for symmetry.

Function "zipfn" asks for s and n as arguments, but ISTM that they are
already include as fields in cell, so this seems redundant. Moreover I do
not think that the zipfn function is that useful. I would suggest to
inline it in its caller.

"zipfGeneralizedHarmonic" is not really related to zipf, it is just used
there. I suggest to call it "generalizedHarmonicNumber" to reflect what it
does.

I suggest to put all LRU management in the getCell* function, that is to
move the last_use update from computeHarmonicZipfian to getCell.

computeHarminicZipfian and computeIterativeZipfian should have the same
signature, not one with a random state and the other with the thread. I
suggest to do the same as other random functions, whatever it is.

Otherwise, no action required:

The patch probably conflicst with other patches in the CF, which means
that there will be rebase needed. That is life. The queue is long and
slow.

Also, this function deserve some tap tests, that should be added if the
"pgbench tap test" patch get through. Otherwise it is as well tested as
everything else around, which is basically no test.

--
Fabien.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2017-08-24 02:43:41 Re: [PATCH] Push limit to sort through a subquery
Previous Message Fabien COELHO 2017-08-24 02:29:26 Re: pgbench tap tests & minor fixes