Re: pgbench gaussian/exponential docs improvements

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: pgbench gaussian/exponential docs improvements
Date: 2015-10-25 19:11:50
Message-ID: alpine.DEB.2.11.1510251943520.12900@eriador
Lists: pgsql-hackers


Hello Tomas,

> I've been looking at the checkpoint patches (sorting, flush and FPW
> compensation) and realized we got gaussian/exponential distributions in
> pgbench which is useful for simulating simple non-uniform workloads.

Indeed.

> But I think the current docs is a bit too difficult to understand for
> people without deep insight into statistics and shapes of probability
> distributions.

I think the idea is that (1) if you do not know anything about
distributions, you probably do not want expo/gauss, and (2) the pg
documentation should not try to be an introductory course in probability
theory.

AFAICR I suggested pointing to the relevant Wikipedia pages, but this was
more or less rejected, so it ended up as it is, which is indeed pretty
unconvincing.

> Firstly, it'd be nice if we could add some figures illustrating the
> distributions - much better than explaining the shapes in text. I don't
> know if we include figures in the existing docs (probably not), but
> generating the figures is rather simple.

There are basically no figures in the documentation. Too bad, but it is
understandable: what should be the format (svg, jpg, png, ...), should it
be generated (gnuplot, others), what is the impact on the documentation
build (html, epub, pdf, ...), how portable should it be, what about
compressed formats vs git diffs?

Once you start asking these questions you understand why there are no
figures:-)

> A few more comments:
>
>> By default, or when uniform is specified, all values in the range are
>> drawn with equal probability. Specifying gaussian or exponential
>> options modifies this behavior; each requires a mandatory threshold
>> which determines the precise shape of the distribution.
>
> I find the 'threshold' name to be rather unfortunate, as none of the
> probability distribution functions that I know use this term.

I think that it was proposed for gaussian, not sure why.

> And even if there's one probability function that uses 'threshold' it
> has very little meaning in the others. For example the exponential
> distribution uses 'rate' (lambda). I'd prefer a neutral name (e.g.
> 'parameter').

Why not. Many places to fix, though (documentation & source code).

>> For a Gaussian distribution, the interval is mapped onto a standard
>> normal distribution (the classical bell-shaped Gaussian curve)
>> truncated at -threshold on the left and +threshold on the right.
>
> Probably nitpicking, but left/right of what? I assume the normal
> distribution is placed at 0, so it's left/right of zero.

No, it is around the middle of the interval.

>> To be precise, if PHI(x) is the cumulative distribution function of
>> the standard normal distribution, with mean mu defined as (max + min)
>> / 2.0, then value i between min and max inclusive is drawn with
>> probability: (PHI(2.0 * threshold * (i - min - mu + 0.5) / (max -
>> min + 1)) - PHI(2.0 * threshold * (i - min - mu - 0.5) / (max - min +
>> 1))) / (2.0 * PHI(threshold) - 1.0). Intuitively, the larger the
>> threshold, the more frequently values close to the middle of the
>> interval are drawn, and the less frequently values close to the min
>> and max bounds.
>
> Could we simplify the equation a bit? It's needlessly difficult to realize
> it's actually just CDF(i+0.5) - CDF(i-0.5). I think it'd be good to first
> define the CDF and then just use that.

ISTM that PHI is *the* normal CDF, which is more or less available as such
in various environments (matlab, python, excel...). Well, why not define
the particular CDF and use it. Not sure the text would be that much
lighter, though.
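
For illustration, here is a minimal Python sketch of that rewrite: define
the truncated CDF once, then take CDF(i + 0.5) - CDF(i - 0.5). The names
phi, truncated_cdf and gaussian_prob are made up for this example, and the
sketch centres on mu with (x - mu), which agrees with the quoted formula
when min is 0.

```python
import math


def phi(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def truncated_cdf(x, lo, hi, threshold):
    """Scaled normal CDF over [lo, hi], normalised by the mass kept
    between -threshold and +threshold (hypothetical helper)."""
    mu = (lo + hi) / 2.0  # middle of the interval
    scaled = 2.0 * threshold * (x - mu) / (hi - lo + 1)
    return phi(scaled) / (2.0 * phi(threshold) - 1.0)


def gaussian_prob(i, lo, hi, threshold):
    """Probability of drawing the integer i: CDF(i + 0.5) - CDF(i - 0.5)."""
    return (truncated_cdf(i + 0.5, lo, hi, threshold)
            - truncated_cdf(i - 0.5, lo, hi, threshold))


if __name__ == "__main__":
    probs = [gaussian_prob(i, 1, 100, 4.0) for i in range(1, 101)]
    print(sum(probs))            # should be very close to 1.0
    print(probs[0], probs[49])   # edge value vs. value near the middle
```

The probabilities telescope, so their sum over the whole interval is the
normalised mass between -threshold and +threshold, i.e. 1.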

>> About 67% of values are drawn from the middle 1.0 / threshold and 95%
>> in the middle 2.0 / threshold; for instance, if threshold is 4.0, 67%
>> of values are drawn from the middle quarter and 95% from the middle
>> half of the interval.
>
> This seems broken - too many sentences about the 67% and 95%.

The point is to provide rules of thumb to describe how the distribution is
shaped. Any better sentence is welcome.
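
The rule of thumb can be checked numerically. In the continuous
approximation, the middle k / threshold of the interval spans k standard
deviations on either side of the centre, so its share of the draws is
(2 PHI(k) - 1) / (2 PHI(threshold) - 1). A small sketch (the function names
are mine, not pgbench's):

```python
import math


def phi(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def middle_share(k, threshold):
    """Share of draws falling in the middle k / threshold of the interval,
    ignoring the rounding to integers (continuous approximation)."""
    if not 0.0 <= k <= threshold:
        raise ValueError("k must lie between 0 and threshold")
    return (2.0 * phi(k) - 1.0) / (2.0 * phi(threshold) - 1.0)


if __name__ == "__main__":
    threshold = 4.0
    print(middle_share(1.0, threshold))  # middle quarter of the interval
    print(middle_share(2.0, threshold))  # middle half of the interval
```

With threshold 4.0 this gives roughly 0.68 and 0.95, so the 67% above is a
slightly low rounding of about 68%.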

>> The minimum threshold is 2.0 for performance of the Box-Muller
>> transform.
>
> Does it make sense to explicitly mention the implementation detail
> (Box-Muller transform) here?

It is too complex; I would avoid it. I would point to the Wikipedia page
if that were allowed.

https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform
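
To give an idea of why the threshold matters for performance, here is a
rough Python sketch, not the actual pgbench code. It assumes that a
Box-Muller draw falling outside [-threshold, threshold) is simply rejected
and redrawn, and that the accepted value is then mapped linearly onto
min..max. The function name and the final clamp are my own additions.

```python
import math
import random


def gaussian_draw(lo, hi, threshold, rng=random):
    """Draw an integer in [lo, hi] from a truncated Gaussian
    (illustrative sketch, assuming rejection of out-of-range draws)."""
    while True:
        u1 = 1.0 - rng.random()  # in (0, 1], so log(u1) is defined
        u2 = rng.random()
        # Box-Muller: turn two uniform draws into one standard normal draw.
        z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        if -threshold <= z < threshold:
            break  # accepted; otherwise loop and redraw
    frac = (z + threshold) / (2.0 * threshold)  # nominally in [0, 1)
    # Clamp guards against frac rounding up to exactly 1.0.
    return min(hi, lo + int((hi - lo + 1) * frac))


if __name__ == "__main__":
    random.seed(0)
    sample = [gaussian_draw(1, 100, 4.0) for _ in range(10000)]
    print(min(sample), max(sample), sum(sample) / len(sample))
```

Under that assumption the acceptance rate is 2 PHI(threshold) - 1: about
95% of draws are kept at threshold 2.0, and the loop repeats more and more
often as the threshold gets smaller.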

--
Fabien.
