pgbench gaussian/exponential docs improvements

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: pgbench gaussian/exponential docs improvements
Date: 2015-10-25 18:12:27
Message-ID: 562D1B8B.90500@2ndquadrant.com
Lists: pgsql-hackers

Hi,

I've been looking at the checkpoint patches (sorting, flush and FPW
compensation) and realized we now have gaussian/exponential distributions
in pgbench, which are useful for simulating simple non-uniform workloads.

But I think the current docs are a bit too difficult to understand for
people without deep insight into statistics and the shapes of probability
distributions.

Firstly, it'd be nice if we could add some figures illustrating the
distributions - much better than explaining the shapes in text. I don't
know if we include figures in the existing docs (probably not), but
generating the figures is rather simple.

A few more comments:

> By default, or when uniform is specified, all values in the range are
> drawn with equal probability. Specifying gaussian or exponential
> options modifies this behavior; each requires a mandatory threshold
> which determines the precise shape of the distribution.

I find the 'threshold' name to be rather unfortunate, as none of the
probability distribution functions that I know use this term. And even
if there's one probability function that uses 'threshold' it has very
little meaning in the others. For example the exponential distribution
uses 'rate' (lambda). I'd prefer a neutral name (e.g. 'parameter').

> For a Gaussian distribution, the interval is mapped onto a standard
> normal distribution (the classical bell-shaped Gaussian curve)
> truncated at -threshold on the left and +threshold on the right.

Probably nitpicking, but left/right of what? I assume the normal
distribution is placed at 0, so it's left/right of zero.

> To be precise, if PHI(x) is the cumulative distribution function of
> the standard normal distribution, with mean mu defined as (max + min)
> / 2.0, then value i between min and max inclusive is drawn with
> probability: (PHI(2.0 * threshold * (i - min - mu + 0.5) / (max -
> min + 1)) - PHI(2.0 * threshold * (i - min - mu - 0.5) / (max - min +
> 1))) / (2.0 * PHI(threshold) - 1.0). Intuitively, the larger the
> threshold, the more frequently values close to the middle of the
> interval are drawn, and the less frequently values close to the min
> and max bounds.

Could we simplify the equation a bit? It's needlessly difficult to
realize it's actually just CDF(i+0.5) - CDF(i-0.5). I think it'd be good
to first define the CDF and then just use that.
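
To illustrate the "define the CDF first" idea, here is a minimal Python
sketch. The names phi, scaled_cdf and gaussian_prob are hypothetical and
chosen only for this example. One assumption: the argument is centered as
(x - mu) with mu = (max + min) / 2.0, which coincides with the quoted
expression when min is 0 and makes the probabilities sum to 1 for any min.

```python
import math

def phi(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def scaled_cdf(x, lo, hi, param):
    """Normal CDF rescaled so that [lo - 0.5, hi + 0.5] maps onto
    [-param, +param], normalized by the mass inside the truncation.
    Assumption: centered on mu = (lo + hi) / 2."""
    mu = (lo + hi) / 2.0
    return phi(2.0 * param * (x - mu) / (hi - lo + 1)) / (2.0 * phi(param) - 1.0)

def gaussian_prob(i, lo, hi, param):
    """Probability of drawing integer i: CDF(i + 0.5) - CDF(i - 0.5)."""
    return scaled_cdf(i + 0.5, lo, hi, param) - scaled_cdf(i - 0.5, lo, hi, param)

# Probabilities for every value in 1..100 with parameter 4.0
probs = [gaussian_prob(i, 1, 100, 4.0) for i in range(1, 101)]
```

With the CDF factored out, the per-value probability is a one-line
difference, and summing probs over the whole interval gives 1 (up to
floating-point error).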

> About 67% of values are drawn from the middle 1.0 / threshold and 95%
> in the middle 2.0 / threshold; for instance, if threshold is 4.0, 67%
> of values are drawn from the middle quarter and 95% from the middle
> half of the interval.

This seems broken - too many sentences about the 67% and 95%.
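
The numbers in the quoted paragraph can be checked with a short sketch.
It uses a continuous approximation (ignoring the rounding to integers),
and mass_in_middle is a hypothetical helper name. Since the whole interval
maps onto [-threshold, +threshold], the middle fraction q of the interval
maps onto [-q * threshold, +q * threshold].

```python
import math

def phi(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mass_in_middle(fraction, param):
    """Share of draws falling in the middle `fraction` of the interval,
    for a standard normal truncated at -param and +param."""
    half_width = fraction * param
    return (2.0 * phi(half_width) - 1.0) / (2.0 * phi(param) - 1.0)

quarter = mass_in_middle(1.0 / 4.0, 4.0)  # middle quarter, roughly 0.683
half = mass_in_middle(2.0 / 4.0, 4.0)     # middle half, roughly 0.955
```

Under these assumptions the middle 1.0 / threshold of the interval
corresponds to one standard deviation either side of the mean, so the
share is closer to 68% than 67%; the middle 2.0 / threshold gives about
95%.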

> The minimum threshold is 2.0 for performance of the Box-Muller
> transform.

Does it make sense to explicitly mention the implementation detail
(Box-Muller transform) here?
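
For readers unfamiliar with the term, here is a rough sketch of how a
Box-Muller draw combined with rejection could produce a truncated Gaussian
integer. This is an illustration under that assumption, not code taken
from pgbench; truncated_gaussian_int is a hypothetical name. It shows why
a small threshold could hurt performance: the smaller the threshold, the
more draws fall outside [-threshold, +threshold) and have to be retried.

```python
import math
import random

def truncated_gaussian_int(lo, hi, param, rng):
    """Draw an integer in [lo, hi] from a standard normal truncated at
    -param and +param, using Box-Muller plus rejection (illustrative)."""
    while True:
        u1 = 1.0 - rng.random()  # in (0, 1], avoids log(0)
        u2 = rng.random()
        # Box-Muller: turn two uniform draws into one standard normal draw
        z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        if -param <= z < param:
            break  # accept; otherwise retry with fresh uniforms
    frac = (z + param) / (2.0 * param)  # position within the interval
    # Clamp guards against frac rounding up to 1.0 in floating point
    return min(hi, lo + int((hi - lo + 1) * frac))

rng = random.Random(0)
draws = [truncated_gaussian_int(1, 100, 4.0, rng) for _ in range(10000)]
```

In this sketch a threshold of 2.0 would reject roughly 4.5% of draws (the
mass of a normal distribution beyond two standard deviations), and the
rejection rate grows quickly below that.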

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
