Re: gaussian distribution pgbench

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Mitsumasa KONDO <kondo(dot)mitsumasa(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gaussian distribution pgbench
Date: 2014-07-17 20:13:24
Message-ID: alpine.DEB.2.10.1407172152170.3763@sto
Lists: pgsql-hackers


>> However, ISTM that it is not the purpose of pgbench documentation to be a
>> primer about what is an exponential or gaussian distribution, so the idea
>> would yet be to have a relatively compact explanation, and that the
>> interested but clueless reader would document him/herself from wikipedia or a
>> text book or a friend or a math teacher (who could be a friend as well:-).
>
> Well, I think it's a balance. I agree that the pgbench documentation
> shouldn't try to substitute for a text book or a math teacher, but I
> also think that you shouldn't necessarily need to refer to a text book
> or a math teacher in order to figure out how to use pgbench. Saying
> "it's complicated, so we don't have to explain it" would be a cop out;
> we need to *make* it simple. And if there's no way to do that, then
> IMHO we should reject the patch in favor of some future patch that
> implements something that will be easy for users to understand.
>
>>>> [nttcom(at)localhost postgresql]$ contrib/pgbench/pgbench --exponential=10
>>>> starting vacuum...end.
>>>> transaction type: Exponential distribution TPC-B (sort of)
>>>> scaling factor: 1
>>>> exponential threshold: 10.00000
>>>>
>>>> decile percents: 63.2% 23.3% 8.6% 3.1% 1.2% 0.4% 0.2% 0.1% 0.0% 0.0%
>>>> highest/lowest percent of the range: 9.5% 0.0%
>>>
>>> I don't have a clue what that means. None.
>>
>> Maybe we could add in front of the decile/percent
>>
>> "distribution of increasing account key values selected by pgbench:"
>
> I still wouldn't know what that meant. And it misses the point
> anyway: if the documentation is good, this will be unnecessary. If
> the documentation is bad, a printout that tries to illustrate it by
> example is not an acceptable substitute.

The decile description is quite classic when discussing statistics.

>>> Here is an example of an explanation that would make sense to me.
>>> This is not the actual behavior of your patch, I'm quite sure, so this
>>> is just an example of the *kind* of explanation that I think is
>>> needed:
>>
>> This is more or less the approximate behavior of the patch, but for 1% of
>> the range, not 50%. However I'm not sure that the current documentation is
>> so bad.
>
> I think it isn't, because in the system I described, a larger value
> indicates a flatter distribution, but in the documentation, a smaller
> value indicates a flatter distribution.

Ok. But the general thrust was ok.

> That having been said, I agree the current documentation for the
> exponential distribution is not too bad. But this part does not make
> sense:
>
> + A crude approximation of the distribution is that the most frequent 1%
> + values are drawn <replaceable>threshold</>% of the time.

I'm trying to be nice to the reader by providing some intuitive
information. I do not seem to succeed:-) I'm attempting to say that when
you draw from a range, say 1 to 1000, the first 1%, i.e. values 1 to 10,
are drawn about "threshold"% of the time.

If I draw from one hundred values:

\setrandom x 1 100 exponential 10.0

The 1 will be drawn about 10% of the time, and the 99 next values will
share the remaining 90%.
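
[Editorial sketch, not part of the original message.] The figures above can
be checked numerically. The sketch below assumes that the exponential option
behaves like an exponential density proportional to exp(-threshold * u),
truncated to the unit range [0, 1) and then scaled to the requested interval.
That model is an assumption made for illustration, and the helper name
exponential_share is hypothetical, not something taken from the patch. Under
that assumption it prints the share of draws per decile and the share taken
by the first 1% of the range, which can be compared with the "decile
percents" and "highest" lines of the pgbench output quoted earlier.

```python
import math


def exponential_share(lo, hi, threshold):
    """Fraction of draws falling in [lo, hi) of the unit range.

    Assumes (for illustration only) a density proportional to
    exp(-threshold * u), truncated to 0 <= u < 1.
    """
    norm = 1.0 - math.exp(-threshold)  # total mass before normalization
    return (math.exp(-threshold * lo) - math.exp(-threshold * hi)) / norm


threshold = 10.0

# Share of draws in each tenth of the range, lowest keys first.
deciles = [exponential_share(i / 10, (i + 1) / 10, threshold) for i in range(10)]
print("decile percents:", " ".join(f"{100 * d:.1f}%" for d in deciles))

# Share of draws in the first 1% of the range,
# e.g. the value 1 when drawing from 1 to 100.
first_percent = exponential_share(0.0, 0.01, threshold)
print(f"first 1% of the range: {100 * first_percent:.1f}%")
```

With threshold 10.0, the share of the first 1% comes out slightly under 10%,
which is why the documentation sentence calls the rule a crude approximation.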

> + The closer to 0.0 the threshold, the flatter (more uniform) the access
> + distribution.
>
> Given the first statement, I'd expect the lowest possible threshold to
> be 0.01, not 0.

This is in the sense of "epsilon", a small number close to 0 but different
from 0. The lowest possible threshold is the smallest strictly positive
value representable as a "double".
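
[Editorial sketch, not part of the original message.] The "flatter as the
threshold approaches 0" claim can be made concrete under one assumption: that
the draw follows an exponential density truncated to the unit range, with
threshold t > 0 and cumulative distribution F_t. The symbols t, x and F_t are
introduced here for illustration only. Expanding the exponentials to first
order shows that the distribution tends to the uniform one:

```latex
F_t(x) = \frac{1 - e^{-t x}}{1 - e^{-t}}, \qquad 0 \le x \le 1,\; t > 0

\lim_{t \to 0^+} F_t(x)
  = \lim_{t \to 0^+} \frac{t x + O(t^2)}{t + O(t^2)}
  = x
```

The limit F(x) = x is the cumulative distribution of the uniform law on
[0, 1], so a tiny but strictly positive threshold gives a nearly flat access
pattern, while t = 0 itself makes the ratio undefined.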

> The documentation for the Gaussian distribution is in somewhat worse
> shape. Unlike the documentation for exponential, it makes no attempt
> at all to give the user a clear idea what the distribution actually
> looks like. The closest it comes is this:
>
> + In other worlds, the larger the <replaceable>threshold</>,
> + the narrower the access range around the middle.
>
> But that's not really very close - there's no way for a user to judge
> what impact the threshold parameter actually has except to try it.
> Unlike the discussion of exponential, which contains a fairly-precise
> mathematical characterization of the behavior,

I have now added a precise formula for Gaussian. Even when you see the
formula, you may still want to see the deciles to get an intuition.

I think that we assumed that the reader would know that a Gaussian
distribution is the classic bell-shaped distribution, and if not, s/he
would not be interested anyway.

> the Gaussian stuff has
> nothing except a hand-wavy explanation that a higher threshold skews
> the distribution more. (Also, the English expression is "in other
> words" not "in other worlds" - but in fact the phrase has no business
> in that sentence at all, because it is not reiterating the contents of
> the previous sentence in different language, but rather making a new
> point entirely. And the following sentence does not start with a
> capital letter, though maybe that's because it was intended to be
> incorporated into this sentence somehow.)
>
> I think that you also need to consider which instances of the words
> "gaussian" and "exponential" are referring to the option and which are
> referring to the abstract mathematical concept. When you're talking
> about the option, you should use all lower-case (as you've done) but
> with <literal> tags or similar. When you're referring to the
> mathematical distribution, Gaussian should be capitalized.
>
> BTW, I agree with both Heikki's suggestion that we make these options
> to setrandom only and not expose command-line options for them, and
> with Andres's critique that the documentation of those options is far
> too repetitive.

I'll have yet another go at trying to improve the documentation, especially
the gaussian part. However you must allow that this is mathematics, and the
user who wants to use these distributions will be expected to understand
what they are somehow beforehand.

Moreover, I cannot make it precise, intuitive and very short.

--
Fabien.
