From: | Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: pgbench gaussian/exponential docs improvements |
Date: | 2015-10-25 21:01:37 |
Message-ID: | alpine.DEB.2.10.1510252141040.24734@sto |
Lists: | pgsql-hackers |
> [...]
>
> So either the information is important and then should be placed in the
> docs directly, or it's not and then linking to wikipedia is pointless
> because the users are not interested in learning all the details about
> each distribution function.
What is important is that these distributions can be used from pgbench.
What a gaussian or an exponential distribution is, is *not* important as
such.
For me it is not the point of pg documentation to explain probability
theory, but just to provide *precise* information about what is actually
available, for someone who would be interested, without having to read the
source code. At least that is the idea behind the current documentation.
>>> Firstly, it'd be nice if we could add some figures illustrating the
>>> distributions - much better than explaining the shapes in text. I
>>> don't know if we include figures in the existing docs (probably not),
>>> but generating the figures is rather simple.
>>
>> There are basically no figures in the documentation. Too bad, but it is
>> understandable: what should be the format (svg, jpg, png, ...), should
>> it be generated (gnuplot, others), what is the impact on the
>> documentation build (html, epub, pdf, ...), how portable should it be,
>> what about compressed formats vs git diffs?
>>
>> Once you start asking these questions you understand why there are no
>> figures:-)
>
> I don't see why diffs would be a problem.
I was not only thinking of mathematical figures, I was also thinking of
graphics: some formats may be zip archives containing XML, for instance,
which do not diff well.
>>> Probably nitpicking, but left/right of what? I assume the normal
>>> distribution is placed at 0, so it's left/right of zero.
>>
>> No, it is around the middle of the interval.
>
> You mean [min,max] interval?
Yep.
> I believe the transformation
>
> 2.0 * threshold * (i - min - mu + 0.5) / (max - min + 1)
>
> essentially moves the mean into 0, scales the data to [0,1] and then applies
> the threshold.
Probably:-) I wrote that some time ago, and it is 10 pm for me:-).
> In other words, the general shape of the curve will be exactly the same no
> matter the actual min/max (except that for longer intervals the values will
> be lower, as there are more possible values).
>
> I don't really see how it's related to this?
>
> [(max-min)/2 - threshold, (max-min)/2 + threshold]
The gaussian distribution is defined over reals, but here it is used to
draw integers, so the real values are projected onto integers. The formula
should give the probability of drawing a given integer "i" in the
interval: that is, given min, max and threshold, what is the probability
of drawing i.
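
To make that concrete, here is a rough Python sketch of the computation,
built from the expression quoted above. It is only an illustration, not
the pgbench code: the names phi and gaussian_int_probability are made up,
mu is assumed to be (max - min) / 2, and dividing by
2 * PHI(threshold) - 1 is assumed to be the normalisation that accounts
for cutting the tails at +/- threshold.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (PHI)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gaussian_int_probability(i, lo, hi, threshold):
    """Probability of drawing integer i in [lo, hi] (illustrative sketch)."""
    mu = (hi - lo) / 2.0   # assumption: middle of the interval, relative to lo
    width = hi - lo + 1    # number of integers in the interval

    def scaled(offset):
        # the quoted transformation: 2.0 * threshold * (i - min - mu +/- 0.5) / (max - min + 1)
        return 2.0 * threshold * (i - lo - mu + offset) / width

    # mass of the real-valued gaussian falling on the unit cell around i
    mass = phi(scaled(0.5)) - phi(scaled(-0.5))
    # assumption: renormalise because the tails beyond +/- threshold are cut
    return mass / (2.0 * phi(threshold) - 1.0)


probs = [gaussian_int_probability(i, 1, 10, 4.0) for i in range(1, 11)]
```

Under these assumptions the terms telescope, so the probabilities over
[min, max] sum to one, and the curve is symmetric around the middle of
the interval.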
>>> Could we simplify the equation a bit? It's needlessly difficult to
>>> realize it's actually just CDF(i+0.5) - CDF(i-0.5). I think it'd be
>>> good to first define the CDF and then just use that.
>>
>> ISTM that PHI is *the* normal CDF, which is more or less available as
>> such in various environments (matlab, python, excel...). Well, why not
>> define the particular CDF and use it. Not sure the text would be that
>> much lighter, though.
>
> PHI is the CDF of the normal distribution, not the modified probability
> distribution here (with threshold and scaled to the desired interval).
Yep, that is exactly what I was saying, I think.
>>> This seems broken - too many sentences about the 67% and 95%.
>>
>> The point is to provide rules of thumb to describe how the distribution
>> is shaped. Any better sentence is welcome.
>
> Ah, I misread the sentence initially. I hadn't realized it speaks about
> 1/threshold in the first part, and the second part is an example for
> threshold=4.0. So I thought it's a repetition of the first part.
Maybe it needs spacing, colons and rewording, if it is too hard to parse.
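
For what it is worth, the rule of thumb can be checked numerically. The
sketch below is an assumption-laden illustration, not taken from pgbench:
it treats the distribution as a standard gaussian cut at +/- threshold
and stretched over the interval, so that the middle k / threshold of the
interval corresponds to +/- k standard deviations. The helper name
middle_fraction is hypothetical.

```python
import math


def phi(x):
    """CDF of the standard normal distribution (PHI)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def middle_fraction(k, threshold):
    """Share of draws falling in the middle k / threshold of the interval.

    Assumes a standard gaussian truncated at +/- threshold; ignores the
    rounding to integers.
    """
    return (2.0 * phi(k) - 1.0) / (2.0 * phi(threshold) - 1.0)


# threshold = 4.0: middle quarter (1.0 / 4.0) and middle half (2.0 / 4.0)
quarter = middle_fraction(1.0, 4.0)
half = middle_fraction(2.0, 4.0)
```

With threshold = 4.0 this gives a bit more than two thirds of the draws
in the middle quarter and about 95% in the middle half, which is what the
sentence is trying to say.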
>>> Does it make sense to explicitly mention the implementation detail
>>> (Box-Muller transform) here?
>
> No, my point was exactly the opposite - removing the mention of Box-Muller
> entirely, not adding more details about it.
Ok. I think that the fact that it relies on the Box-Muller transform is
relevant, because there are other methods to generate a gaussian
distribution, and I would say that there is no reason to have to go to the
source code to check that. But I would not provide further details. So I'm
fine with the current status.
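
For readers who do not know the method, a minimal Python sketch of a
Box-Muller based draw could look like the following. This is an
illustration under assumptions, not the pgbench source: the function name
gaussian_int is made up, and rejecting values outside +/- threshold and
then scaling linearly onto [min, max] is one plausible way to do the
projection.

```python
import math
import random


def gaussian_int(rng, lo, hi, threshold):
    """Draw an integer in [lo, hi] from a truncated gaussian (sketch)."""
    while True:
        u1 = 1.0 - rng.random()   # in (0, 1], so log(u1) is defined
        u2 = rng.random()
        # Box-Muller transform: two uniforms -> one standard normal value
        z = math.sqrt(-2.0 * math.log(u1)) * math.sin(2.0 * math.pi * u2)
        # assumption: reject values outside [-threshold, threshold)
        if -threshold <= z < threshold:
            break
    # map [-threshold, threshold) onto [0, 1), then onto the integers lo..hi
    r = (z + threshold) / (2.0 * threshold)
    return lo + int((hi - lo + 1) * r)


rng = random.Random(12345)
samples = [gaussian_int(rng, 1, 100, 4.0) for _ in range(10000)]
mean = sum(samples) / len(samples)
```

Other generation methods would give the same distribution but consume
random numbers differently, which is why naming the method seems
worthwhile.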
--
Fabien.