From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> |
---|---|
To: | Jeroen van Vianen <jeroen(at)design(dot)nl> |
Cc: | pgsql-hackers(at)postgreSQL(dot)org |
Subject: | Re: Implementing STDDEV and VARIANCE |
Date: | 2000-09-30 02:27:04 |
Message-ID: | 200009300227.WAA02615@candle.pha.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Jeroen, not sure if you were involved in this, but standard deviation is
in the current development tree, and will be released in 7.1 in a few
months.
> I'd like to implement stddev and variance aggregates in Postgres. This is a
> long standing TODO item.
>
> There already has been some discussion on implementing this, see
> http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html
>
> There are two definitions for standard deviation and variance:
> _
> population variance = sigma^2 = SUM(X - X)^2 / N
>
> population stddev = sqrt(population variance)
> _
> sample variance = s^2 = SUM(X - X)^2 / (N-1)
>
> sample stddev = sqrt(sample variance)
>
> These statistics can be calculated in one pass when three variables are
> kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only
> two variables are kept.
>
> E.g. avg() is calculated as follows:
>
> sx = 0
> n = 0
> for every row {
> sx = sx + value in row // transition function 1
> n = n+1 // transition function 2
> }
> avg = sum(x) / n
>
> stddev / variance might be calculated as follows:
>
> sx = 0
> n = 0
> sx2 = 0
> for every row {
> sx = sx + value in row // transition function 1
> n = n+1 // transition function 2
> sx2 = sx2 + value in row^2 // transition function 3
> }
> var = (1/n) * (sx2 - (1/n) * sx^2) // Population
>
> or
>
> var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample
>
> and
>
> stddev = sqrt(var)
>
> I've looked through the code and the following things need to be implemented:
>
> 1. Add three columns to pg_aggregate for the additional third transition
> function.
>
> Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
> >All that you need to implement this is room to keep two running
> >sums instead of one. I haven't looked at pgsql's aggregate functions,
> >but I'd hope that the working state can be a struct not just a
> >single number.
>
> I saw no other way than adding another transition function and logic, as
> this might break user-defined aggregates (are there any around?).
>
> 2. Add logic to nodeAgg.c to execute the third transition function and
> finalize function with three rather than two parameters
> 3. Add functions f(a,b) that returns a + b^2 for selected types
> 4. Add four finalize functions to calculate the variance / stddev
> 5. Update the code for create aggregate, to include the definition of the
> third transition function
> 6. Update the documentation
>
> My questions are:
> 1. Is this the correct way to continue? What am I missing? Any errors in my
> reasoning?
> 2. I am proposing the names stddev(x) and variance(x) for population and
> samplestddev(x) and
> samplevariance(x) for sample statistics. Any comments?
> 3. I'm planning to implement this for types float4, float8 and numeric. Any
> other types also? int[2,4,8] don't seem logical, as these would introduce
> serious rounding errors.
>
> Let me know what you think,
>
>
> Jeroen
>
> ************
>
--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
From | Date | Subject | |
---|---|---|---|
Next Message | Bruce Momjian | 2000-09-30 02:28:53 | 7.1 beta schedule |
Previous Message | Bruce Momjian | 2000-09-30 02:18:18 | Re: uniqueness not always correct |