From: Mark Dilger <markdilger(at)yahoo(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>, Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>
Subject: Re: [PATCH] Negative Transition Aggregate Functions (WIP)
Date: 2014-01-10 20:31:31
Message-ID: 1389385891.12124.YahooMailNeo@web125404.mail.ne1.yahoo.com
Lists: pgsql-hackers
FYI, I'm using the verb "rewind" to talk about using the negative transition aggregation function to get a prior value. I don't know if this is the right verb.
Conceptually, when aggregating over floating point numbers, there is some infinitely precise theoretical value, and the computation is approximating it. Call the infinitely precise value 'r'. Call the computed value 'c', which is the result of the aggregation function. (For float4_agg and float8_agg, typeof(c) == float8).
The problem you have with rewinding an aggregate is that you don't know whether you are getting the same value of c that you would have gotten from a rescan. But if you have a state type that tracks a margin [min, max], where typeof(min) == typeof(max) and both have higher precision than typeof(c), then you can track:
min <= r <= max
by setting the rounding mode to round-down when computing the next value of min, and to round-up when computing the next value of max. (Extra flag bits or booleans could track whether you've encountered +inf, -inf, NaN, and any other oddball cases, with corresponding special logic that has been discussed already upthread.) In many, but not all, cases:
min != max
but
(typeof(c))min == (typeof(c))max
because the margin of error is small enough that both bounds render the same value when cast to the lower-precision typeof(c).
You could rewind the aggregation whenever this second case holds, and only force a rescan when it does not. This would render the same results for queries whether they were performed with rewinds or with rescans. The results might differ from older versions of postgres, but only in that they might be more accurate, with less accumulated rounding errors, owing to the higher precision state transition variable.
For many modern platforms, typeof(min) could be __float128 using libquadmath, or something similar to that. If that is not available at compile time, it could be plain float8 instead. Even then, you'd still know that rewinding was possible when min == max and not otherwise, which is useful for aggregation over exact values.
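To make that a bit more concrete, here is a rough C sketch of what such a margin-tracking sum state could look like. The names, the long double state, and the standalone main() are placeholders for illustration only, not the aggregate code itself; a real implementation would also need the compiler to honor rounding-mode changes (FENV_ACCESS and friends).

#include <fenv.h>
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical margin-tracking state for a float8 SUM(). */
typedef struct SumMarginState
{
    long double min;         /* lower bound on the exact sum r */
    long double max;         /* upper bound on the exact sum r */
    bool        saw_special; /* saw +inf, -inf or NaN          */
} SumMarginState;

/* forward transition: fold in one input value */
static void
sum_margin_add(SumMarginState *st, double x)
{
    if (isnan(x) || isinf(x))
    {
        st->saw_special = true;
        return;
    }
    fesetround(FE_DOWNWARD);
    st->min = st->min + (long double) x;   /* rounds toward -inf */
    fesetround(FE_UPWARD);
    st->max = st->max + (long double) x;   /* rounds toward +inf */
    fesetround(FE_TONEAREST);
}

/* inverse transition ("rewind"): take one input value back out */
static void
sum_margin_remove(SumMarginState *st, double x)
{
    fesetround(FE_DOWNWARD);
    st->min = st->min - (long double) x;
    fesetround(FE_UPWARD);
    st->max = st->max - (long double) x;
    fesetround(FE_TONEAREST);
}

/* Safe to return a rewound result only while both interval ends
 * still collapse to the same float8 value. */
static bool
sum_margin_rewind_ok(const SumMarginState *st)
{
    return !st->saw_special && (double) st->min == (double) st->max;
}

int
main(void)
{
    SumMarginState st = {0.0L, 0.0L, false};

    sum_margin_add(&st, 1e20);
    sum_margin_add(&st, 1.0);
    sum_margin_add(&st, 2.0);

    sum_margin_remove(&st, 1e20);   /* frame head moves past the big value */

    /* Whether the rewind is still safe depends on how much extra
     * precision the state type has over float8; when it is not,
     * the aggregate would fall back to a rescan. */
    if (sum_margin_rewind_ok(&st))
        printf("rewind ok, sum = %g\n", (double) st.min);
    else
        printf("margin too wide, rescan needed\n");
    return 0;
}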
I admit I've done a bit of handwaving on the computation of the margin and the handling of floating-point rounding issues, but I believe the implementation details are tractable.
mark
On Friday, January 10, 2014 10:10 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
On Jan10, 2014, at 18:14 , Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> Given that this is already the case with aggregates on floating
> point approximate numbers, why should we rule out an optimization
> which only makes rounding errors more likely to be visible? The
> real issue here is that if you are using an approximate data type
> and expecting exact answers, you will have problems.
Because without the optimization, only the values which you
*actually* process for a given result determine whether you lose
precision or not. With the optimization, OTOH, values which have
*nothing* to do with the result in question can nevertheless
make it completely bogus.
SUM() is a good example. As long as all your values are positive,
the amount of precision you lose is bounded by the number of input
values. If I sum over 10 values, the worst that can happen is that
the first value is large enough to prevent the other 9 values from
influencing the result. That limits the relative error to something
like 9*epsilon, where epsilon is the relative precision of the
floating point type, i.e. 1e-15 or so for double. In other words,
as long as your frames are less than 10^13 rows long, the relative
error will stay below 1%.
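A few lines of C make that bound concrete (the particular values here are made up purely for illustration):

#include <float.h>
#include <stdio.h>

int
main(void)
{
    /* One huge positive value followed by nine small ones.  Each of
     * the small additions is swallowed by rounding, but because all
     * inputs are positive, the total damage is bounded. */
    double  sum = 1e20;
    double  swallowed = 0.0;
    int     i;

    for (i = 0; i < 9; i++)
    {
        sum += 1.0;             /* lost: 1.0 is far below 1e20's ulp */
        swallowed += 1.0;
    }

    /* actual relative error ~ 9e-20, far below the worst case 9*epsilon */
    printf("sum = %.17g  rel. error = %.3g  9*epsilon = %.3g\n",
           sum, swallowed / (1e20 + 9.0), 9 * DBL_EPSILON);
    return 0;
}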
But with the optimization, that is no longer true. If you sum
from, say, CURRENT ROW to UNBOUNDED FOLLOWING, the relative error
of the result in one row now depends on the magnitude of values
*preceding* that row, even though those values aren't in the frame.
And since we now internally subtract, not only add, the relative
error is no longer bounded by the number of rows in the frame.
Here's the corresponding SELECT (which is basically the same
as Tom's example upthread):
select
  n,
  x::float,
  sum(x::float) over (
    order by n rows between current row and unbounded following
  )
from (values
  (1, 1e20),
  (2, 1),
  (3, 2)
) as t(n, x)
order by n;
Currently that returns
n | x | sum
---+-------+-------
1 | 1e+20 | 1e+20
2 | 1 | 3
3 | 2 | 2
but with an inverse transition function, it may very well return
n | x | sum
---+-------+-------
1 | 1e+20 | 1e+20
2 | 1 | 0
3 | 2 | -1
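The arithmetic behind that second table is easy to reproduce in a few lines of C, mimicking one forward transition per input and one inverse transition per row that leaves the frame (this mirrors only the arithmetic, not the actual window aggregate code):

#include <stdio.h>

int
main(void)
{
    double  s = 0.0;

    /* forward transitions: build the frame of row 1 */
    s += 1e20;                  /* s = 1e20 */
    s += 1.0;                   /* still 1e20: the 1 is below its ulp */
    s += 2.0;                   /* still 1e20: the 2 is below its ulp */
    printf("row 1: %g\n", s);   /* 1e+20 */

    /* inverse transitions as the frame head advances */
    s -= 1e20;
    printf("row 2: %g\n", s);   /* 0, not the exact 3 */

    s -= 1.0;
    printf("row 3: %g\n", s);   /* -1, not the exact 2 */

    return 0;
}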
> That's not to say that approximations are useless. If you
> represent the circumference of the earth with a double precision
> number you're dealing with an expected rounding error of about a
> foot. That's close enough for many purposes. The mistake is
> assuming that it will be exact or that rounding errors cannot
> accumulate. In situations where SQL does not promise particular
> ordering of operations, it should not be assumed; so any
> expectations of a specific or repeatable result from a sum or
> average of approximate numbers is misplaced.
But this isn't about ordering, it's replacing one computation
with a completely different one that just happens to be equivalent
*algebraically*.
To me, the proposed optimization for float is akin to a C compiler
which decided to evaluate
a + b + c + … + z
as
-a + (2a - b) + (2b - c) + … + (2y - z) + 2z
Algebraically, these are the same, but it'd still be insane to
do that.
best regards,
Florian Pflug