Re: [PATCH] Negative Transition Aggregate Functions (WIP)

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Stark <stark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>
Subject: Re: [PATCH] Negative Transition Aggregate Functions (WIP)
Date: 2014-01-10 18:08:24
Message-ID: 04CADB57-1C17-4AB0-AAF3-53971664DFAC@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Jan10, 2014, at 18:14 , Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> Given that this is already the case with aggregates on floating
> point approximate numbers, why should we rule out an optimization
> which only makes rounding errors more likely to be visible? The
> real issue here is that if you are using an approximate data type
> and expecting exact answers, you will have problems.

Because without the optimization, only the values which you
*actually* process for a given result determine whether you lose
precision or not. With the optimization, OTOH, values which have
*nothing* to do with the result in question can nevertheless
make it completely bogus.

SUM() is a good example. As long as all your values are positive,
the amount of precision you lose is bound by the number of input
values. If I sum over 10 values, the worst that can happen is that
the first values is large enough to prevent the other 9 values from
influencing the result. That limits the relative error to something
like 9*epsilon, where epsilon is the relative precision of the
floating point type, i.e. 1e-15 or so for double. In other words,
as long as your frames are less than 10e13 rows long, the relative
error will stay below 1%.

But with the optimization, that is no longer true. If you sum
from, say, CURRENT ROW to UNBOUNDED FOLLOWING, the relative error
of the result in one row now depends on the magnitude of values
*preceding* that row, even though that value isn't in the frame.
And since we now internally subtract, not only add, the relative
error is no longer bound by the number of rows in the frame.

Here's the corresponding SELECT (which is basically the same
as Tom's example upthread):
select
n,
x::float,
sum(x::float) over (
order by n rows between current row and unbounded following
)
from (values
(1, 1e20),
(2, 1),
(3, 2)
) as t(n, x)
order by n;

Currently that returns
n | x | sum
---+-------+-------
1 | 1e+20 | 1e+20
2 | 1 | 3
3 | 2 | 2

but with an inverse transfer function, it may very well return
n | x | sum
---+-------+-------
1 | 1e+20 | 1e+20
2 | 1 | 0
3 | 2 | -1

> That's not to say that approximations are useless. If you
> represent the circumference of the earth with a double precision
> number you're dealing with an expected rounding error of about a
> foot. That's close enough for many purposes. The mistake is
> assuming that it will be exact or that rounding errors cannot
> accumulate. In situations where SQL does not promise particular
> ordering of operations, it should not be assumed; so any
> expectations of a specific or repeatable result from a sum or
> average of approximate numbers is misplaced.

But this isn't about ordering, it's replacing one computation
with a completely different one that just happens to be equivalent
*algebraically*.

To me, the proposed optimization for float is akin to C compiler
which decided to evaluate
a + b + c + … z
as
-a + (2a - b) + (2b - c) + … + (2y - z) + 2z
Algebraically, these are the same, but it'd still be insane to
do that.

best regards,
Florian Pflug

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-01-10 18:11:32 Re: dynamic shared memory and locks
Previous Message Tom Lane 2014-01-10 18:08:01 Re: [PATCH] Negative Transition Aggregate Functions (WIP)