From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, "amit(dot)kapila(at)enterprisedb(dot)com" <amit(dot)kapila(at)enterprisedb(dot)com>, Simon Riggs <simon(dot)riggs(at)2ndquadrant(dot)com>
Subject: Re: The Future of Aggregation
Date: 2015-06-10 13:39:28
Message-ID: 1986970674.211372.1433943568062.JavaMail.yahoo@mail.yahoo.com
Lists: pgsql-hackers
David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
> On 10 June 2015 at 02:52, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
>> David Rowley <david(dot)rowley(at)2ndquadrant(dot)com> wrote:
>>> The idea I discussed in the link in item 5 above gets around this
>>> problem, but it's a perhaps more surprise filled implementation
>>> as it will mean "select avg(x),sum(x),count(x) from t" is
>>> actually faster than "select sum(x),count(x) from t" as the agg
>>> state for avg() will satisfy sum and count too.
>>
>> I'm skeptical that it will be noticeably faster. It's easy to see
>> why this optimization will make a query *with all three* faster,
>> but I would expect the process of accumulating the sum and
>> count to be about the same speed whether performed by one
>> transition function or two. Of course I could be persuaded by a
>> benchmark showing otherwise.
Of course, after reading Tom's post and digging into what
aggregates share a transition function, I was already prepared to
eat my words above. Since the sum() aggregate is using the
xxx_avg_accum transition function, it is clearly doing the work of
accumulating the count already, so it's clear that the above can
be a win.
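For reference, a quick catalog query (against pg_aggregate and
pg_proc; the exact rows it shows will vary by version and argument
type) makes it easy to see which aggregates share a transition
function:

SELECT p.proname,
       pg_get_function_identity_arguments(a.aggfnoid) AS args,
       a.aggtransfn, a.aggfinalfn
FROM pg_aggregate a
JOIN pg_proc p ON p.oid = a.aggfnoid
WHERE p.proname IN ('sum', 'avg', 'count')
ORDER BY p.proname, args;

Aggregates that show the same aggtransfn are the ones that could
share state under the proposal.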
> Assuming that if we reuse the avg(x) state for count(x) and
> sum(x) then it will perform almost exactly like a query
> containing just avg(x), the only additional overhead is the call
> to the final functions per group, so in the following case that's
> likely immeasurable:
>
> /* setup */ create table millionrowtable as select
> generate_series(1,1000000)::numeric as x;
> /* test 1 */ SELECT sum(x) / count(x) from millionrowtable;
> /* test 2 */ SELECT avg(x) from millionrowtable;
>
> Test 1:
> 274.979 ms
> 272.104 ms
> 269.915 ms
>
> Test 2:
> 229.619 ms
> 220.703 ms
> 234.743 ms
>
> (About 19% slower)
Of course, with Tom's approach you would see the benefit; the two
statements should run at about the same speed.
I am a little curious what sort of machine you're running on,
because my i7 is much slower. I ran a few other tests with your
table for perspective.
To get the raw time to just pass the tuples:
SELECT from millionrowtable where xmin = '0';
Time: 125.340 ms
Time: 124.443 ms
Time: 115.629 ms
Just the count(*) of those rows didn't boost the time much:
SELECT count(*) from millionrowtable;
Time: 132.128 ms
Time: 128.036 ms
Time: 125.400 ms
The NULL check added by specifying count(x) boosted it more:
SELECT count(x) from millionrowtable;
Time: 165.858 ms
Time: 163.872 ms
Time: 165.448 ms
A NULL check plus numeric addition gets expensive:
SELECT sum(x) from millionrowtable;
Time: 366.879 ms
Time: 364.503 ms
Time: 365.418 ms
Since sum() and avg() use the same transition function, I was
surprised to see a difference here:
SELECT avg(x) from millionrowtable;
Time: 374.339 ms
Time: 372.294 ms
Time: 366.933 ms
Here's the statement you are talking about optimizing:
SELECT sum(x), count(x) from millionrowtable;
Time: 441.331 ms
Time: 442.501 ms
Time: 436.930 ms
To confirm that projecting the extra column compared to avg() was
not significant:
SELECT sum(x) / count(x) from millionrowtable;
Time: 442.404 ms
Time: 436.241 ms
Time: 442.381 ms
So this can reasonably be compared to the avg(x) time above.
On my machine this optimization could be expected to shave off
about 16% of current run time.
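(That's comparing the roughly 440 ms sum(x), count(x) runs to the
roughly 371 ms avg(x) runs: (440 - 371) / 440 is just under 16%.)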
One question that arose in my mind running this was whether we
might be able to combine sum(x) with count(*) if x was NOT NULL, even
though the arguments don't match. It might not be worth the
gymnastics of recognizing the special case, and I certainly
wouldn't recommend looking at that optimization in a first pass;
but it might be worth jotting down on a list somewhere....
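Just to make the equivalence concrete (an illustration of what the
planner would have to recognize, not a sketch of the implementation):

CREATE TABLE t (x numeric NOT NULL);
INSERT INTO t SELECT generate_series(1, 1000);
-- Because x is NOT NULL, count(*) and count(x) are guaranteed equal,
-- so a shared transition state for sum(x) could serve count(*) too.
SELECT sum(x), count(*), count(x) FROM t;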
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company