From: | Petr Jelinek <petr(at)2ndquadrant(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> |
Cc: | Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Built-in binning functions |
Date: | 2014-08-31 19:55:10 |
Message-ID: | 54037D9E.1080904@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 30/08/14 19:24, Tom Lane wrote:
> Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:
>> 1. I am thinking so reduction to only numeric types is not necessary -
>> although we can live without it - but there are lot of non numeric
>> categories: chars, date, ...
>
> I wasn't terribly happy about that either. I still think we should
> reduce this to a single polymorphic function, as in the attached.
>
I did try to write generic function very similar to what you wrote but
discarded it because of the performance reasons. If we really want to
support any data type I am all for having generic function but I still
would like to have one optimized for float8 because that seems to be the
most used use-case (at least that one is the reason why I even wrote
this) for performance reasons.
Here are some numbers:
First float8:
CREATE TABLE wbtest AS SELECT (random() * 100)::float8 a FROM
generate_series(1,1000000) ORDER BY 1;
SELECT
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::float8[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::float8[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::float8[])
FROM wbtest;
Optimized float8: ~250ms
Tom's generic: ~400ms
Numeric:
CREATE TABLE wbtestn AS SELECT (random() * 100)::numeric a FROM
generate_series(1,1000000) ORDER BY 1;
SELECT
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::numeric[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::numeric[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::numeric[])
FROM wbtestn;
Optimized numeric: ~600ms
My generic: ~780ms
Tom's generic: ~900ms
The difference between my generic and Tom's generic is because Tom's is
slowed down by the deconstruct_array.
>> 2. Still I strongly afraid about used searching method - there is not
>> possible to check a validity of input. Did you check how much linear
>> searching is slower - you spoke, so you have a real data and real use case?
>
> Well, if we check the input then we'll be doing O(N) comparisons instead
> of O(log N). That could be a serious cost penalty for large arrays and
> expensive comparison functions (such as strcoll()). I think it's probably
> sufficient to have a clear warning in the docs.
>
With resort the performance is worse than bunch of CASE WHEN :(
>
> Also, a documentation quibble: in Petr's patch the documentation and
> comments refer to the thresholds as "right bounds", which I didn't care
> for and replaced with "upper bounds". However, looking at it again,
> I wonder if we would not be better off describing them as "lower bounds".
> They are exclusive bounds if seen as upper bounds, and inclusive if seen
> as lower bounds, and on the whole I think the latter viewpoint might be
> less confusing. Thoughts?
Upper bounds is probably better name than right bounds, I agree with
that. In any case it's upper bound for a bucket (that's why there is one
more bucket to which everything bigger than max threshold goes into).
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2014-08-31 20:03:30 | Re: On partitioning |
Previous Message | Gavin Flower | 2014-08-31 19:44:59 | Re: Built-in binning functions |