Quick Links

Re: Statistics and selectivity estimation for ranges

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Statistics and selectivity estimation for ranges
Date:	2012-10-09 20:08:03
Message-ID:	CAPpHfdvPR8whLzjX70zx42Ur2_kjC2uTQW=vkPYwiTEtS=SGgA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Oct 1, 2012 at 3:22 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:

> On Tue, 2012-09-04 at 17:27 +0400, Alexander Korotkov wrote:
> > Addon patch is attached. Actually, I don't get your intention of
> > introducing STATISTIC_KIND_RANGE_EMPTY_FRAC stakind. Did you plan to
> > leave it as empty frac in distinct stakind or replace this stakind
> > with STATISTIC_KIND_LENGTH_HISTOGRAM? In the attached
> > patch STATISTIC_KIND_RANGE_EMPTY_FRAC is replaced
> > with STATISTIC_KIND_LENGTH_HISTOGRAM.
>
> Review comments:
>
> 1. In compute_range_stats, you need to guard against the case where
> there is no subdiff function. Perhaps default to 1.0 or something?
>

Let it be 1.0 without subdiff function. However, there is not so much use
of this method of estimation without subdiff. But, probably it's better
than nothing.

2. I think it would be helpful to add comments regarding what happens
> when lengths are identical, right now it's a little confusing. For
> instance, the comment: "Generate a length histogram slot entry if there
> are at least two length values" doesn't seem right, because the
> condition still matches even if there is only one distinct value.
>
I've rephrased comment. Not it's implicitly says that collected values are
not necessary distinct.

> 3. It looks like get_distance also needs to guard against a missing
> subdiff.
>

Same to compute_range_stats. Let default value be 1.0.

> 4. There are 3 binary search functions, which seems a little excessive:
> * rbound_bsearch: greatest i such that hist[i] < v; or -1
> * rbound_bsearch_equal: greatest i such that:
> hist[i] <= v and (i=0 or hist[i-1] != hist[i]); or -1
> * length_hist_bsearch: least i such that hist[i] >= v;
> or length of hist
> (let me know if I misunderstood the definitions)
> At a minimum, we need more consistent and informative names. Also, the
> definition of rbound_bsearch_equal is a little confusing because it's
> looking for the highest index among distinct values, but the lowest
> index among identical values. Do you see a way to refactor these to be a
> little easier to understand?
>

Actually, goal of rbound_bsearch_equal is to find histogram bin to start
interpolation from. I've renamed it to rbound_bsearch_bin and added
corresponding comment.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
range_stat-0.8.patch.gz	application/x-gzip	7.1 KB

In response to

Re: Statistics and selectivity estimation for ranges at 2012-09-30 23:22:01 from Jeff Davis

Responses

Re: Statistics and selectivity estimation for ranges at 2012-10-18 18:12:51 from Alvaro Herrera
Re: Statistics and selectivity estimation for ranges at 2012-11-05 14:12:38 from Alvaro Herrera
Re: Statistics and selectivity estimation for ranges at 2012-12-10 19:21:44 from Jeff Davis

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Simon Riggs	2012-10-09 20:12:04	Re: Truncate if exists
Previous Message	Robert Haas	2012-10-09 20:04:40	Re: Truncate if exists