From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Statistics and selectivity estimation for ranges |
Date: | 2012-10-09 20:08:03 |
Message-ID: | CAPpHfdvPR8whLzjX70zx42Ur2_kjC2uTQW=vkPYwiTEtS=SGgA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Oct 1, 2012 at 3:22 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Tue, 2012-09-04 at 17:27 +0400, Alexander Korotkov wrote:
> > Addon patch is attached. Actually, I don't get your intention of
> > introducing STATISTIC_KIND_RANGE_EMPTY_FRAC stakind. Did you plan to
> > leave it as empty frac in distinct stakind or replace this stakind
> > with STATISTIC_KIND_LENGTH_HISTOGRAM? In the attached
> > patch STATISTIC_KIND_RANGE_EMPTY_FRAC is replaced
> > with STATISTIC_KIND_LENGTH_HISTOGRAM.
>
> Review comments:
>
> 1. In compute_range_stats, you need to guard against the case where
> there is no subdiff function. Perhaps default to 1.0 or something?
>
Let it be 1.0 without subdiff function. However, there is not so much use
of this method of estimation without subdiff. But, probably it's better
than nothing.
2. I think it would be helpful to add comments regarding what happens
> when lengths are identical, right now it's a little confusing. For
> instance, the comment: "Generate a length histogram slot entry if there
> are at least two length values" doesn't seem right, because the
> condition still matches even if there is only one distinct value.
>
I've rephrased comment. Not it's implicitly says that collected values are
not necessary distinct.
> 3. It looks like get_distance also needs to guard against a missing
> subdiff.
>
Same to compute_range_stats. Let default value be 1.0.
> 4. There are 3 binary search functions, which seems a little excessive:
> * rbound_bsearch: greatest i such that hist[i] < v; or -1
> * rbound_bsearch_equal: greatest i such that:
> hist[i] <= v and (i=0 or hist[i-1] != hist[i]); or -1
> * length_hist_bsearch: least i such that hist[i] >= v;
> or length of hist
> (let me know if I misunderstood the definitions)
> At a minimum, we need more consistent and informative names. Also, the
> definition of rbound_bsearch_equal is a little confusing because it's
> looking for the highest index among distinct values, but the lowest
> index among identical values. Do you see a way to refactor these to be a
> little easier to understand?
>
Actually, goal of rbound_bsearch_equal is to find histogram bin to start
interpolation from. I've renamed it to rbound_bsearch_bin and added
corresponding comment.
------
With best regards,
Alexander Korotkov.
Attachment | Content-Type | Size |
---|---|---|
range_stat-0.8.patch.gz | application/x-gzip | 7.1 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Simon Riggs | 2012-10-09 20:12:04 | Re: Truncate if exists |
Previous Message | Robert Haas | 2012-10-09 20:04:40 | Re: Truncate if exists |