From: | "Nathan Boley" <npboley(at)gmail(dot)com> |
---|---|
To: | "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "Vladimir Sitnikov" <sitnikov(dot)vladimir(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, "jd(at)commandprompt(dot)com" <jd(at)commandprompt(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: benchmarking the query planner |
Date: | 2008-12-12 02:39:56 |
Message-ID: | 6fa3b6e20812111839l29b851e7v409470aee8ed7785@mail.gmail.com |
Lists: pgsql-hackers
>> Isn't a selectivity estimate of x = v as ( the number of values in v's
>> histogram bucket ) / ( number of distinct values in v's histogram
>> bucket ) pretty rational? That's currently what we do for non-MCV
>> values, except that we look at ndistinct over the whole table instead
>> of individual histogram buckets.
>
> But the histogram buckets are (meant to be) equal-population, so it
> should come out the same either way.
Why is it the same either way? The histogram buckets have equal total
population, not an equal number of distinct values. Consider
[0,0,0,0,1,2,3,4,5,6]. The histogram buckets would be
[0,2) and [2,+oo), but they would have 2 and 5 distinct values
respectively.
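To make the arithmetic concrete, here is a minimal sketch in Python (toy
code of mine, not anything from analyze.c) comparing the whole-table
estimate with the per-bucket one for that sample, ignoring MCV removal:

    # Toy numbers only: compare 1/ndistinct over the whole sample with
    # (bucket fraction) / (ndistinct within the bucket).
    sample = [0, 0, 0, 0, 1, 2, 3, 4, 5, 6]
    num_buckets = 2
    step = len(sample) // num_buckets          # equal-population buckets
    buckets = [sample[i:i + step] for i in range(0, len(sample), step)]

    ndistinct_table = len(set(sample))         # 7 distinct values overall
    for b in buckets:
        sel_current = 1.0 / ndistinct_table                 # ~0.143 for every bucket
        sel_per_bucket = (1.0 / num_buckets) / len(set(b))  # 0.25 vs 0.10
        print(sorted(set(b)), sel_current, sel_per_bucket)

For the second bucket the per-bucket estimate (0.10) is exactly the true
frequency of each of its values, while the flat whole-table estimate is
dragged up by the duplicated 0 (which, in practice, would probably be
pulled into the MCV list anyway; the sketch ignores that).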
> The removal of MCVs from the
> population will knock any possible variance in ndistinct down to the
> point where I seriously doubt that this could offer a win.
Well, if the number of MCVs is large enough then it certainly won't
matter. But isn't that pointlessly inefficient for most distributions?
I provided some empirical evidence of this for normal distributions in
an earlier post
(http://archives.postgresql.org/pgsql-hackers/2008-06/msg00353.php).
> An even
> bigger problem is that this requires estimation of ndistinct among
> fractions of the population, which will be proportionally less accurate
> than the overall estimate. Accurate ndistinct estimation is *hard*.
>
Agreed, but I'm not convinced that the overall error rate would go up
relative to our current estimator. Ndistinct estimation is hardest for
populations whose ndistinct count varies a lot. If the assumption
underlying this proposal is sound (namely, that ndistinct distributions
are related to the ordering relation in a predictable way), then
per-bucket ndistinct estimation should be easier, because the ndistinct
distribution within a bucket will be more uniform. But that's certainly
something that should be tested on real data.
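One rough way to poke at that question, in Python: I'm assuming the
Haas/Stokes-style estimator that I believe analyze.c uses
(n*d / (n - f1 + f1*n/N)), and the data generator and names here are
just mine, so treat this as a sketch to point at real columns rather
than any kind of result:

    import random

    def estimate_ndistinct(sample_vals, total_rows):
        # Haas/Stokes-style estimate: n*d / (n - f1 + f1*n/N), where
        # n = sample size, d = distinct values in the sample,
        # f1 = values seen exactly once, N = rows in the table.
        n = len(sample_vals)
        counts = {}
        for v in sample_vals:
            counts[v] = counts.get(v, 0) + 1
        d = len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)
        denom = n - f1 + f1 * n / float(total_rows)
        return n * d / denom if denom > 0 else d

    random.seed(0)
    N = 1000000                        # pretend table size
    sample = sorted(int(random.gauss(0, 50)) for _ in range(30000))

    step = len(sample) // 100          # 100 equal-population "buckets"
    whole_table = estimate_ndistinct(sample, N)
    per_bucket = [estimate_ndistinct(sample[i:i + step], N // 100)
                  for i in range(0, len(sample), step)]
    print(whole_table, sum(per_bucket), min(per_bucket), max(per_bucket))

The interesting comparison is how the per-bucket estimates (and their
sum) behave against the truth on a real, skewed column; synthetic data
like this won't settle it.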
>> now, if there are 100 histogram buckets then any values that occupy
>> more than 1% of the table will be MCVs regardless - why force a value
>> to be an MCV if it only occupies 0.1% of the table?
>
> Didn't you just contradict yourself? The cutoff would be 1% not 0.1%.
No, I'm saying that if the number of histogram buckets is 100, then
even if the MCV list is set to length 2, every value that occupies 1%
or more of the table will be an MCV. However, if the MCV list size is
fixed at 100, I think it will be common to see values with relative
frequencies much lower than 1% near the bottom of the list.
> In any case there's already a heuristic to cut off the MCV list at some
> shorter length (ie, more than 1% in this example) if it seems not
> worthwhile to keep the last entries. See lines 2132ff (in CVS HEAD)
> in analyze.c.
Given an increase in the default stats target, cutoffs like that 25%
rule are exactly the sort of limit I think there should be more of.
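For reference, a rough paraphrase of that heuristic (from memory, so
check analyze.c rather than trusting this) looks something like:

    def mcv_mincount(samplerows, ndistinct, num_bins):
        # Keep a candidate MCV only if its sample count clears this bar:
        # 25% above the count of a "typical" value, at least 2, but never
        # more than the all-values-equal count of samplerows / num_bins.
        avgcount = samplerows / float(ndistinct)
        mincount = max(avgcount * 1.25, 2.0)
        maxmincount = samplerows / float(num_bins)
        return min(mincount, maxmincount)

    # e.g. 30000 sampled rows, 5000 distinct values, 100 bins: the bar is
    # 7.5 occurrences, i.e. 0.025% of the sample, far below the 1% level.
    print(mcv_mincount(30000, 5000, 100))

Which is why I'd expect plenty of sub-1% entries near the bottom of a
100-slot MCV list.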
Can anyone suggest a good data set to test this sort of question on?
-Nathan