From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3)
Date: 2015-02-19 03:08:33
Message-ID: 54E553B1.6030601@2ndquadrant.com
Lists: pgsql-hackers pgsql-performance
On 23.12.2014 11:28, Heikki Linnakangas wrote:
> On 12/07/2014 03:54 AM, Tomas Vondra wrote:
>> The one interesting case is the 'step skew' with statistics_target=10,
>> i.e. estimates based on mere 3000 rows. In that case, the adaptive
>> estimator significantly overestimates:
>>
>> values current adaptive
>> ------------------------------
>> 106 99 107
>> 106 8 6449190
>> 1006 38 6449190
>> 10006 327 42441
>>
>> I don't know why I didn't get these errors in the previous runs, because
>> when I repeat the tests with the old patches I get similar results with
>> a 'good' result from time to time. Apparently I had a lucky day back
>> then :-/
>>
>> I've been messing with the code for a few hours, and I haven't
>> found any significant error in the implementation, so it seems that
>> the estimator does not perform terribly well for very small samples
>> (in this case it's 3000 rows out of 10.000.000, i.e. ~0.03%).
>
> The paper [1] gives an equation for an upper bound of the error of this
> GEE estimator. How do the above numbers compare with that bound?
Well, that's a bit more complicated, because the "Theorem 1" you mention
does not directly specify an upper boundary for a single estimate. It's
formulated like this:
Assume a table with "N" rows, D distinct values and a sample of "r"
rows (all those values are fixed). Then there exists a dataset with
those features, so that the "ratio error" of an estimate D'
    error(D, D') = max(D'/D, D/D')
is greater than f(N, r, P) with probability at least "P". I.e. if you
randomly choose a sample of "r" rows, you'll get an error exceeding
that boundary with probability at least P.
So it's not a hard limit, but a statement about the probability of the
estimation error on some (unknown) dataset. It describes what you can
achieve at best - if you think your estimator is better, there'll always
be a dataset hiding in the shadows hissing "Theorem 1".
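To make the metric concrete, here's a minimal sketch of the ratio error
in Python (assuming the "values" column in the table above is the actual
number of distinct values):

    # ratio error as defined above: max(D'/D, D/D')
    def ratio_error(d, d_est):
        return max(float(d_est) / d, float(d) / d_est)

    print(ratio_error(106, 107))    # ~1.01 (adaptive 107 vs. 106 actual)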
Let's say we're looking for a boundary that's crossed in only 1% (or 5%)
of measurements. Applying this to the sample data I posted before, i.e.
10M rows with three sample sizes "r" (3000, 30.000 and 300.000 rows),
the ratio error boundary per the paper is
          3.000   30.000   300.000
   --------------------------------
   1%        88       28         9
   5%        70       22         7
At least that's what I get if I compute it using this Python function:

    from math import log, sqrt

    def err(N, r, p):
        # boundary from "Theorem 1": sqrt((N - r) / (2r) * ln(1/p))
        return sqrt((N - r) / (2.0 * r) * log(1.0 / p))
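For completeness, this is roughly how I'd tabulate the boundaries for
the three sample sizes with that function (a minimal sketch, printing
one decimal place instead of rounding):

    N = 10 * 1000 * 1000

    for p in (0.01, 0.05):
        bounds = tuple(err(N, r, p) for r in (3000, 30000, 300000))
        print("%d%%  %6.1f  %6.1f  %6.1f" % ((int(p * 100),) + bounds))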
So the estimates I posted before are not terribly good, I guess,
especially the ones returning 6449190. I wonder whether there really is
some stupid bug in the implementation.
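For illustration, using the two functions above, the worst overestimate
from the table is several orders of magnitude above the 1% boundary for
the 3000-row sample (again assuming the "values" column is the actual
ndistinct):

    print(ratio_error(106, 6449190))           # ~60841
    print(err(10 * 1000 * 1000, 3000, 0.01))   # ~87.6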
regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services