From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-04-20 05:13:37
Message-ID: CAMkU=1wxHZEWGh98rg-2wq1HPbWxs6Cf_MTwHTg92gT=QOdZcQ@mail.gmail.com
Lists: pgsql-hackers pgsql-performance
On Tue, Apr 14, 2015 at 11:45 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Tue, Mar 31, 2015 at 12:02 PM, Tomas Vondra <
> tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>> Hi all,
>>
>> attached is v4 of the patch implementing adaptive ndistinct estimator.
>>
>
> Hi Tomas,
>
> I have a case here where the adaptive algorithm underestimates ndistinct
> by a factor of 7 while the default estimator is pretty close.
>
> 5MB file:
>
>
> https://drive.google.com/file/d/0Bzqrh1SO9FcETU1VYnQxU2RZSWM/view?usp=sharing
>
> # create table foo2 (x text);
> # \copy foo2 from program 'bzcat ~/temp/foo1.txt.bz2'
> # analyze verbose foo2;
> INFO: analyzing "public.foo2"
> INFO: "foo2": scanned 6021 of 6021 pages, containing 1113772 live rows
> and 0 dead rows; 30000 rows in sample, 1113772 estimated total rows
> WARNING: ndistinct estimate current=998951.78 adaptive=135819.00
>
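For reference, the "current" estimate in that WARNING comes, as I understand it, from the Haas & Stokes Duj1 estimator that ANALYZE applies to the sample. A rough Python sketch (the function name and unique-column fallback are mine, not taken from the actual analyze.c code):

```python
def duj1_estimate(n, N, d, f1):
    """Haas-Stokes Duj1 estimator, roughly as used by ANALYZE.

    n  = number of rows in the sample
    N  = estimated total rows in the table
    d  = distinct values observed in the sample
    f1 = values observed exactly once in the sample
    """
    if f1 >= n:
        # Every sampled value was unique; treat the column as unique.
        return float(N)
    return n * d / (n - f1 + f1 * n / N)
```

With a large f1 (many once-seen values) the denominator shrinks and the estimate is scaled up toward N, which is why skewed columns like the one in foo2 are where the estimators diverge.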
I've done a more complete analysis with a real-world database I have access
to.
I've analyzed patched and current with default_statistics_target of 100 and
10000. (I also have some of the same data under 9.2, but that is not
meaningfully different from unpatched head.)
For easier interpretation I hacked up the analyzer so that it just reports
the estimated number, never converting to the negative fraction.
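For anyone unfamiliar with the convention I suppressed: pg_statistic stores stadistinct as a negative value when the number of distinct values is expected to scale with the table, meaning minus the fraction of rows that are distinct. A trivial sketch of the conversion (function name is mine):

```python
def stadistinct_to_count(stadistinct, total_rows):
    """Convert a pg_statistic.stadistinct value to an absolute count.

    Negative values mean -(fraction of rows that are distinct);
    positive values are an absolute distinct count.
    """
    if stadistinct < 0:
        return -stadistinct * total_rows
    return stadistinct
```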
See the spreadsheet here:
https://docs.google.com/spreadsheets/d/1qUcBoQkRFFcSDq7GtkiQkHqlLTbxQYl5hh6S0byff2M/edit?usp=sharing
The 10000-target runs were initially collected in an attempt to discern the
truth when the 100-target methods disagreed, but I decided to just collect
the gold-standard truth instead.
The truth is given by:
select count(*) from (select distinct column from schema.table where column
is not null) foo;
And number_not_null is given by:
select count(*) from schema.table where column is not null;
It looks like the proposed method sometimes overestimates, although never
by a huge amount, while the old one never overestimated. Overall it mostly
seems to be more accurate, but occasionally it does substantially worse
than the current method. I suspect most of the problems are related to the
same issue reported in the last email. There are a lot of places where
both underestimate, but where the new method does so by less than head.
If there are any columns anyone wants to examine further, give me the token
for it and I'll try to create a generator that generates data with the same
distribution (and clustering, if that seems relevant).
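In case it helps, the kind of generator I have in mind would look roughly like this (the interface and names are just illustrative): take the observed frequency profile of a column, i.e. how many distinct values occur with each row count, and rebuild a synthetic column with the same profile.

```python
import random

def generate_column(frequency_counts, seed=0):
    """Rebuild a column matching a distinct-value frequency profile.

    frequency_counts maps (rows per value) -> (number of distinct
    values with that row count), e.g. {1: 5, 10: 2} means five values
    appearing once and two values appearing ten times each.
    """
    rng = random.Random(seed)
    rows = []
    value_id = 0
    for per_value, num_values in frequency_counts.items():
        for _ in range(num_values):
            rows.extend([f"val{value_id}"] * per_value)
            value_id += 1
    rng.shuffle(rows)  # destroys clustering; skip this to preserve it
    return rows
```

Capturing clustering as well would need the shuffle replaced by something that preserves the original physical ordering, which is why I mention it separately.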
Cheers,
Jeff