| From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> | 
|---|---|
| To: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> | 
| Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: PATCH: adaptive ndistinct estimator v4 | 
| Date: | 2015-06-17 14:47:28 | 
| Message-ID: | 55818880.10300@2ndquadrant.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers pgsql-performance | 
Hi,
On 05/13/15 23:07, Jeff Janes wrote:
> With the warning it is very hard to correlate the discrepancy you do
> see with which column is causing it, as the warnings don't include
> table or column names (Assuming of course that you run it on a
> substantial database--if you just run it on a few toy cases then the
> warning works well).
That's true. I've added attnum/attname to the warning in the attached 
version of the patch.
> If we want to have an explicitly experimental patch which we want
> people with interesting real-world databases to report back on, what
> kind of patch would it have to be to encourage that to happen? Or are
> we never going to get such feedback no matter how friendly we make
> it? Another problem is that you really need to have the gold standard
> to compare them to, and getting that is expensive (which is why we
> resort to sampling in the first place). I don't think there is much
> to be done on that front other than bite the bullet and just do
> it--perhaps only for the tables which have discrepancies.
Not sure. The "experimental" part of the patch was not really aimed at 
the users outside the development community - it was meant to be used by 
members of the community, possibly testing it on customer databases I 
don't think adding the GUC into the final release is a good idea, it's 
just a noise in the config no-one would actually use.
> Some of the regressions I've seen are at least partly a bug:
>
> +   /* find the 'm' value minimizing the difference */
> +   for (m = 1; m <= total_rows; m += step)
> +   {
> +       double q = k / (sample_rows * m);
>
> sample_rows and m are both integers, and their product overflows
> vigorously. A simple cast to double before the multiplication fixes
> the first example I produced. The estimate goes from 137,177 to
> 1,108,076. The reality is 1,062,223.
>
> Perhaps m should be just be declared a double, as it is frequently
> used in double arithmetic.
Yeah, I just discovered this bug independently. There's another bug that 
the adaptive_estimator takes total_rows as integer, so it breaks for 
tables with more than INT_MAX rows. Both are fixed in the v5.
>
>         Therefore, I think that:
>
>         1. This should be committed near the beginning of a release cycle,
>         not near the end. That way, if there are problem cases, we'll have a
>         year or so of developer test to shake them out.
>
>
> It can't hurt, but how effective will it be? Will developers know or
> care whether ndistinct happened to get better or worse while they
> are working on other things? I would think that problems will be
> found by focused testing, or during beta, and probably not by
> accidental discovery during the development cycle. It can't hurt, but
> I don't know how much it will help.
I agree with that - it's unlikely the regressions will get discovered 
randomly. OTOH I'd expect non-trivial number of people on this list to 
have a few examples of ndistinct failures, and testing those would be 
more useful I guess. But that's unlikely to find the cases that worked 
OK before and got broken by the new estimator :-(
> I agree with the "experimental GUC".  That way if hackers do happen to
> see something suspicious, they can just turn it off and see what
> difference it makes.  If they have to reverse out a patch from 6 months
> ago in an area of the code they aren't particularly interested in and
> then recompile their code and then juggle two different sets of
> binaries, they will likely just shrug it off without investigation.
+1
--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
| Attachment | Content-Type | Size | 
|---|---|---|
| ndistinct-estimator-v5.patch | text/x-diff | 13.2 KB | 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Petr Jelinek | 2015-06-17 15:51:26 | Re: Sequence Access Method WIP | 
| Previous Message | Robert Haas | 2015-06-17 13:56:36 | Re: Inheritance planner CPU and memory usage change since 9.3.2 | 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Sheena, Prabhjot | 2015-06-18 17:09:10 | PGBOUNCER ISSUE PLEASE HELP(Slowing down the site) | 
| Previous Message | Tomas Vondra | 2015-06-15 14:57:53 | Re: Do work_mem and shared buffers have 1g or 2g limit on 64 bit linux? |