Re: MCV lists for highly skewed distributions

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: John Naylor <jcnaylor(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MCV lists for highly skewed distributions
Date: 2018-02-07 15:20:26
Message-ID: CAEZATCUy3Fp=8TS16rSOwRrDTtuWT8OEk-m4Zgg-Y1oqNaZ+-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7 February 2018 at 13:29, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>> Thanks for testing. I agree, this new algorithm seems to stand up
>> pretty well in the testing I've done too. One thing about it that can
>> be tuned is the cutoff threshold for the relative standard error -- I
>> chose 10% completely arbitrarily, but it could just as easily be 20%,
>> 30% or even higher (in the patch s/0.01/0.04/ or s/0.01/0.09).
>
> Maybe it'd be worth having a separate GUC for this, and a reloption to
> override the GUC. It seems to me that it would be very reasonable to
> want to separately control (a) how much sampling you want to do and
> (b) how aggressively you want to be about including things in the MCV
> list. Of course, even if we do that, it doesn't excuse us from
> needing to set a good default value. And it might not be necessary to
> do the extra work for this.
>
> Looking at your data, it definitely seems like 10% would be too strict
> -- if I'm reading this correctly, with a stats target in the 10-50
> range, your normally-distributed table gets no MCVs at all, rather
> than a number equal to the statistics target. That doesn't seem good.
>

(Actually it was the 10-30 stats target range that gave no MCVs at all for 10%)

The fact that 10% leads to no MCVs for a deliberately lowered stats
target of 10 isn't necessarily bad, and might actually have been an
improvement -- e.g., with a stats target of 1, you get no MCVs even
with a 20% RSE cutoff, whereas with HEAD you typically get 1
completely random MCV, and a wildly inaccurate estimate for that
value.

The reason I think 10% is too low is that the extra MCVs you get by
increasing it to 20% generally seem to give better estimates (for a
range of stats targets) than if they weren't included (although it's
sometimes a bit marginal). So 20% seems to strike about the right
balance between too many and too few MCVs.

One thing this new algorithm does do is improve the user's ability to
get more MCVs by increasing the stats target. I'm not yet convinced
there should be a separate knob for the RSE cutoff. For that to be
useful, there would need to be some data distributions for which 10%
(say) was clearly better, and some for which 20% was better. So far,
there doesn't appear to be a massive difference between the two, and
it's nothing that can't compensated for using the existing stats
target knob.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2018-02-07 15:25:04 Re: MCV lists for highly skewed distributions
Previous Message David Steele 2018-02-07 15:14:18 Re: Re: [HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)