Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Mark Dilger <hornschnorter(at)gmail(dot)com>, Adrien Nayrat <adrien(dot)nayrat(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date: 2019-01-10 15:45:23
Message-ID: e7d989e1-4a06-d2ab-8950-eb711a14df9c@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/10/19 4:20 PM, Dean Rasheed wrote:
> On Wed, 9 Jan 2019 at 15:40, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> On 1/8/19 3:18 PM, Dean Rasheed wrote:
>>> So actually, the estimate for a group of values will be either the MCV
>>> item's frequency (if the MCV item is kept), or (roughly) the MCV
>>> item's base_frequency (if the MCV item is not kept). That suggests
>>> that we should simply keep items that are significantly more or less
>>> common than the item's base frequency -- i.e., keep rule (b) and ditch
>>> rule (a).
>>>
>>
>> Hmmm, but won't that interfere with how we with how we extrapolate the
>> MCV estimate to the non-MCV part? Currently the patch does what you
>> proposed, i.e.
>>
>> other_sel = simple_sel - mcv_basesel;
>>
>> I'm worried that if we only include the items that are significantly
>> more or less common than the base frequency, it may skew the other_sel
>> estimate.
>>
>
> I don't see how that would skew other_sel. Items close to the base
> frequency would also tend to be close to simple_sel, making other_sel
> approximately zero, so excluding them should have little effect.

Oh, I see. You're right those items should contribute very little to
other_sel, I should have realized that.

> However...
>
> Re-reading the thread where we enhanced the per-column MCV stats last
> year [1], it was actually the case that an algorithm based on just
> looking at the relative standard error worked pretty well for a very
> wide range of data distributions.
>
> The final algorithm chosen in analyze_mcv_list() was only a marginal
> improvement on that, and was directly based upon the fact that, in the
> univariate statistics case, all the values not included in the MCV
> list are assigned the same selectivity. However, that's not the case
> for multivariate stats, because each group not included in the
> multivariate MCV list gets assigned a different selectivity based on
> its per-column stats.
>
> So perhaps what we should do for multivariate stats is simply use the
> relative standard error approach (i.e., reuse the patch in [2] with a
> 20% RSE cutoff). That had a lot of testing at the time, against a wide
> range of data distributions, and proved to be very good, not to
> mention being very simple.
>
> That approach would encompass both groups more and less common than
> the base frequency, because it relies entirely on the group appearing
> enough times in the sample to infer that any errors on the resulting
> estimates will be reasonably well controlled. It wouldn't actually
> look at the base frequency at all in deciding which items to keep.
>
> Moreover, if the group appears sufficiently often in the sample to
> justify being kept, each of the individual column values must also
> appear at least that often as well, which means that the errors on the
> base frequency estimate are also well controlled. That was one of my
> concerns about other algorithms such as "keep items significantly more
> or less common than the base frequency" -- in the less common case,
> there's no lower bound on the number of occurrences seen, and so no
> guarantee that the errors are kept under control.
>

Yep, that looks like a great approach. Simple and tested. I'll try
tweaking the patch accordingly over the weekend.

Thanks!

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2019-01-10 15:58:31 Re: Policy on cross-posting to multiple lists
Previous Message Dean Rasheed 2019-01-10 15:20:51 Re: [HACKERS] PATCH: multivariate histograms and MCV lists