Re: Choosing values for multivariate MCV lists

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Choosing values for multivariate MCV lists
Date: 2019-06-23 23:54:55
Message-ID: 20190623235455.td5yjk5rsmovwjin@development
Lists: pgsql-hackers

On Sat, Jun 22, 2019 at 04:10:52PM +0200, Tomas Vondra wrote:
>On Fri, Jun 21, 2019 at 08:50:33AM +0100, Dean Rasheed wrote:
>>On Thu, 20 Jun 2019 at 23:35, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>>
>>>On Thu, Jun 20, 2019 at 06:55:41AM +0100, Dean Rasheed wrote:
>>>
>>>>I'm not sure it's easy to justify ordering by Abs(freq-base_freq)/freq
>>>>though, because that would seem likely to put too much weight on the
>>>>least commonly occurring values.
>>>
>>>But would that be an issue, or a good thing? I mean, as long as the item
>>>is above mincount, we take the counts as reliable. As I explained, my
>>>motivation for proposing that was that both
>>>
>>> ... (cost=... rows=1 ...) (actual=... rows=1000001 ...)
>>>
>>>and
>>>
>>> ... (cost=... rows=1000000 ...) (actual=... rows=2000000 ...)
>>>
>>>have exactly the same Abs(freq - base_freq), but I think we both agree
>>>that the first misestimate is much more dangerous, because it's off by six
>>>orders of magnitude.
>>>
>>
>>Hmm, that's a good example. That definitely suggests that we should be
>>trying to minimise the relative error, but also perhaps that what we
>>should be looking at is actually just the ratio freq / base_freq,
>>rather than their difference.
>>
>
>Attached are patches that implement this (well, the first one optimizes
>how the base frequency is computed, which helps with high statistics
>target values). The 0002 part picks the items based on
>
> Max(freq/base_freq, base_freq/freq)
>
>It did help with the addresses data set quite a bit, but I'm sure it needs
>more testing. I've also tried using
>
> freq * abs(freq - base_freq)
>
>but the estimates were not as good.
>
>One annoying thing I noticed is that the base_frequency tends to end up
>being 0, most likely because it gets too small. That's a bit strange,
>though, because with the statistics target set to 10k the smallest
>frequency for a single column is 1/3e6, so for 2 columns it'd be
>~1/9e12 (which float8 should be able to represent just fine).
>
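
To make the ratio metric concrete, here is a minimal sketch of the
scoring function (simplified, with hypothetical names; it's not the
code from the attached patches):

  #include <float.h>

  /*
   * Score an MCV item by the worst-case ratio between the sample
   * frequency and the base frequency (the product of the per-column
   * frequencies, i.e. the independence estimate). A score of 1.0 means
   * the independence assumption matches the sample exactly; larger
   * scores mean the item is more valuable to keep in the MCV list.
   */
  static double
  mcv_error_score(double freq, double base_freq)
  {
      /* guard against a zero / underflowed base frequency */
      if (freq <= 0.0 || base_freq <= 0.0)
          return DBL_MAX;

      return (freq > base_freq) ? (freq / base_freq)
                                : (base_freq / freq);
  }

With this, the rows=1 vs. rows=1000001 example above scores roughly
1e6, while the rows=1000000 vs. rows=2000000 one scores only 2, which
matches the intuition about which misestimate is more dangerous.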

FWIW, while doing more tests on this I've noticed rather annoying
behavior when increasing the statistics target.

With the current algorithm, which picks values based purely on
frequency, the MCV list expands in a stable way - increasing the
statistics target means the MCV list may grow, but the larger MCV list
will contain the smaller one (ignoring changes due to differences in
the sample).

After switching to the two-phase algorithm (first picking candidates
based on mincount, then picking items based on error) that's no longer
true. I've repeatedly seen cases where increasing the target lowered
mincount, admitting candidates with larger frequency errors and pushing
some items out of the "smaller" MCV list.

In practice this means that increasing the statistics target may easily
make the estimates much worse (for some of the values).
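
To spell out the mechanism, here is a sketch of the two-phase selection
(hypothetical names, reusing mcv_error_score() from the sketch above;
the actual implementation differs in the details):

  #include <stdlib.h>

  typedef struct MCVCandidate
  {
      double      count;      /* occurrences in the sample */
      double      freq;       /* count / sample size */
      double      base_freq;  /* product of per-column frequencies */
  } MCVCandidate;

  /* qsort callback: larger error score sorts first */
  static int
  error_score_desc(const void *a, const void *b)
  {
      double  ea = mcv_error_score(((const MCVCandidate *) a)->freq,
                                   ((const MCVCandidate *) a)->base_freq);
      double  eb = mcv_error_score(((const MCVCandidate *) b)->freq,
                                   ((const MCVCandidate *) b)->base_freq);

      return (ea > eb) ? -1 : ((ea < eb) ? 1 : 0);
  }

  /*
   * Phase 1: keep only items with count >= mincount. Phase 2: rank the
   * survivors by error score and keep the top max_items. Returns the
   * number of items kept (moved to the front of the array).
   */
  static int
  select_mcv_items(MCVCandidate *items, int nitems, double mincount,
                   int max_items)
  {
      int     nkept = 0;
      int     i;

      for (i = 0; i < nitems; i++)
          if (items[i].count >= mincount)
              items[nkept++] = items[i];

      qsort(items, nkept, sizeof(MCVCandidate), error_score_desc);

      return (nkept < max_items) ? nkept : max_items;
  }

Nothing here guarantees that the items kept for a lower statistics
target are a subset of the items kept for a higher one - a larger
sample lowers mincount, phase 1 admits new candidates, and if those
score higher in phase 2 they displace previously selected items.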

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
