From: | John Naylor <jcnaylor(at)gmail(dot)com> |
---|---|
To: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | stricter MCV tests for uniform distributions (was Re: MCV lists for highly skewed distributions) |
Date: | 2018-01-22 08:07:35 |
Message-ID: | CAJVSVGVzZgYV3dHOnnRgLdW9B0mocyTgwWvDfMK7rO8NynwO4g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
(Starting a new thread so as not to distract review)
On 1/21/18, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> On 21 January 2018 at 07:26, John Naylor <jcnaylor(at)gmail(dot)com> wrote:
>> I spent a few hours hacking on this, and it turns out calculating the
>> right number of MCVs taking into account both uniform and highly
>> non-uniform distributions is too delicate a problem for me to solve
>> right now. The logic suggested by Dean Rasheed in [1] always produces
>> no MCVs for a perfectly uniform distribution (which is good), but very
>> often also for other distributions, which is not good. My efforts to
>> tweak that didn't work, so I didn't get as far as adapting it for the
>> problem Jeff is trying to solve.
>
> Hmm, Tom suggested that the test based on the average frequency over
> all values might be too strict because the estimated number of
> distinct values is often too low, so that might explain what you're
> seeing.
In my test tables, I've noticed that our Ndistinct estimator is most
inaccurate for geometric distributions, so that's certainly possible,
but confusingly, it occasionally gave an empty MCV list along with a
histogram with a boundary duplicated 5 times, which I thought I was
guarding against. I'm thinking my implementation of your logic is
flawed somehow. In case you're curious I've attached my rough
(complier warnings and all) test patch.
> It occurs to me that maybe a better test to exclude a value from the
> MCV list would be to demand that its relative standard error not be
> too high. Such a test, in addition to the existing tests, might be
> sufficient to solve the opposite problem of too many values in the MCV
> list, because the real problem there is including a value after having
> seen relatively few occurrences of it in the sample, and thus having a
> wildly inaccurate estimate for it. Setting a bound on the relative
> standard error would mean that we could have a reasonable degree of
> confidence in estimates produced from the sample.
If you don't mind, what would the math look like for that?
-John Naylor
Attachment | Content-Type | Size |
---|---|---|
SE_test_for_MCV_list.patch | text/x-patch | 4.9 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Masahiko Sawada | 2018-01-22 08:14:29 | Re: PATCH: Exclude unlogged tables from base backups |
Previous Message | Amit Khandekar | 2018-01-22 07:44:51 | Re: [HACKERS] UPDATE of partition key |