| From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> | 
|---|---|
| To: | pgsql-hackers(at)postgresql(dot)org | 
| Subject: | Re: statistics for array types | 
| Date: | 2015-08-21 01:00:00 | 
| Message-ID: | 55D67810.2010709@2ndquadrant.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Hi,
On 08/11/2015 04:38 PM, Jeff Janes wrote:
> When reviewing some recent patches, I decided the statistics gathered
>  for arrays had some pre-existing shortcomings.
>
> The main one is that when the arrays contain rare elements there is
> no histogram to fall back upon when the MCE array is empty, the way
> there is for scalar stats.  So it has to punt completely and resort
> to saying that it is 0.5% selectivity without recourse to any data at
> all.
>
> The rationale for applying the threshold before things are eligible
> for inclusion in the MCE array seems to be that this puts some
> theoretical bound on the amount of error we are likely to have in
> that element.  But I think it is better to exceed that theoretical
> bound than it is to have no data at all.
>
> The attached patch forces there to be at least one element in MCE,
> keeping the one element with the highest predicted frequency if the
> MCE would otherwise be empty.  Then any other element queried for is
> assumed to be no more common than this most common element.
We only really need the frequency, right? So do we really need to keep
the actual MCV element? I.e. most_common_elem_freqs does not have the
same number of values as most_common_elems anyway:
   A list of the frequencies of the most common element values, i.e., the
   fraction of rows containing at least one instance of the given value.
   Two or three additional values follow the per-element frequencies;
   these are the minimum and maximum of the preceding per-element
   frequencies, and optionally the frequency of null elements.
   (Null when most_common_elems is.)
So we might modify it so that it's always defined - either it tracks the
same values as today (when most_common_elems is defined), or the
frequency of the most common element (when most_common_elems is NULL).
This way we can keep the current theoretical error-bound on the MCE
frequencies, and if that's not possible we can have at least the new
value without confusing existing code.
> I'd also briefly considered just having the part of the code that
> pulls the stats out of pg_stats interpret a MCE array as meaning
> that nothing is more frequent than the threshold, but that would mean
> that that part of the code needs to know about how the threshold is
> chosen, which just seems wrong. And it would need to know the
> difference between NULL MCE because no stats were gathered, versus
> because stats were gathered but nothing met the threshold.
I'm not sure whether this is the same thing I just proposed ...
regards
--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Michael Paquier | 2015-08-21 01:01:29 | Re: [PATCH] postgres_fdw extension support | 
| Previous Message | Kouhei Kaigai | 2015-08-21 00:28:41 | Re: DBT-3 with SF=20 got failed |