Re: Extended stats - value not in MCV list

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Pedro Luis Guzmán Hernández <peterlgh7(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Extended stats - value not in MCV list
Date: 2021-05-04 18:22:28
Message-ID: 1e48e3a0-acef-b552-adda-8f5403d6ca86@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

this topic would probably be a better fit for pgsql-hackers, as it's
about the internal implementation, but let's continue the discussion
here for now.

On 5/2/21 6:52 AM, Pedro Luis Guzmán Hernández wrote:
> Hi there,
>
> I've just started using extended stats cause the planner was giving me
> terrible estimates for a certain table. MCV extended stats solved my
> problem when values are in the extended MCV list, but estimates are
> still terrible when they are not in the MCV list. Limited as my
> knowledge of database internals is, I dug into the source code and found
> an important difference on how these not-MCV values are handled in the
> single-column and multi-column cases.
>

Hmm, so which which PostgreSQL version are you testing this on? I'm not
sure which code branch you're looking at.

Can you share a reproducer, i.e. a simple example demonstrating the
issue? That'd be very helpful.

> For single columns, the estimate is calculated as follows:
> selectivity = (1 - sum(MCV_frequencies)) / (distinct_values - count(MCV))
> Which seems to assume a uniform distribution of non-MCV values and looks
> like a sensible guess, at least to my amateur eyes.
>
> For multi-column statistics it seems to me that the estimate is
> calculated instead as:
> selectivity = 1 - sum(MCV_frequencies)
> Which instead seems to assume that the value could potentially be
> present in all the rows not covered by the MCV. This seems like an
> adequate upper bound, but is also quite conservative compared to the
> single-column estimate. In my specific case this yields a selectivity
> even higher than some of the least frequent MCV values, which is a
> condition that is actually checked for and compensated in the
> single-column estimate as an additional check. I have MCV and
> distinct extended stats, so I know the distinct_values  stats is available.
>

It shouldn't behave like that, and some quick experiments suggest it
does not (at least on master). I can't rule out a bug, of course. A
reproducer would be helpful.

> So I hope my question is clear from the above. How come the estimates
> are calculated with such different approaches? I insist I have no
> experience with database/query planner development, so excuse me if I am
> overlooking some obvious conceptual difference between single-column and
> multi-column stats. The single-column estimate is actually described in
> the documentation, but the multi-column estimate is not. If there is
> indeed a good reason for this difference, I think it should be documented.
>

As for the ndistinct estimates and multi-column MCV lists, it's not all
that simple - there may be conditions on only some of the columns, in
which case we don't know how many groups we actually matched, etc.

TBH I'm not sure how much of those implementation details we want to put
into the user docs - it may be a bit too much, and we may need to change
some of it.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Venkata B Nagothi 2021-05-04 21:49:29 Postgres upgrade 12 - issues with OIDs
Previous Message Bruce Momjian 2021-05-04 16:52:02 Re: PostgreSQL upgrade from 10 to 12 fails with "ERROR: column r.relhasoids does not exist at character 1616"