Extended stats - value not in MCV list

From: Pedro Luis Guzmán Hernández <peterlgh7(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Extended stats - value not in MCV list
Date: 2021-05-02 04:52:56
Message-ID: CAHKGT0aEvr7ac68b0z9dzHM8aWiKZjrfBX6hV58T0TbrAAcbVw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi there,

I've just started using extended stats cause the planner was giving me
terrible estimates for a certain table. MCV extended stats solved my
problem when values are in the extended MCV list, but estimates are still
terrible when they are not in the MCV list. Limited as my knowledge of
database internals is, I dug into the source code and found an important
difference on how these not-MCV values are handled in the single-column and
multi-column cases.

For single columns, the estimate is calculated as follows:
selectivity = (1 - sum(MCV_frequencies)) / (distinct_values - count(MCV))
Which seems to assume a uniform distribution of non-MCV values and looks
like a sensible guess, at least to my amateur eyes.

For multi-column statistics it seems to me that the estimate is calculated
instead as:
selectivity = 1 - sum(MCV_frequencies)
Which instead seems to assume that the value could potentially be present
in all the rows not covered by the MCV. This seems like an adequate upper
bound, but is also quite conservative compared to the single-column
estimate. In my specific case this yields a selectivity even higher than
some of the least frequent MCV values, which is a condition that is
actually checked for and compensated in the single-column estimate as an
additional check. I have MCV and distinct extended stats, so I know the
distinct_values stats is available.

So I hope my question is clear from the above. How come the estimates are
calculated with such different approaches? I insist I have no experience
with database/query planner development, so excuse me if I am overlooking
some obvious conceptual difference between single-column and multi-column
stats. The single-column estimate is actually described in the
documentation, but the multi-column estimate is not. If there is indeed a
good reason for this difference, I think it should be documented.

Thanks,
Pedro

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Dmitry Dolgov 2021-05-02 12:50:47 Re: Why is writing JSONB faster than just JSON?
Previous Message Adrian Klaver 2021-05-01 16:26:41 Re: Access a newer Version of PGDB (v13) with an older libpq (v10 x86)