Re: Understanding bad estimate (related to FKs?)

From: Guillaume Lelarge <guillaume(at)lelarge(dot)info>
To: Philip Semanchuk <philip(at)americanefficient(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Michael Lewis <mlewis(at)entrata(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Understanding bad estimate (related to FKs?)
Date: 2020-10-31 13:53:48
Message-ID: CAECtzeUcBHVv0DgazyufH2r57=gsdC-hGObKLCTRFwKFSKZezQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,

Le ven. 30 oct. 2020 à 15:57, Philip Semanchuk <philip(at)americanefficient(dot)com>
a écrit :

>
>
> > On Oct 29, 2020, at 6:48 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
> wrote:
> >
> > On Thu, Oct 29, 2020 at 11:25:48AM -0400, Philip Semanchuk wrote:
> >>
> >>
> >>> On Oct 28, 2020, at 9:13 PM, Justin Pryzby <pryzby(at)telsasoft(dot)com>
> >>> wrote:
> >>>
> >>> On Mon, Oct 26, 2020 at 11:20:01AM -0600, Michael Lewis wrote:
> >>>> On Mon, Oct 26, 2020 at 11:14 AM Philip Semanchuk
> >>>> <philip(at)americanefficient(dot)com> wrote:
> >>>>
> >>>>>>> The item I'm focused on is node 23. The estimate is for 7 rows,
> >>>>>>> actual
> >>>>> is 896 (multiplied by 1062 loops). I'm confused about two things in
> >>>>> this node.
> >>>>>>>
> >>>>>>> The first is Postgres' estimate. The condition for this index
> >>>>>>> scan
> >>>>> contains three expressions --
> >>>>>>>
> >>>>>>> (five_uniform = zulu_five.five_uniform) AND (whiskey_mike =
> >>>>>>> juliet_india.whiskey_mike) AND (bravo = 'mike'::text)
> >>>>>
> >>>>
> >>>> Are the columns correlated?
> >>>
> >>> I guess it shouldn't matter, since the FKs should remove all but one
> >>> of the conditions.
> >>
> >> Yes, I had the same expectation. I thought Postgres would calculate the
> >> selectivity as 1.0 * 1.0 * whatever estimate it has for the frequency
> >> of ‘mike’, but since the frequency estimate is very accurate but the
> >> planner’s estimate is not, there’s something else going on.
> >>
> >
> > Well, this is quite a bit more complicated, I'm afraid :-( The clauses
> > include parameters passed from the nodes above the index scan. So even
> > if we had extended stats on the table, we couldn't use them as that
> > requires (Var op Const) conditions. So this likely ends up with a
> > product of estimates for each clause, and even then we can't use any
> > particular value so we probably end up with something like 1/ndistinct
> > or something like that. So if the values actually passed to the index
> > scan are more common and/or if the columns are somehow correlated, it's
> > not surprising we end up with an overestimate.
>
> I appreciate the insight. 1/ndistinct is exactly right. In pg_stats,
> five_uniform’s ndistinct = 26326, and whiskey_mike’s ndistinct = 3. The
> estimated frequency of bravo = ‘mike’ is .02228. There are 25156157 rows in
> the source table, so we have:
>
> 25156157 * (1/26326.0) * (1/3.0) * .02228 = 7.0966494209
>
> Hence the estimate of 7 rows returned.
>
> It's interesting that five_uniform’s estimated ndistinct is low by > 50%
> (actual = 62958). Paradoxically, if I manually set ndistinct to the correct
> value of 62958, the estimate gets worse (3 rows instead of 7).
>
> Suggestions for fixing this are of course welcome. :-)
>
> On a related topic, are there any in depth guides to the planner that I
> could read? I can (and have) read the source code and it’s been
> informative, but something higher level than the source code would help.
>
>
You may already know this, but there's a bunch of documents up there:
https://wiki.postgresql.org/wiki/Using_EXPLAIN

I'm also working on a project to better document this. I'm just at the
beginning, writing it all, in english (which isn't my native language), so
it takes time. I already have most of it in french in various
documents/formats, but it takes time to go through all of these, summarize
them, and translate them. Anyway, work in progress as they say. You can
have a look at it there:
https://pgplanner.readthedocs.io/en/latest/index.html. Any comment/help is
very welcome.

--
Guillaume.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Philip Semanchuk 2020-11-02 19:09:03 Re: Understanding bad estimate (related to FKs?)
Previous Message Philip Semanchuk 2020-10-30 14:56:54 Re: Understanding bad estimate (related to FKs?)