Re: Understanding bad estimate (related to FKs?)

From: Philip Semanchuk <philip(at)americanefficient(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Michael Lewis <mlewis(at)entrata(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Understanding bad estimate (related to FKs?)
Date: 2020-10-30 14:56:54
Message-ID: 44E8D04E-A362-409B-AD2B-1D5860D01B61@americanefficient.com
Lists: pgsql-performance

> On Oct 29, 2020, at 6:48 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Thu, Oct 29, 2020 at 11:25:48AM -0400, Philip Semanchuk wrote:
>>
>>
>>> On Oct 28, 2020, at 9:13 PM, Justin Pryzby <pryzby(at)telsasoft(dot)com>
>>> wrote:
>>>
>>> On Mon, Oct 26, 2020 at 11:20:01AM -0600, Michael Lewis wrote:
>>>> On Mon, Oct 26, 2020 at 11:14 AM Philip Semanchuk
>>>> <philip(at)americanefficient(dot)com> wrote:
>>>>
>>>>>>> The item I'm focused on is node 23. The estimate is for 7 rows,
>>>>>>> actual
>>>>> is 896 (multiplied by 1062 loops). I'm confused about two things in
>>>>> this node.
>>>>>>>
>>>>>>> The first is Postgres' estimate. The condition for this index
>>>>>>> scan
>>>>> contains three expressions --
>>>>>>>
>>>>>>> (five_uniform = zulu_five.five_uniform) AND (whiskey_mike =
>>>>>>> juliet_india.whiskey_mike) AND (bravo = 'mike'::text)
>>>>>
>>>>
>>>> Are the columns correlated?
>>>
>>> I guess it shouldn't matter, since the FKs should remove all but one
>>> of the conditions.
>>
>> Yes, I had the same expectation. I thought Postgres would calculate the
>> selectivity as 1.0 * 1.0 * whatever estimate it has for the frequency
>> of ‘mike’, but the frequency estimate is very accurate while the
>> planner’s estimate is not, so there’s something else going on.
>>
>
> Well, this is quite a bit more complicated, I'm afraid :-( The clauses
> include parameters passed from the nodes above the index scan. So even
> if we had extended stats on the table, we couldn't use them as that
> requires (Var op Const) conditions. So this likely ends up with a
> product of estimates for each clause, and even then we can't use any
> particular value so we probably end up with something like 1/ndistinct
> or something like that. So if the values actually passed to the index
> scan are more common and/or if the columns are somehow correlated, it's
> not surprising we end up with an overestimate.

I appreciate the insight. 1/ndistinct is exactly right. In pg_stats, five_uniform’s ndistinct = 26326, and whiskey_mike’s ndistinct = 3. The estimated frequency of bravo = ‘mike’ is .02228. There are 25156157 rows in the source table, so we have:

25156157 * (1/26326.0) * (1/3.0) * .02228 = 7.0966494209

Hence the estimate of 7 rows returned.
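
For reference, here's roughly how I pulled those inputs from pg_stats (the table name below is just a placeholder for our real one):

    SELECT attname, n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'source_table'  -- placeholder name
      AND attname IN ('five_uniform', 'whiskey_mike', 'bravo');

The n_distinct values for five_uniform and whiskey_mike supply the 1/ndistinct factors, and the most_common_freqs entry for 'mike' in bravo supplies the .02228.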

It's interesting that five_uniform’s estimated ndistinct is low by > 50% (actual = 62958). Paradoxically, if I manually set ndistinct to the correct value of 62958, the estimate gets worse (3 rows instead of 7).
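
In case it helps anyone following along, the override was roughly the per-column n_distinct setting below (table name again a placeholder), followed by a fresh ANALYZE:

    ALTER TABLE source_table ALTER COLUMN five_uniform SET (n_distinct = 62958);
    ANALYZE source_table;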

Suggestions for fixing this are of course welcome. :-)

On a related topic, are there any in-depth guides to the planner that I could read? I can (and have) read the source code, and it’s been informative, but something higher level than the source code would help.

Thanks so much
Philip
