Quick Links

Re: Understanding bad estimate (related to FKs?)

From:	Philip Semanchuk <philip(at)americanefficient(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc:	Justin Pryzby <pryzby(at)telsasoft(dot)com>, Michael Lewis <mlewis(at)entrata(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: Understanding bad estimate (related to FKs?)
Date:	2020-10-30 14:56:54
Message-ID:	44E8D04E-A362-409B-AD2B-1D5860D01B61@americanefficient.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

> On Oct 29, 2020, at 6:48 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Thu, Oct 29, 2020 at 11:25:48AM -0400, Philip Semanchuk wrote:
>>
>>
>>> On Oct 28, 2020, at 9:13 PM, Justin Pryzby <pryzby(at)telsasoft(dot)com>
>>> wrote:
>>>
>>> On Mon, Oct 26, 2020 at 11:20:01AM -0600, Michael Lewis wrote:
>>>> On Mon, Oct 26, 2020 at 11:14 AM Philip Semanchuk
>>>> <philip(at)americanefficient(dot)com> wrote:
>>>>
>>>>>>> The item I'm focused on is node 23. The estimate is for 7 rows,
>>>>>>> actual
>>>>> is 896 (multiplied by 1062 loops). I'm confused about two things in
>>>>> this node.
>>>>>>>
>>>>>>> The first is Postgres' estimate. The condition for this index
>>>>>>> scan
>>>>> contains three expressions --
>>>>>>>
>>>>>>> (five_uniform = zulu_five.five_uniform) AND (whiskey_mike =
>>>>>>> juliet_india.whiskey_mike) AND (bravo = 'mike'::text)
>>>>>
>>>>
>>>> Are the columns correlated?
>>>
>>> I guess it shouldn't matter, since the FKs should remove all but one
>>> of the conditions.
>>
>> Yes, I had the same expectation. I thought Postgres would calculate the
>> selectivity as 1.0 * 1.0 * whatever estimate it has for the frequency
>> of ‘mike’, but since the frequency estimate is very accurate but the
>> planner’s estimate is not, there’s something else going on.
>>
>
> Well, this is quite a bit more complicated, I'm afraid :-( The clauses
> include parameters passed from the nodes above the index scan. So even
> if we had extended stats on the table, we couldn't use them as that
> requires (Var op Const) conditions. So this likely ends up with a
> product of estimates for each clause, and even then we can't use any
> particular value so we probably end up with something like 1/ndistinct
> or something like that. So if the values actually passed to the index
> scan are more common and/or if the columns are somehow correlated, it's
> not surprising we end up with an overestimate.

I appreciate the insight. 1/ndistinct is exactly right. In pg_stats, five_uniform’s ndistinct = 26326, and whiskey_mike’s ndistinct = 3. The estimated frequency of bravo = ‘mike’ is .02228. There are 25156157 rows in the source table, so we have:

25156157 * (1/26326.0) * (1/3.0) * .02228 = 7.0966494209

Hence the estimate of 7 rows returned.

It's interesting that five_uniform’s estimated ndistinct is low by > 50% (actual = 62958). Paradoxically, if I manually set ndistinct to the correct value of 62958, the estimate gets worse (3 rows instead of 7).

Suggestions for fixing this are of course welcome. :-)

On a related topic, are there any in depth guides to the planner that I could read? I can (and have) read the source code and it’s been informative, but something higher level than the source code would help.

Thanks so much
Philip

In response to

Re: Understanding bad estimate (related to FKs?) at 2020-10-29 22:48:45 from Tomas Vondra

Responses

Re: Understanding bad estimate (related to FKs?) at 2020-10-31 13:53:48 from Guillaume Lelarge

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Guillaume Lelarge	2020-10-31 13:53:48	Re: Understanding bad estimate (related to FKs?)
Previous Message	Tomas Vondra	2020-10-29 22:48:45	Re: Understanding bad estimate (related to FKs?)