Re: Huge generated UNION ALL faster than JOIN?

From: Ancoron Luciferis <ancoron(dot)luciferis(at)googlemail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Huge generated UNION ALL faster than JOIN?
Date: 2019-05-12 22:23:14
Message-ID: a850f693-2167-8a2b-f720-b27852d905c4@googlemail.com
Lists: pgsql-performance

On 12/05/2019 20:08, Tom Lane wrote:
> Ancoron Luciferis <ancoron(dot)luciferis(at)googlemail(dot)com> writes:
>> One of the things consuming most of the time was an Index Only Scan
>> executed millions of times. And on top came the Nested Loop which
>> finally reduced the rows but also took a lot of time to do so.
>
>> Explain plan: https://explain.depesz.com/s/4GYT
>
> The core problem you've got there is the misestimation of the join size:
>
> Nested Loop (cost=0.71..0.30 rows=72,839,557 width=33) (actual time=19,504.096..315,933.158 rows=274 loops=1)
>
> Anytime the planner is off by a factor of 250000x, it's not going to end
> well. In this case, it's imagining that the LIMIT will kick in after just
> a very small part of the join is executed --- but in reality, the LIMIT
> is larger than the join output, so that we have to execute the whole join.
> With a more accurate idea of the join result size, it would have chosen
> a different plan.
>
> What you ought to look into is why is that estimate so badly off.
> Maybe out-of-date stats, or you need to raise the stats target for
> one or both tables?
>

I thought so as well, and that's why I started investigating, but after
creating my own data set and a final ANALYZE of both tables I ended up
with a similar difference between estimated and actual rows:

https://explain.depesz.com/s/R7jp

Nested Loop (cost=25.17..514,965,251.12 rows=27,021,979 width=56)
(actual time=0.568..5.686 rows=274 loops=1)

...but this one was fast thanks to the Merge Append and the pushed-down
LIMIT.
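
Just to make sure I read your stats-target suggestion right, it would be
along these lines; the "events" table and its "signal_id" column here are
only placeholders for the actual Timescale hypertable and columns:

-- raise the per-column statistics target on the join keys and re-analyze
-- (1000 is just a guess; the default default_statistics_target is 100)
ALTER TABLE subscription_signal ALTER COLUMN signal_id SET STATISTICS 1000;
ALTER TABLE events ALTER COLUMN signal_id SET STATISTICS 1000;
ANALYZE subscription_signal;
ANALYZE events;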

>> My question now is why can't the optimizer generate a plan that in this
>> case does 114 loops of "events" scans instead of a million loops on the
>> "subscription_signal"?
>
> I don't see any "events" table in that query, so this question isn't
> making a lot of sense to me. But in any case, the answer probably boils
> down to "it's guessing that a plan like this will stop early without
> having to scan all of the large table".
>
> regards, tom lane
>

Yes, the Timescale extension mangles the partition names quite a lot. I
wonder if it would be possible to hold the result of the smaller
estimated reference set (the 114 subscription_signal.signal_id entries in
this case) in a VALUES list and then use that to filter the table with
the larger estimate, instead of looping over it.
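
Roughly what I have in mind, done by hand in two steps (the "events"
table, the "subscription_id" and "time" columns and the literal ids below
are just placeholders):

-- step 1: fetch the small key set (estimated at ~114 rows here)
SELECT signal_id
FROM subscription_signal
WHERE subscription_id = 42;

-- step 2: inline those keys as a VALUES list so the planner can hash or
-- semi-join them against the big table instead of driving one index
-- (only) scan per outer row
SELECT e.*
FROM events e
WHERE e.signal_id IN (VALUES (101), (102), (103))  -- one row per signal_id
ORDER BY e.time DESC
LIMIT 500;

That way the large side would hopefully get a single hash/bitmap pass over
the 114 keys instead of millions of index-only-scan loops.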

Cheers,

Ancoron
