Re: Cheaper subquery scan not considered unless offset 0

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Benjamin Coutu <ben(dot)coutu(at)zeyos(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Cheaper subquery scan not considered unless offset 0
Date: 2017-10-29 14:58:00
Message-ID: 29740.1509289080@sss.pgh.pa.us
Lists: pgsql-performance

Benjamin Coutu <ben(dot)coutu(at)zeyos(dot)com> writes:
> Please consider the following three semantically equivalent, but differently written queries:
> ...
> Queries A + B generate the same plan and execute as follows:

> -> Finalize HashAggregate (cost=32879.78..33102.62 rows=22285 width=12) (actual time=450.724..458.667 rows=15521 loops=1)
> Group Key: b.item
> Filter: (sum(b.amount) >= '1'::double precision)
> Rows Removed by Filter: 48277

> Plan C though, thanks to the "offset optimization fence", executes the following, more efficient plan:

> -> Subquery Scan on c (cost=32768.35..33269.76 rows=7428 width=12) (actual time=456.591..475.204 rows=15521 loops=1 total=475.204)
> Filter: (c.stock >= '1'::double precision)
> Rows Removed by Filter: 48277
> -> Finalize HashAggregate (cost=32768.35..32991.20 rows=22285 width=12) (actual time=456.582..468.124 rows=63798 loops=1 total=468.124)
> Group Key: b.item

Huh. So we can see that the grouping step produces 63798 rows in reality,
of which 15521 pass the >= filter condition. In Plan C, the planner
estimates the total number of group rows at 22285; then, having no
information about the statistics of c.stock, it uses DEFAULT_INEQ_SEL
(0.333) as the filter selectivity estimate, arriving at 7428 as the
estimated number of result rows for the subquery.
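
For reference, the arithmetic is simply 22285 * DEFAULT_INEQ_SEL. Here
is a minimal standalone sketch of that calculation; the constant matches
its definition in src/include/utils/selfuncs.h, but the program itself is
purely illustrative, not planner code:

    #include <stdio.h>

    /*
     * Same value as DEFAULT_INEQ_SEL in src/include/utils/selfuncs.h:
     * the default selectivity the planner assumes for an inequality
     * qual when it has no column statistics to work with.
     */
    #define DEFAULT_INEQ_SEL 0.3333333333333333

    int
    main(void)
    {
        double group_rows = 22285.0;  /* estimated grouped rows */
        double est_rows = group_rows * DEFAULT_INEQ_SEL;

        /* prints 7428, matching the Plan C estimate */
        printf("estimated rows out of subquery: %.0f\n", est_rows);
        return 0;
    }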

In Plan A+B, the planner presumably estimated the number of group rows at
22285 as well, but it then reports 22285 as the overall result, with no
reduction for the filter at all. Uh, what about the HAVING?

Evidently, the difference between 7428 and 22285 estimated rows out of
the subquery is enough to prompt a change in join plan for this query.
Since the true number is in between, it's just luck that Plan C is faster.
I wouldn't put much stock in either join plan having been chosen for sound
reasons, given those estimates.

But ... what about the HAVING? I took a quick look around and couldn't
find anyplace where the selectivity of an aggregate's filter condition
gets accounted for, which explains this observed behavior. That seems
like a big oversight :-(

Now, it's true that we're basically never gonna be able to do better than
default selectivity estimates for post-aggregation filter conditions.
Maybe, at some point in the dim past, somebody intentionally decided that
applying the standard selectivity estimation logic to HAVING clauses was a
loser. But I don't see any comments to that effect, and anyway taking the
selectivity as 1.0 all the time doesn't seem very bright either.

Changing this in back branches might be too much of a behavioral change,
but it seems like we oughta change HEAD to apply standard selectivity
estimation to the HAVING clause.
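
For concreteness, the kind of change I have in mind would discount the
grouped-row estimate by the estimated selectivity of the HAVING qual,
roughly like this. This is only a sketch, not a finished patch: the
placement and the variable name dNumGroups are illustrative, and it
assumes havingQual has already been converted to implicit-AND list form,
as the planner's qual preprocessing does:

    /*
     * Sketch only: instead of implicitly assuming every group passes
     * the HAVING qual, scale the estimated number of groups by the
     * qual's estimated selectivity.
     */
    if (parse->havingQual)
        dNumGroups *= clauselist_selectivity(root,
                                             (List *) parse->havingQual,
                                             0,          /* no varRelid */
                                             JOIN_INNER,
                                             NULL);      /* no SpecialJoinInfo */

Even with nothing better than default selectivities to plug in, that
would at least move us off the always-1.0 assumption.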

regards, tom lane
