Query planner question (7.4.5)

From: Arthur Ward <award(at)dominionsciences(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Query planner question (7.4.5)
Date: 2005-02-08 00:41:15
Message-ID: 42080AAB.4080109@dominionsciences.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I was looking at some stubborn queries in one of our applications, and
not knowing the internals of the query planner, thought I might ask if
this planner improvement is possible at all. We have queries with the
general form of IN (SELECT FROM AA JOIN (SELECT foo UNION ALL SELECT
bar)) clauses. (There's a view here, but that's not the planner's
concern.) The corresponding part of the plan looks like this:

-> HashAggregate (cost=135960.26..135960.26 rows=200 width=4)
-> Nested Loop (cost=1.14..130852.49 rows=2043108 width=4)
-> Index Scan using AA_pkey on AA (cost=0.00..85.01
rows=1 width=12)
-> Subquery Scan BB (cost=1.14..110336.40 rows=2043108
width=36)
-> Append (cost=1.14..89905.32 rows=2043108
width=16)
-> Subquery Scan "*SELECT* 1"
(cost=1.14..75017.28 rows=1634954 width=16)
-> Subquery Scan "*SELECT* 2"
(cost=0.00..14888.04 rows=408154 width=16)

The problem for us is that the default estimate at the HashAggregate is
absurdly low, undercutting the other available join candidates' row
estimates resulting in _bad_ plans. What I was wondering is whether the
planner has enough information available when looking at the
HashAggregate step (the IN clause that started this whole mess) to
consider the statistics of the inputs to the Append. In our case, I know
that the second Subquery scan is generating strictly unique values, so
the HashAggregate cannot produce fewer than 408154 rows itself.
Additionally, those subquery scans are simple queries without anything
to cause the table statistics to be thrown away if they can be reused in
later steps (or is the rowcount the only stat used after a node is
created?) -- one subquery is a single-table filter, the other is a two
table join, also nothing special.

Essentially, does the data exist for the planner to say at the
HashAggregate, "a-ha! Subquery 2 estimates 400000 distinct rows!" and
estimate something >= 400000?

If this is improved in 8.0, my apologies; we haven't had the opportunity
to begin a migration, and the release notes only say there are a lot of
small improvements. If this is a special case of a long-standing
limitation, I'm not seeing the general case this fits in to, and I'll
happily sit down and be quiet. :-)

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Kirkwood 2005-02-08 01:50:41 Re: Query optimizer 8.0.1 (and 8.0)
Previous Message Joshua D. Drake 2005-02-07 23:44:42 PHP/PDO Database Abstraction Layer