From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: limit in subquery causes poor selectivity estimation
Date: 2011-08-31 18:09:07
Message-ID: CA+TgmoYTSQP0F8SOP7FAL_6myT0GSYxH1WBxPO9iQ=n1KsTAqw@mail.gmail.com
Lists: pgsql-hackers
On Wed, Aug 31, 2011 at 6:22 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On lör, 2011-08-27 at 13:32 -0400, Tom Lane wrote:
>> > EXPLAIN SELECT * FROM test1 WHERE sha1 in (SELECT sha1 FROM test2
>> > LIMIT 200);
>>
>> > Here, however, it has apparently not passed this knowledge through
>> > the LIMIT.
>>
>> The LIMIT prevents the subquery from being flattened entirely, ie we
>> don't have just "test1 SEMI JOIN test2" but "test1 SEMI JOIN (SELECT *
>> FROM test2 LIMIT 200)". If you look at examine_variable in selfuncs.c
>> you'll note that it punts for Vars coming from unflattened subqueries.
>>
>> > So what's up with that? Just a case of, we haven't thought about
>> > covering this case yet, or are there larger problems?
>>
>> The larger problem is that if a subquery didn't get flattened, it's
>> often because it's got LIMIT, or GROUP BY, or some similar clause that
>> makes it highly suspect whether the statistics available for the table
>> column are reasonable to use for the subquery outputs. It wouldn't be
>> that hard to grab the stats for test2.sha1, but then how do you want
>> to adjust them to reflect the LIMIT?
>
> It turns out that this is a regression introduced in 8.4.8; the same
> topic is also being discussed in
>
> http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php
>
> and
>
> http://archives.postgresql.org/pgsql-general/2011-08/msg00995.php
>
> This is the (previously posted) plan with 8.4.8:
>
>                                     QUERY PLAN
> ----------------------------------------------------------------------------------
>  Hash Join  (cost=10.60..34.35 rows=500 width=31)
>    Hash Cond: (test1.sha1 = test2.sha1)
>    ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
>    ->  Hash  (cost=8.10..8.10 rows=200 width=32)
>          ->  HashAggregate  (cost=6.10..8.10 rows=200 width=32)
>                ->  Limit  (cost=0.00..3.60 rows=200 width=21)
>                      ->  Seq Scan on test2  (cost=0.00..18.01 rows=1001 width=21)
>
> And this is the plan with 8.4.7:
>
>                                     QUERY PLAN
> ----------------------------------------------------------------------------------
>  Hash Join  (cost=10.80..34.55 rows=200 width=31)
>    Hash Cond: (test1.sha1 = test2.sha1)
>    ->  Seq Scan on test1  (cost=0.00..18.00 rows=1000 width=31)
>    ->  Hash  (cost=8.30..8.30 rows=200 width=32)
>          ->  HashAggregate  (cost=6.30..8.30 rows=200 width=32)
>                ->  Limit  (cost=0.00..3.80 rows=200 width=21)
>                      ->  Seq Scan on test2  (cost=0.00..19.01 rows=1001 width=21)
>
> I liked the old one better. ;-)
AFAICS, those plans are identical, except for a minor difference in
the cost of scanning test2.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
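[For readers following along: the thread does not include the table definitions, but the plans above can be approximately reproduced with a setup like the following. The column types, fill data, and row counts are assumptions inferred from the EXPLAIN output (rows=1000 for test1, rows=1001 for test2); only the table and column names come from the quoted plans.]

```sql
-- Hypothetical reproduction of the test setup discussed above.
CREATE TABLE test1 (sha1 text);
CREATE TABLE test2 (sha1 text);

-- Fill with distinct hash-like values; row counts inferred from the plans.
INSERT INTO test1 SELECT md5(i::text) FROM generate_series(1, 1000) AS i;
INSERT INTO test2 SELECT md5(i::text) FROM generate_series(1, 1001) AS i;

ANALYZE test1;
ANALYZE test2;

-- The query whose join-row estimate changed between 8.4.7 and 8.4.8:
EXPLAIN SELECT * FROM test1 WHERE sha1 IN (SELECT sha1 FROM test2 LIMIT 200);
```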