From: | Bradley Baetz <bbaetz(at)acm(dot)org> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [BUGS] New hashed IN code ignores distinctiveness of subquery |
Date: | 2003-01-27 07:29:19 |
Message-ID: | 20030127072918.GA24215@mango.home |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs pgsql-hackers |
[Moving -> -hackers]
On Mon, Jan 27, 2003 at 12:00:08AM -0500, Tom Lane wrote:
> Bradley Baetz <bbaetz(at)acm(dot)org> writes:
> > However, its much faster (although not as fast as sticking the DISTINCT
> > in there myself), but the actual rows coming from the sort is really odd
> > - where is that number coming from? How can sorting 9 rows take 44476
> > anythings?
>
> We're back full circle to my original comment about rescans in
> mergejoin. The EXPLAIN ANALYZE instrumentation counts a re-fetch
> as another returned row.
Hmm. OK, I poked through the code a bit more, and I think I now realise
why we were talking across each other. I've attached a 'patch' which
gets the mergejoin counts down to something reasonable. (The patch also
includes a bit to fix the estimated row count for JOIN_IN, as discussed
on -bugs)
When calculating the cost for a join, if a path is a T_UniquePath, then
the reduction in the number of rows to be examined isn't taken into
account. This is why the mergejoin code was calulating the cost of
merging two 50000 tuple paths - the overestimation that you menioned
earlier isn't as important here.
For reference, bugs is a 50000 row table, and there are 9 distinct
values for product_id.
Before(with the num_rows part of the patch, though)
bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3494816.98..3494816.98 rows=1 width=8) (actual
time=579.71..579.71 rows=1 loops=1)
-> Merge Join (cost=5169.41..3494691.43 rows=50218 width=8) (actual
time=111.41..530.16 rows=50000 loops=1)
Merge Cond: ("outer".product_id = "inner".product_id)
-> Index Scan using bugs_product_id_idx on bugs
(cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..249.57
rows=50000 loops=1)
-> Sort (cost=920.14..920.17 rows=9 width=4) (actual
time=111.25..143.42 rows=44476 loops=1)
Sort Key: public.bugs.product_id
-> HashAggregate (cost=920.00..920.00 rows=9 width=4)
(actual time=111.17..111.18 rows=9 loops=1)
-> Seq Scan on bugs (cost=0.00..795.00 rows=50000
width=4) (actual time=0.00..67.41 rows=50000 loops=1)
Total runtime: 579.84 msec
(9 rows)
After:
bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=8007.21..8007.21 rows=1 width=8) (actual
time=578.16..578.16 rows=1 loops=1)
-> Merge Join (cost=5169.41..7881.67 rows=50218 width=8) (actual
time=110.94..527.79 rows=50000 loops=1)
Merge Cond: ("outer".product_id = "inner".product_id)
-> Index Scan using bugs_product_id_idx on bugs
(cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..250.74
rows=50000 loops=1)
-> Sort (cost=920.14..920.17 rows=9 width=4) (actual
time=110.78..142.80 rows=44476 loops=1)
Sort Key: public.bugs.product_id
-> HashAggregate (cost=920.00..920.00 rows=9 width=4)
(actual time=110.70..110.71 rows=9 loops=1)
-> Seq Scan on bugs (cost=0.00..795.00 rows=50000
width=4) (actual time=0.00..67.14 rows=50000 loops=1)
Total runtime: 578.30 msec
(9 rows)
The patch isn't correct as-is, because it only covers merge joins:
bbaetz=# set enable_mergejoin=false;
SET
bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4281712.05..4281712.05 rows=1 width=8) (actual
time=410.14..410.14 rows=1 loops=1)
-> Hash Join (cost=1091.00..4281586.50 rows=50218 width=8) (actual
time=126.32..362.30 rows=50000 loops=1)
Hash Cond: ("outer".product_id = "inner".product_id)
-> Seq Scan on bugs (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.04..66.81 rows=50000 loops=1)
-> Hash (cost=795.00..795.00 rows=50000 width=4) (actual
time=126.08..126.08 rows=0 loops=1)
-> Seq Scan on bugs (cost=0.00..795.00 rows=50000
width=4) (actual time=0.02..68.23 rows=50000 loops=1)
Total runtime: 410.25 msec
(7 rows)
I don't think that propogating my hack to everywhere which wants to know
how many rows are returned is a good idea though - is there a more
generic way to get the number of rows really returned by a path?
>
> regards, tom lane
Bradley
Attachment | Content-Type | Size |
---|---|---|
unq | text/plain | 1.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Key88 SF | 2003-01-27 08:58:19 | Cursor case-sensitivity |
Previous Message | Tom Lane | 2003-01-27 05:00:08 | Re: New hashed IN code ignores distinctiveness of subquery |
From | Date | Subject | |
---|---|---|---|
Next Message | Antti Haapala | 2003-01-27 07:49:17 | Re: Call for objections: put back OIDs in CREATE TABLE |
Previous Message | Christopher Kings-Lynne | 2003-01-27 06:20:37 | Re: unquoted special constants |