IN not handled very well?

From: Ben <bench(at)silentmedia(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: IN not handled very well?
Date: 2006-09-24 17:12:23
Message-ID: 37536ADC-0D18-485C-9F77-B875D69FBCAA@silentmedia.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I've got this query with an IN clause:

select count(*),public.album.gid,public.album.name,public.album.id
from public.album,public.albumjoin,public.puid,public.puidjoin where
albumjoin.album = public.album.id and public.puidjoin.track =
public.albumjoin.track and public.puid.id = public.puidjoin.puid and
public.puid.puid
IN (select umdb.puid.name from umdb.puid,umdb.node where umdb.puid.id
= umdb.node.puid and umdb.node.dir=5886)
group by gid,name,public.album.id having count(*) >= 6 order by
count(*) desc;

It gives me a rather expensive plan:

Sort (cost=35729.07..35729.75 rows=272 width=69)
Sort Key: count(*)
-> HashAggregate (cost=35713.31..35718.07 rows=272 width=69)
Filter: (count(*) >= 6)
-> Nested Loop (cost=51.67..35709.91 rows=272 width=69)
-> Nested Loop (cost=51.67..34216.30 rows=272 width=4)
-> Nested Loop (cost=51.67..33338.04 rows=272
width=4)
-> Hash IN Join (cost=51.67..31794.72
rows=218 width=4)
Hash Cond: (("outer".puid)::text =
"inner".name)
-> Seq Scan on puid
(cost=0.00..23495.21 rows=1099421 width=44)
-> Hash (cost=51.63..51.63
rows=15 width=40)
-> Nested Loop
(cost=0.00..51.63 rows=15 width=40)
-> Index Scan using
node_dir on node (cost=0.00..3.22 rows=16 width=4)
Index Cond: (dir
= 5886)
-> Index Scan using
puid_pkey on puid (cost=0.00..3.01 rows=1 width=44)
Index Cond:
(puid.id = "outer".puid)
-> Index Scan using puidjoin_puidtrack
on puidjoin (cost=0.00..7.05 rows=2 width=8)
Index Cond: ("outer".id =
puidjoin.puid)
-> Index Scan using albumjoin_trackindex on
albumjoin (cost=0.00..3.22 rows=1 width=8)
Index Cond: ("outer".track =
albumjoin.track)
-> Index Scan using album_pkey on album
(cost=0.00..5.48 rows=1 width=69)
Index Cond: ("outer".album = album.id)

If I'm reading this right, it looks like it's expensive because it's
doing a sequential scan on public.puid.puid to satisfy the IN clause.
(Although why it's doing that I'm not sure, given that there's a
recently analyzed index on public.puid.puid.) Interestingly, if I
replace that IN subselect with the 15 values it will return, my plan
improves by two orders of magnitude:

Sort (cost=235.53..235.56 rows=12 width=69)
Sort Key: count(*)
-> HashAggregate (cost=235.11..235.32 rows=12 width=69)
Filter: (count(*) >= 6)
-> Nested Loop (cost=20.03..234.96 rows=12 width=69)
-> Nested Loop (cost=20.03..169.06 rows=12 width=4)
-> Nested Loop (cost=20.03..130.32 rows=12
width=4)
-> Bitmap Heap Scan on puid
(cost=20.03..59.52 rows=10 width=4)
Recheck Cond: ((puid =
'f68dcf86-992c-2e4a-21fb-2fc8c56edfeb'::bpchar) OR (puid =
'7716dbcf-56ab-623b-ab33-3b2e67a0727c'::bpchar) OR (puid =
'724d6a39-0d15-a296-2dd2-127c34f13809'::bpchar) OR (puid =
'02f0cd9f-9fa5-abda-06cd-5dbb13826243'::bpchar) OR (puid = '165d5bea-
b21f-9302-b991-0927f491787b'::bpchar) OR (puid = '4223dbc8-85af-a92e-
b63d-72a726475e2c'::bpchar) OR (puid = '2d43ef9a-
c7ee-2425-7fac-8b937cbed178'::bpchar) OR (puid = '9ff81c2f-04b7-
cf5d-705f-7b944a5ae093'::bpchar) OR (puid = 'deaddddd-dfaf-18dd-6d4d-
c483e8ba60f7'::bpchar) OR (puid = '20939b69-
ff98-770a-1444-3b0e9892712f'::bpchar))
-> BitmapOr (cost=20.03..20.03
rows=10 width=0)
-> Bitmap Index Scan on
puid_puidindex (cost=0.00..2.00 rows=1 width=0)
Index Cond: (puid =
'f68dcf86-992c-2e4a-21fb-2fc8c56edfeb'::bpchar)
-> Bitmap Index Scan on
puid_puidindex (cost=0.00..2.00 rows=1 width=0)
Index Cond: (puid =
'7716dbcf-56ab-623b-ab33-3b2e67a0727c'::bpchar)
-> Bitmap Index Scan on
puid_puidindex (cost=0.00..2.00 rows=1 width=0)
Index Cond: (puid =
'724d6a39-0d15-a296-2dd2-127c34f13809'::bpchar)
-> Bitmap Index Scan on
puid_puidindex (cost=0.00..2.00 rows=1 width=0)
Index Cond: (puid =
'02f0cd9f-9fa5-abda-06cd-5dbb13826243'::bpchar)
-> Bitmap Index Scan on
puid_puidindex (cost=0.00..2.00 rows=1 width=0)
Index Cond: (puid =
'165d5bea-b21f-9302-b991-0927f491787b'::bpchar)
-> Bitmap Index Scan on
puid_puidindex (cost=0.00..2.00 rows=1 width=0)
Index Cond: (puid =
'4223dbc8-85af-a92e-b63d-72a726475e2c'::bpchar)
-> Bitmap Index Scan on
puid_puidindex (cost=0.00..2.00 rows=1 width=0)
Index Cond: (puid =
'2d43ef9a-c7ee-2425-7fac-8b937cbed178'::bpchar)
-> Bitmap Index Scan on
puid_puidindex (cost=0.00..2.00 rows=1 width=0)
Index Cond: (puid =
'9ff81c2f-04b7-cf5d-705f-7b944a5ae093'::bpchar)
-> Bitmap Index Scan on
puid_puidindex (cost=0.00..2.00 rows=1 width=0)
Index Cond: (puid =
'deaddddd-dfaf-18dd-6d4d-c483e8ba60f7'::bpchar)
-> Bitmap Index Scan on
puid_puidindex (cost=0.00..2.00 rows=1 width=0)
Index Cond: (puid =
'20939b69-ff98-770a-1444-3b0e9892712f'::bpchar)
-> Index Scan using puidjoin_puidtrack
on puidjoin (cost=0.00..7.05 rows=2 width=8)
Index Cond: ("outer".id =
puidjoin.puid)
-> Index Scan using albumjoin_trackindex on
albumjoin (cost=0.00..3.22 rows=1 width=8)
Index Cond: ("outer".track =
albumjoin.track)
-> Index Scan using album_pkey on album
(cost=0.00..5.48 rows=1 width=69)
Index Cond: ("outer".album = album.id)

I guess my question is: if postgres is (correctly) estimating that
only 15 rows will come out of the subselect, and it knows it can
choose a much better plan with bitmap index scans, should it be able
to choose the bitmap plan over the sequential scan? Or should I run
the subselect myself and then rewrite my query to push in the
constant values?

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2006-09-24 17:57:54 Re: IN not handled very well?
Previous Message Ron Mayer 2006-09-24 13:29:50 Re: Large tables (was: RAID 0 not as fast as