Confusion about composite indexes

From: Bill Mitchell <bill(at)publicrelay(dot)com>
To: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Confusion about composite indexes
Date: 2012-05-21 19:34:42
Message-ID: 4FBA98D2.9030509@publicrelay.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I am searching for some logic behind the selection of an index in
postgres -- it seems that if I have a composite index based on both
columns in a join table, it's only referenced if I query on the first
term in the composite index. I've read
http://www.postgresql.org/docs/9.1/static/indexes-multicolumn.html over
and over and think that this is the same scenario as what I face.

As an example:
OUTLET: has OUTLET_ID as a primary key, consisting of about a million rows
MEDIA: has MEDIA_ID as a primary key, and table consists of only 10 rows
OUTLET_MEDIA: a join table used to correlate, and this has about a
million rows

Each outlet may have 1+ Media (technically, 0 or more, in this schema)

Table "public.outlet_media"
Column | Type | Modifiers
-----------+--------+-----------
outlet_id | bigint | not null
media_id | bigint | not null
Indexes:
"outlet_media_pkey" PRIMARY KEY, btree (outlet_id, media_id)
Foreign-key constraints:
"fkfde1d912281e6fbf" FOREIGN KEY (media_id) REFERENCES media(media_id)
"fkfde1d9125014e32a" FOREIGN KEY (outlet_id) REFERENCES
outlet(outlet_id)

When I test performance, using an OUTLET_ID, the query uses the
outlet_media_pkey index

# explain analyze select * from outlet_media where outlet_id in (select
outlet_id from outlet order by random() limit 50);
QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=67625.64..68048.50 rows=50 width=16) (actual
time=841.115..884.669 rows=50 loops=1)
-> HashAggregate (cost=67625.64..67626.14 rows=50 width=8) (actual
time=841.048..841.090 rows=50 loops=1)
-> Limit (cost=67624.89..67625.01 rows=50 width=8) (actual
time=840.980..841.011 rows=50 loops=1)
-> Sort (cost=67624.89..70342.66 rows=1087110 width=8)
(actual time=840.978..840.991 rows=50 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 27kB
-> Seq Scan on outlet (cost=0.00..31511.88
rows=1087110 width=8) (actual time=6.693..497.383 rows=1084628 loops=1)
-> Index Scan using outlet_media_pkey on outlet_media
(cost=0.00..8.43 rows=1 width=16) (actual time=0.869..0.870 rows=1 loops=50)
Index Cond: (outlet_id = outlet.outlet_id)
Total runtime: 884.759 ms
(10 rows)

However if I try the reverse, to search using the MEDIA_ID
# explain analyze select * from outlet_media where media_id in (select
media_id from media where media_name='Online News');
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1.19..21647.53 rows=362125 width=16) (actual
time=0.034..0.034 rows=0 loops=1)
Hash Cond: (outlet_media.media_id = media.media_id)
-> Seq Scan on outlet_media (cost=0.00..16736.76 rows=1086376
width=16) (actual time=0.012..0.012 rows=1 loops=1)
-> Hash (cost=1.18..1.18 rows=1 width=8) (actual time=0.013..0.013
rows=0 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 0kB
-> HashAggregate (cost=1.17..1.18 rows=1 width=8) (actual
time=0.013..0.013 rows=0 loops=1)
-> Seq Scan on media (cost=0.00..1.16 rows=1 width=8)
(actual time=0.012..0.012 rows=0 loops=1)
Filter: ((media_name)::text = 'Online News'::text)
Total runtime: 0.084 ms
(9 rows)

Thanks in advance for whatever light can be shed. If it's safer for me
to just create individual indexes on each of the two columns ("
Multicolumn indexes should be used sparingly. In most situations, an
index on a single column is sufficient and saves space and time")

regards
Bill

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Vo, Catherine CTR DTIC Z 2012-05-21 19:40:13 Re: backup script
Previous Message Merlin Moncure 2012-05-21 16:14:03 Re: Libpq question