Re: Confusion about composite indexes

From: Dmitriy Igrishin <dmitigr(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Bill Mitchell <bill(at)publicrelay(dot)com>, General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Confusion about composite indexes
Date: 2012-05-21 20:36:07
Message-ID: CAAfz9KMVTwwpHpvuQAvOBrEuBRGsC2wo2vkJ=oajs08vQa9tgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

2012/5/22 Merlin Moncure <mmoncure(at)gmail(dot)com>

> On Mon, May 21, 2012 at 2:34 PM, Bill Mitchell <bill(at)publicrelay(dot)com>
> wrote:
> > I am searching for some logic behind the selection of an index in
> postgres
> > -- it seems that if I have a composite index based on both columns in a
> join
> > table, it's only referenced if I query on the first term in the composite
> > index. I've read
> > http://www.postgresql.org/docs/9.1/static/indexes-multicolumn.html over
> and
> > over and think that this is the same scenario as what I face.
> >
> > As an example:
> > OUTLET: has OUTLET_ID as a primary key, consisting of about a million
> rows
> > MEDIA: has MEDIA_ID as a primary key, and table consists of only 10 rows
> > OUTLET_MEDIA: a join table used to correlate, and this has about a
> million
> > rows
> >
> > Each outlet may have 1+ Media (technically, 0 or more, in this schema)
> >
> > Table "public.outlet_media"
> > Column | Type | Modifiers
> > -----------+--------+-----------
> > outlet_id | bigint | not null
> > media_id | bigint | not null
> > Indexes:
> > "outlet_media_pkey" PRIMARY KEY, btree (outlet_id, media_id)
> > Foreign-key constraints:
> > "fkfde1d912281e6fbf" FOREIGN KEY (media_id) REFERENCES
> media(media_id)
> > "fkfde1d9125014e32a" FOREIGN KEY (outlet_id) REFERENCES
> > outlet(outlet_id)
> >
> > When I test performance, using an OUTLET_ID, the query uses the
> > outlet_media_pkey index
> >
> > # explain analyze select * from outlet_media where outlet_id in (select
> > outlet_id from outlet order by random() limit 50);
> > QUERY
> > PLAN
> >
> ------------------------------------------------------------------------------------------------------------------------------------------
> > Nested Loop (cost=67625.64..68048.50 rows=50 width=16) (actual
> > time=841.115..884.669 rows=50 loops=1)
> > -> HashAggregate (cost=67625.64..67626.14 rows=50 width=8) (actual
> > time=841.048..841.090 rows=50 loops=1)
> > -> Limit (cost=67624.89..67625.01 rows=50 width=8) (actual
> > time=840.980..841.011 rows=50 loops=1)
> > -> Sort (cost=67624.89..70342.66 rows=1087110 width=8)
> > (actual time=840.978..840.991 rows=50 loops=1)
> > Sort Key: (random())
> > Sort Method: top-N heapsort Memory: 27kB
> > -> Seq Scan on outlet (cost=0.00..31511.88
> > rows=1087110 width=8) (actual time=6.693..497.383 rows=1084628 loops=1)
> > -> Index Scan using outlet_media_pkey on outlet_media
> (cost=0.00..8.43
> > rows=1 width=16) (actual time=0.869..0.870 rows=1 loops=50)
> > Index Cond: (outlet_id = outlet.outlet_id)
> > Total runtime: 884.759 ms
> > (10 rows)
> >
> > However if I try the reverse, to search using the MEDIA_ID
> > # explain analyze select * from outlet_media where media_id in (select
> > media_id from media where media_name='Online News');
> > QUERY
> > PLAN
> >
> -----------------------------------------------------------------------------------------------------------------------
> > Hash Join (cost=1.19..21647.53 rows=362125 width=16) (actual
> > time=0.034..0.034 rows=0 loops=1)
> > Hash Cond: (outlet_media.media_id = media.media_id)
> > -> Seq Scan on outlet_media (cost=0.00..16736.76 rows=1086376
> width=16)
> > (actual time=0.012..0.012 rows=1 loops=1)
> > -> Hash (cost=1.18..1.18 rows=1 width=8) (actual time=0.013..0.013
> > rows=0 loops=1)
> > Buckets: 1024 Batches: 1 Memory Usage: 0kB
> > -> HashAggregate (cost=1.17..1.18 rows=1 width=8) (actual
> > time=0.013..0.013 rows=0 loops=1)
> > -> Seq Scan on media (cost=0.00..1.16 rows=1 width=8)
> > (actual time=0.012..0.012 rows=0 loops=1)
> > Filter: ((media_name)::text = 'Online News'::text)
> > Total runtime: 0.084 ms
> > (9 rows)
> >
> >
> > Thanks in advance for whatever light can be shed. If it's safer for me
> to
> > just create individual indexes on each of the two columns (" Multicolumn
> > indexes should be used sparingly. In most situations, an index on a
> single
> > column is sufficient and saves space and time")
>
> There are a couple of things going on here. First of all, 'order by
> random() limit..' guarantees a full seq scan and a short of whatever
> you're ordering, so it's never going to be very efficient. When you
> wen the other way, you supplied a hard search term which can be fed to
> the index and optimized through.
>
> Secondly, if you have a mapping table that has to support searches in
> both directions, it's pretty typical to do something like this:
>
> create table map(a int references ..., b int references..., primary
> key(a,b));
> create index on map(b);
>
> So you can get fully index lookups on all of a, b, ab, and ba. the
> primary key can't optimize ba because indexes only fully match if
> candidate fields are supplied from left to right order. They can
> still help somewhat, but to a lesser degree.
>
BTW, I would like to know is it worth it to create 3rd index on map(a)
to reduce the size of the index which will be used by the planer
to save some server's RAM (obviously, at the cost of extra disk space) ?

--
// Dmitriy.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message J.V. 2012-05-21 20:39:40 how to for loop with distinct values?
Previous Message Merlin Moncure 2012-05-21 20:07:02 Re: Confusion about composite indexes