Re: Confusion about composite indexes

From: Chris Curvey <chris(at)chriscurvey(dot)com>
To: Bill Mitchell <bill(at)publicrelay(dot)com>
Cc: General PostgreSQL List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Confusion about composite indexes
Date: 2012-05-21 19:58:02
Message-ID: CADfwSsDpE=ssioRcxJ8ry8B7wfC5neKzQcyamTu34HRj_Hcf1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, May 21, 2012 at 3:34 PM, Bill Mitchell <bill(at)publicrelay(dot)com> wrote:

> I am searching for some logic behind the selection of an index in
> postgres -- it seems that if I have a composite index based on both columns
> in a join table, it's only referenced if I query on the first term in the
> composite index. I've read
> http://www.postgresql.org/docs/9.1/static/indexes-multicolumn.html over
> and over and think that this is the same scenario as what I face.
>
> As an example:
> OUTLET: has OUTLET_ID as a primary key, consisting of about a million rows
> MEDIA: has MEDIA_ID as a primary key, and table consists of only 10 rows
> OUTLET_MEDIA: a join table used to correlate, and this has about a million
> rows
>
> Each outlet may have 1+ Media (technically, 0 or more, in this schema)
>
> Table "public.outlet_media"
> Column | Type | Modifiers
> -----------+--------+-----------
> outlet_id | bigint | not null
> media_id | bigint | not null
> Indexes:
> "outlet_media_pkey" PRIMARY KEY, btree (outlet_id, media_id)
> Foreign-key constraints:
> "fkfde1d912281e6fbf" FOREIGN KEY (media_id) REFERENCES media(media_id)
> "fkfde1d9125014e32a" FOREIGN KEY (outlet_id) REFERENCES
> outlet(outlet_id)
>
> When I test performance, using an OUTLET_ID, the query uses the
> outlet_media_pkey index
>
> # explain analyze select * from outlet_media where outlet_id in (select
> outlet_id from outlet order by random() limit 50);
> QUERY
> PLAN
>
> ------------------------------------------------------------------------------------------------------------------------------------------
> Nested Loop (cost=67625.64..68048.50 rows=50 width=16) (actual
> time=841.115..884.669 rows=50 loops=1)
> -> HashAggregate (cost=67625.64..67626.14 rows=50 width=8) (actual
> time=841.048..841.090 rows=50 loops=1)
> -> Limit (cost=67624.89..67625.01 rows=50 width=8) (actual
> time=840.980..841.011 rows=50 loops=1)
> -> Sort (cost=67624.89..70342.66 rows=1087110 width=8)
> (actual time=840.978..840.991 rows=50 loops=1)
> Sort Key: (random())
> Sort Method: top-N heapsort Memory: 27kB
> -> Seq Scan on outlet (cost=0.00..31511.88
> rows=1087110 width=8) (actual time=6.693..497.383 rows=1084628 loops=1)
> -> Index Scan using outlet_media_pkey on outlet_media
> (cost=0.00..8.43 rows=1 width=16) (actual time=0.869..0.870 rows=1 loops=50)
> Index Cond: (outlet_id = outlet.outlet_id)
> Total runtime: 884.759 ms
> (10 rows)
>
> However if I try the reverse, to search using the MEDIA_ID
> # explain analyze select * from outlet_media where media_id in (select
> media_id from media where media_name='Online News');
> QUERY
> PLAN
>
> -----------------------------------------------------------------------------------------------------------------------
> Hash Join (cost=1.19..21647.53 rows=362125 width=16) (actual
> time=0.034..0.034 rows=0 loops=1)
> Hash Cond: (outlet_media.media_id = media.media_id)
> -> Seq Scan on outlet_media (cost=0.00..16736.76 rows=1086376
> width=16) (actual time=0.012..0.012 rows=1 loops=1)
> -> Hash (cost=1.18..1.18 rows=1 width=8) (actual time=0.013..0.013
> rows=0 loops=1)
> Buckets: 1024 Batches: 1 Memory Usage: 0kB
> -> HashAggregate (cost=1.17..1.18 rows=1 width=8) (actual
> time=0.013..0.013 rows=0 loops=1)
> -> Seq Scan on media (cost=0.00..1.16 rows=1 width=8)
> (actual time=0.012..0.012 rows=0 loops=1)
> Filter: ((media_name)::text = 'Online News'::text)
> Total runtime: 0.084 ms
> (9 rows)
>
>
> Thanks in advance for whatever light can be shed. If it's safer for me to
> just create individual indexes on each of the two columns (" Multicolumn
> indexes should be used sparingly. In most situations, an index on a single
> column is sufficient and saves space and time")
>
> regards
> Bill
>
>
You are absolutely correct. It may be easier to understand if you think
about the structure of the index keys. For clarity, I'm going to pretend
that the "outlet" id is a character instead of an integer.

The index is sorted in the order of the columns in the index. So our
sorted values might look something like this (my apologies if the spacing
does not come through properly)

outlet media
A 1
A 2
A 3
B 1
B 2
B 3
C 1
C 2
C 3

So if I want to search for all the entries where outlet = 'a', then I find
the first 'A' record, search until I find something that is not 'A', and I
know that I'm done.

But to find all the entries where media = 1, I would have to search every
key in the index. And because there is an unknown amount of work involved
in searching the index, the optimizer assumes that using the index will be
more work than just scanning the table.

So what should you do? I would keep the 2-key index, because you want that
for uniqueness. And the index as you have structured it is useful for
searching if you are specifying the outlet. But you may want to add a
non-unique index on the media_id column, so that you will have a useful
index for searching.

The Standard Index Tradeoff Disclaimer (TM) applies: indexes can speed up
searches, but they will slow down inserts/updates/deletes because the index
must be maintained.

--
e-Mail is the equivalent of a postcard written in pencil. This message may
not have been sent by me, or intended for you. It may have been read or
even modified while in transit. e-Mail disclaimers have the same force in
law as a note passed in study hall. If your corporate attorney says that
you need an disclaimer in your signature, you need a new corporate attorney.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Merlin Moncure 2012-05-21 20:07:02 Re: Confusion about composite indexes
Previous Message Vo, Catherine CTR DTIC Z 2012-05-21 19:40:13 Re: backup script