index v. seqscan for certain values

From: "Jeremy Dunn" <jdunn(at)autorevenue(dot)com>
To: "Postgresql Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: index v. seqscan for certain values
Date: 2004-04-12 15:40:28
Message-ID: 000901c420a4$7b733650$4f01a8c0@jeremydunn
Lists: pgsql-performance

I've searched the archives and can't find an answer to this seemingly
simple question. Apologies if it's too common.

The table in question has ~1.3M rows. It has 85 columns, 5 of which
have single-column indexes.

The column in question (CID) has 183 distinct values. For these values,
the largest has ~38,000 rows, and the smallest has 1 row. About 30
values have < 100 rows, and about 10 values have > 20,000 rows.
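
For reference, the distribution described above is easy to see with a grouped count
(xxx/cid are the anonymized table and column names used throughout this message):

-- rows per distinct CID value, largest first
select cid, count(*) as num_rows
from xxx
group by cid
order by num_rows desc;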

The database is PostgreSQL 7.2.3 running on RedHat 7.1 (we are in the process of
upgrading to PG 7.4.2). All of the query plan options are enabled, and
the cpu costs are set to the default values (cpu_tuple_cost is 0.01,
cpu_index_tuple_cost is 0.001). The database is VACUUM'd every night.
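
For what it's worth, those settings can be confirmed directly with SHOW:

show cpu_tuple_cost;        -- reports the default 0.01 here
show cpu_index_tuple_cost;  -- reports the default 0.001 here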

The problem:
A simple query:
select count(*) from xxx where CID=<smallval>
where <smallval> is a CID value with relatively few rows, returns a
plan using the index on that column.

explain analyze select count(*) from xxx where cid=869366;
Aggregate (cost=19136.33..19136.33 rows=1 width=0) (actual time=78.49..78.49 rows=1 loops=1)
  -> Index Scan using xxx_cid on emailrcpts (cost=0.00..19122.21 rows=5648 width=0) (actual time=63.40..78.46 rows=1 loops=1)
Total runtime: 78.69 msec

The same plan holds for values with up to about 20,000 rows:

explain analyze select count(*) from xxx where cid=6223341;
Aggregate (cost=74384.19..74384.19 rows=1 width=0) (actual time=11614.89..11614.89 rows=1 loops=1)
  -> Index Scan using xxx_cid on emailrcpts (cost=0.00..74329.26 rows=21974 width=0) (actual time=35.75..11582.10 rows=20114 loops=1)
Total runtime: 11615.05 msec

However, for the values that have > 20,000 rows, the plan changes to a
sequential scan, which is proportionately much slower.

explain analyze select count(*) from xxx where cid=7191032;
Aggregate (cost=97357.61..97357.61 rows=1 width=0) (actual time=46427.81..46427.82 rows=1 loops=1)
  -> Seq Scan on xxx (cost=0.00..97230.62 rows=50792 width=0) (actual time=9104.45..46370.27 rows=37765 loops=1)
Total runtime: 46428.00 msec


The question: why does the planner consider a sequential scan to be
better for these top 10 values? In terms of elapsed time it is more
than twice as slow, proportionally, as an index scan for the same
number of rows.
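
One way to make that comparison concrete (purely a diagnostic sketch for a test
session, not something to leave enabled): steer the planner away from the seq scan
and re-run the same query, so the index-scan timing for one of the big CID values
can be put next to the seqscan timing above:

-- for this session only: make seq scans look prohibitively expensive,
-- so the planner picks the index even for the large CID values
set enable_seqscan = off;
explain analyze select count(*) from xxx where cid=7191032;
-- put the planner back to its normal behavior
reset enable_seqscan;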

What I tried:

A) alter table xxx alter column cid set statistics 500;
analyze xxx;
This does not affect the results (see the statistics check sketched
after this list).

B) dropped/rebuilt the index, with no improvement.

C) decreased cpu_index_tuple_cost by a factor of up to 1000, with no
success.

D) forced an index scan for the larger values by using a very high
cpu_tuple_cost (e.g. 0.5), but this doesn't seem like a wise thing to
do.
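
The statistics check mentioned under (A), as a sketch against the standard
pg_stats view (worth double-checking the column names on a 7.2 installation):
it shows what the planner actually knows about cid after the ANALYZE, which is
where its row estimates come from.

-- n_distinct plus the most-common-value list and frequencies are what
-- the planner uses to estimate rows for "cid = <value>"
select n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'xxx'
  and attname = 'cid';

Worth noting from the plans above: the estimates run high (21,974 estimated
vs. 20,114 actual for the index case, 50,792 vs. 37,765 for the seqscan case),
and an overestimate of that size may be part of what pushes the big values
over the seqscan threshold.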

Your thoughts appreciated in advance!

- Jeremy

7+ years' experience in Oracle performance tuning;
relatively new to PostgreSQL
