Re: index v. seqscan for certain values

From: Bill Moran <wmoran(at)potentialtech(dot)com>
To: jdunn(at)autorevenue(dot)com
Cc: Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: index v. seqscan for certain values
Date: 2004-04-12 16:09:15
Message-ID: 407ABF2B.80306@potentialtech.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Quick bit of input, since you didn't mention it.

How often do you run ANALYZE? I found it interesting that a database I
was doing tests on sped up by a factor of 20 after ANALYZE. If your
data changes a lot, you should probably schedule ANALYZE to run with
VACUUM.

Jeremy Dunn wrote:
> I've searched the archives and can't find an answer to this seemingly
> simple question. Apologies if it's too common.
>
> The table in question has ~1.3M rows. It has 85 columns, 5 of which
> have single-column indexes.
>
> The column in question (CID) has 183 distinct values. For these values,
> the largest has ~38,000 rows, and the smallest has 1 row. About 30
> values have < 100 rows, and about 10 values have > 20,000 rows.
>
> The database is 7.2.3 running on RedHat 7.1. (we are in process of
> upgrading to PG 7.4.2) All of the query plan options are enabled, and
> the cpu costs are set to the default values. ( cpu_tuple_cost is 0.01,
> cpu_index_tuple_cost is 0.001). The database is VACUUM'd every night.
>
> The problem:
> A simply query:
> select count(*) from xxx where CID=<smalval>
> where <smalval> is a CID value which has relatively few rows, returns a
> plan using the index on that column.
>
> explain analyze select count(*) from xxx where cid=869366;
> Aggregate (cost=19136.33..19136.33 rows=1 width=0) (actual
> time=78.49..78.49 rows=1 loops=1)
> -> Index Scan using xxx_cid on emailrcpts (cost=0.00..19122.21
> rows=5648 width=0) (actual time=63.40..78.46 rows=1 loops=1)
> Total runtime: 78.69 msec
>
> The same plan is true for values which have up to about 20,000 rows:
>
> explain analyze select count(*) from xxx where cid=6223341;
> Aggregate (cost=74384.19..74384.19 rows=1 width=0) (actual
> time=11614.89..11614.89 rows=1 loops=1)
> -> Index Scan using xxx_cid on emailrcpts (cost=0.00..74329.26
> rows=21974 width=0) (actual time=35.75..11582.10 rows=20114 loops=1)
> Total runtime: 11615.05 msec
> However for the values that have > 20,000 rows, the plan changes to a
> sequential scan, which is proportionately much slower.
>
> explain analyze select count(*) from xxx where cid=7191032;
> Aggregate (cost=97357.61..97357.61 rows=1 width=0) (actual
> time=46427.81..46427.82 rows=1 loops=1)
> -> Seq Scan on xxx (cost=0.00..97230.62 rows=50792 width=0)
> (actual time=9104.45..46370.27 rows=37765 loops=1)
> Total runtime: 46428.00 msec
>
>
> The question: why does the planner consider a sequential scan to be
> better for these top 10 values? In terms of elapsed time it is more
> than twice as slow, proportionate to an index scan for the same number
> of rows.
>
> What I tried:
>
> A) alter table xxx alter column cid set statistics 500;
> analyze xxx;
> This does not affect the results.
>
> B) dropped/rebuilt the index, with no improvement.
>
> C) decreasing cpu_index_tuple_cost by a factor of up to 1000, with no
> success
>
> D) force an index scan for the larger values by using a very high value
> for cpu_tuple_cost (e.g. .5) but this doesn't seem like a wise thing to do.
>
> Your thoughts appreciated in advance!
>
> - Jeremy
>
> 7+ years experience in Oracle performance-tuning
> relatively new to postgresql

--
Bill Moran
Potential Technologies
http://www.potentialtech.com

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Jeremy Dunn 2004-04-12 17:08:05 Re: index v. seqscan for certain values
Previous Message Jeremy Dunn 2004-04-12 15:40:28 index v. seqscan for certain values