Re: index v. seqscan for certain values

From: "Jeremy Dunn" <jdunn(at)autorevenue(dot)com>
To: "'Postgresql Performance'" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: index v. seqscan for certain values
Date: 2004-04-12 19:05:02
Message-ID: 001d01c420c1$0ed23290$4f01a8c0@jeremydunn
Lists: pgsql-performance

> "Jeremy Dunn" <jdunn(at)autorevenue(dot)com> writes:
> > The question: why does the planner consider a sequential scan to be
> > better for these top 10 values?
>
> At some point a seqscan *will* be better. In the limit, if
> the key being sought is common enough to occur on every page
> of the table, it's certain that a seqscan will require less
> I/O than an indexscan (because reading the index isn't
> actually saving you any heap fetches). In practice the
> breakeven point is less than that because Unix kernels are
> better at handling sequential than random access.
>
> Your gripe appears to be basically that the planner's idea of
> the breakeven point is off a bit. It looks to me like it's
> within about a factor of 2 of being right, though, which is
> not all that bad when it's using generic cost parameters.

Agreed. However, given that count(*) is a question that can be answered
_solely_ using the index (without reference to the actual data blocks),
I'd expect that the break-even point would be considerably higher than
the < 3% (~38,000 / ~1.3M) I'm currently getting. Does PG not use
solely the index in this situation?
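
For reference, here is a quick way to see both plans for the same value
(a session sketch; enable_seqscan is toggled only for the comparison,
not as a production setting):

-- force the planner away from the seqscan to see the indexscan cost
set enable_seqscan = off;
explain analyze select count(*) from xxx where cid=7191032;

-- restore the default; the planner picks the seqscan again for this value
set enable_seqscan = on;
explain analyze select count(*) from xxx where cid=7191032;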

> > A) alter table xxx alter column cid set statistics 500;
> > analyze xxx;
> > This does not affect the results.
>
> It probably improved the accuracy of the row count estimates,
> no? The estimate you show for cid=7191032 is off by more than
> 25% (37765 vs 50792), which seems like a lot of error for one
> of the most common values in the table. (I hope that was
> with default stats target and not 500.) That leads directly
> to a 25% overestimate of the cost of an indexscan, while
> having IIRC no impact on the cost of a seqscan. Since the
> cost ratio was more than 25%, this didn't change the selected
> plan, but you want to fix that error as best you can before
> you move on to tweaking cost parameters.

Actually it made them worse! Yes, this was with the default statistics
target (10). When I tried it again with a value of 300, ran analyze, and
then ran the query, I got a *worse* estimate. I don't understand this.

alter table xxx alter column cid set statistics 300;
analyze xxx;
set random_page_cost to 2;
explain analyze select count(*) from xxx where cid=7191032;

Aggregate  (cost=20563.28..20563.28 rows=1 width=0) (actual time=7653.90..7653.90 rows=1 loops=1)
  ->  Index Scan using xxx_cid on xxx  (cost=0.00..20535.82 rows=10983 width=0) (actual time=72.24..7602.38 rows=37765 loops=1)
Total runtime: 7654.14 msec

Now it estimates I have only 10,983 rows (~3x too low) instead of the
old estimate of 50,792 (~1.3x too high). Why is that?
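
For anyone who wants to dig into this, the sampled statistics the planner
is working from can be inspected directly (a sketch against the pg_stats
view, using the table/column names above):

select n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'xxx' and attname = 'cid';

If 7191032 shows up in most_common_vals, the row estimate comes straight
from its sampled frequency; otherwise the planner falls back to a guess
based on n_distinct.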

Anyway, a workable solution seems to be using a lower value for
random_page_cost. Thanks to everyone who replied with this answer.

> Also it is likely appropriate to increase
> effective_cache_size, which is awfully small in the default
> configuration. I'd set that to something related to your
> available RAM before trying to home in on a suitable random_page_cost.

We have ours set to the default value of 1000, which does seem low for a
system with 1GB of RAM. We'll raise it once we figure out what's actually
available, then tweak random_page_cost appropriately at that point.
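
Concretely, something like this is what I have in mind (a session-level
sketch; the numbers assume ~1GB of RAM and the default 8KB block size,
and are starting guesses rather than measurements):

-- effective_cache_size is measured in 8KB pages:
-- 65536 pages * 8KB = 512MB, roughly half the 1GB of RAM
set effective_cache_size = 65536;
-- lowered from the default of 4, per the replies on this thread
set random_page_cost = 2;

The same values could go in postgresql.conf once they prove out.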

I'd still like to understand the strangeness above, if anyone can shed
light.

- Jeremy
