Quick Links

Re: Index Scans become Seq Scans after VACUUM ANALYSE

From:	mlw <markw(at)mohawksoft(dot)com>
To:	Andrew Sullivan <andrew(at)libertyrms(dot)info>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: Index Scans become Seq Scans after VACUUM ANALYSE
Date:	2002-04-17 20:28:03
Message-ID:	3CBDDAD3.D8744B3A@mohawksoft.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Andrew Sullivan wrote:
> You haven't shown anything except a couple of anecdotal reports as
> evidence against his view. Anyone who asks you for more evidence
> gets treated to a remark that statistics won't do everything in this
> case.

I do not, currently, have access to systems which exhibit the behavior, but I
have worked with PostgreSQL quite a bit, and have done a number of projects
with it, and have seen the issue first hand and have had to work around it. I
have posted detailed data to this list in the past.

> You'll need something stronger than an antipathy for
> statistical methods to support your position.

I do not have an antipathy for statistics at all, however statistics are a
reduction of data. They represent a number of properties obtained from a larger
group. For "statistics" to be useful, the trends or characteristics you
calculate must apply to the problem.

Oracle has a cost based optimizer, and they allow you to override it, offer
hints as to what it should do, or use the rules based optimizer. They know that
a cost based optimizer can not generate the best query all the time.

> The heuristic model
> you propose is a fundamental shift from the current attempts to make
> the planner choose better plans on the basis of what's in the
> database. You're saying it can never know enough. And I say, prove
> it.

I say it is obvious it can never know enough, since statistics are a summation
of the data set from which they were obtained, thus they can not contain all
the information about that data set unless they are at least as large as the
data set.
>
> > to one another. This is where heuristics and a bit of fuzziness
> > needs to be applied. Favoring an index scan over a sequential scan
> > would probably generate a better query.
>
> Tom has argued, several times, with reference to actual cases, why
> that is false. Repeating your view doesn't make it so.

For some reason, it seems that Tom is under the impression that I am saying
"always use and index" when that is not what I am saying at all.

Here is the logical argument: (correct me if I am wrong)

(1) The table statistics are a summation of the properties of the table which,
among other things, are thought to affect query performance.

(2) The planner uses the statistics to create estimates about how a strategy
will perform.

(3) The "estimates" based on the table statistics have a degree of uncertainty,
because they are based on the statistical information about the table not the
table itself. The only way to be 100% sure you get the best query is to try all
the permutations of the query.

(4) Since the generated estimates have a degree of uncertainty, when multiple
query paths are evaluated, the planner will choose a suboptimal query once in a
while.

Now my argument, based on my personal experience, and based on Tom's own
statement that +- 2 seconds on a 10 second query is not something the gets
excited about, is this:

When the planner is presented with a number of possible plans, it must weigh
the cost estimates. If there is a choice between two plans which are within
some percent range of each other, we can fall into a risk analysis.

For instance: say we have two similarly performing plans, close to one another,
say within 20%, one plan uses an index, and one does not. It is unlikely that
the index plan will perform substantially worse than the non-index plan, right?
That is the point of the cost estimates, right?

Now, given the choice of the two strategies on a table, both pretty close to
one another, the risk of poor performance for using the index scan is minimal
based on the statistics, but the risk of poor performance for using the
sequential scan is quite high on a large table.

Does anyone disagree?

In response to

Re: Index Scans become Seq Scans after VACUUM ANALYSE at 2002-04-17 18:41:28 from mlw

Responses

Re: Index Scans become Seq Scans after VACUUM ANALYSE at 2002-04-17 20:55:26 from Andrew Sullivan
Re: Index Scans become Seq Scans after VACUUM ANALYSE at 2002-04-17 20:56:06 from Bruce Momjian
Re: Index Scans become Seq Scans after VACUUM ANALYSE at 2002-04-18 04:01:21 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andrew Sullivan	2002-04-17 20:55:26	Re: Index Scans become Seq Scans after VACUUM ANALYSE
Previous Message	Hannu Krosing	2002-04-17 20:15:50	Re: Index Scans become Seq Scans after VACUUM ANALYSE