Re: More thoughts about planner's cost estimates

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: More thoughts about planner's cost estimates
Date: 2006-06-01 20:50:26
Message-ID: 200606011350.26364.josh@agliodbs.com
Lists: pgsql-hackers

Greg, Tom,

> a) We already use block based sampling to reduce overhead. If you're
> talking about using the entire block and not just randomly sampled
> tuples from within those blocks then your sample will be biased.

There are actually some good published estimators for this, which derive
both row correlation and n-distinct from samples of complete blocks. See
the prior discussions of n-distinct estimation on this list.
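To give a flavor of what I mean, here's a toy sketch (Python, written just
for illustration; it's the Chao1 lower-bound estimator applied naively to a
block sample, not a concrete proposal, and it still carries the clustering
bias Greg points out):

from collections import Counter

def chao1_ndistinct(sampled_values):
    # Chao1 estimate of the number of distinct values in the whole table,
    # from a sample: observed distinct count plus a correction based on
    # f1 (values seen exactly once) and f2 (values seen exactly twice).
    freq = Counter(sampled_values)
    d_observed = len(freq)
    f1 = sum(1 for c in freq.values() if c == 1)
    f2 = sum(1 for c in freq.values() if c == 2)
    if f2 == 0:
        return d_observed + f1 * (f1 - 1) / 2.0
    return d_observed + (f1 * f1) / (2.0 * f2)

# toy sample drawn from a handful of complete blocks
print(chao1_ndistinct([1, 1, 2, 3, 3, 4, 5, 5, 5, 6]))   # ~8.25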

> 1) You have n^2 possible two-column combinations. That's a lot of
> processing and storage.

Yes, that's the hard problem to solve. Actually, btw, once you go beyond
two-column pairs to arbitrary multi-column combinations it's more like n!,
not n^2.
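Just to put rough numbers on that (a back-of-the-envelope Python sketch of
my own, nothing more; the point is just how fast the storage requirement
blows up):

from math import comb, factorial

def combination_counts(n):
    # possible cross-column statistics for an n-column table
    pairs = comb(n, 2)                        # unordered two-column pairs
    subsets = 2 ** n - n - 1                  # all multi-column subsets
    ordered = sum(factorial(n) // factorial(n - k) for k in range(2, n + 1))
    return pairs, subsets, ordered

for n in (5, 10, 20):
    print(n, combination_counts(n))
# n = 20 gives only 190 pairs, but well over 10^18 ordered combinations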

> 2) It isn't even clear what data you're exactly looking for. Certainly
> "correlation" is just shorthand here and isn't what you're actually
> looking for.

Actually, I'd think that a single correlation estimate (0 = completely
uncorrelated, 1 = completely correlated) would be sufficient to improve row
count estimation significantly, without incurring the vast overhead of
histogram-by-histogram manipulation.
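To make that concrete, the kind of adjustment I have in mind is nothing more
sophisticated than interpolating between the independence assumption and
full correlation (a minimal sketch in Python; the formula is mine, not a
worked-out proposal):

def combined_selectivity(sel_a, sel_b, corr):
    # corr = 0: planner's current independence assumption
    # corr = 1: one predicate implies the other
    independent = sel_a * sel_b
    fully_correlated = min(sel_a, sel_b)
    return independent + corr * (fully_correlated - independent)

# two predicates that each match 10% of the rows:
print(combined_selectivity(0.1, 0.1, 0.0))   # 0.01 -- independent
print(combined_selectivity(0.1, 0.1, 1.0))   # 0.10 -- fully correlated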

> Those are all valid points, but pretty much orthogonal to what I'm on
> about today. The problems I'm thinking about are seen even when the
> planner's rowcount estimates are dead on (as indeed they mostly were
> in Philippe's example).

OK, I was afraid they were interdependent. You would know better than me.

> Well, it'll still exist as a GUC for the same reasons the other ones are
> GUCs. But I think the main reasons people have needed to twiddle it are
> (1) their database is completely RAM-resident (where RPC
> *should* logically be 1.0), or
> (2) they're trying to compensate for the overestimation of
> nestloop indexscans.
> The changes I'm proposing should help with (2).

Right. What I'm saying is that (1) should be derived from
estimated_cache_size and DBSIZE, not by setting an additional GUC.
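Something as crude as this would cover the RAM-resident case (a rough sketch
in Python; the function name and the simple linear interpolation are mine,
just to illustrate deriving the value rather than setting it):

def effective_random_page_cost(db_size_pages, cache_size_pages, rpc_disk=4.0):
    # Scale random_page_cost down toward 1.0 as a larger fraction of the
    # database is expected to fit in cache; fully cached => 1.0.
    cached_fraction = min(1.0, cache_size_pages / float(db_size_pages))
    return rpc_disk + cached_fraction * (1.0 - rpc_disk)

print(effective_random_page_cost(1000000, 1000000))   # 1.0 -- fully cached
print(effective_random_page_cost(1000000,  100000))   # 3.7 -- mostly on disk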

> > (1) the frequency with which that index is accessed, modified by
> > (2) whether the index is a fraction of available ram, or larger than ram
> > e) the probability that the relevant table pages are cached in memory,
> >    determined by the same two factors.
>
> These would all be nice things to know, but I'm afraid it's pie in the
> sky. We have no reasonable way to get those numbers. (And if we could
> get them, there would be another set of problems, namely plan
> instability: the planner's choices would become very difficult to
> reproduce.)

Hmmm ... I think those numbers are possible to infer, at least on a gross
level. Whether the cost of calculating them outweighs the benefit of
having them is another question, resolvable only through some
experimentation.
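As a gross first cut (again just a sketch of my own, not anything the
planner currently computes), the probability that a relation's pages are
cached could be approximated from its size relative to the share of cache
its access frequency earns it:

def cache_hit_probability(rel_size_pages, cache_size_pages, access_share=1.0):
    # access_share: rough fraction of the cache this relation "earns" by
    # how often it is accessed; a relation smaller than that share is
    # assumed fully cached, larger ones proportionally.
    available = cache_size_pages * access_share
    return min(1.0, available / float(rel_size_pages))

# an index 10x smaller than the cache, getting 20% of the accesses:
print(cache_hit_probability(10000, 100000, access_share=0.2))    # 1.0
print(cache_hit_probability(500000, 100000, access_share=0.2))   # 0.04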

> a good place to start. The immediate problem is to get an
> infrastructure in place that gives us some numbers to apply equations
> to.

No arguments, of course.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
