From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: More thoughts about planner's cost estimates
Date: 2006-06-01 20:50:26
Message-ID: 200606011350.26364.josh@agliodbs.com
Lists: pgsql-hackers
Greg, Tom,
> a) We already use block based sampling to reduce overhead. If you're
> talking about using the entire block and not just randomly sampled
> tuples from within those blocks then your sample will be biased.
There are actually some really good equations to work with this, estimating
both row correlation and n-distinct based on sampling complete blocks.
See prior discussions on this list on N-distinct.
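Just to make that concrete, here's the sort of strawman I have in mind
(Python, purely illustrative; the block layout and the use of a
Duj1-style estimator over a whole-block sample are my assumptions, not
anything ANALYZE does today):

# Strawman: estimate n-distinct from a sample of *whole* blocks rather
# than individually sampled rows.  'table' is a list of blocks, each a
# list of values for one column; total_rows is the true row count.
import random
from collections import Counter

def ndistinct_from_block_sample(table, num_blocks, total_rows):
    sampled_blocks = random.sample(table, num_blocks)
    values = [v for block in sampled_blocks for v in block]
    n = len(values)                                  # rows sampled
    counts = Counter(values)
    d = len(counts)                                  # distinct in sample
    f1 = sum(1 for c in counts.values() if c == 1)   # seen exactly once
    # Haas/Stokes "Duj1"-style scale-up of the sample's distinct count.
    denom = n - f1 + f1 * (n / total_rows)
    return d if denom <= 0 else (n * d) / denom

The interesting part, and the answer to the bias objection, is what
correction you apply for within-block value clustering; the prior
N-distinct discussion covers that, and the sketch above deliberately
leaves it out.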
> 1) You have n^2 possible two-column combinations. That's a lot of
> processing and storage.
Yes, that's the hard problem to solve. Actually, btw, it's n(n-1)/2 just
for the two-column pairs, and it gets much worse than n^2 once you count
combinations of more than two columns.
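To put some numbers on it (quick arithmetic, Python just for convenience):

# Two-column pairs stay polynomial; it's the multi-column combinations
# that explode.
from math import comb

for n in (10, 50, 250):
    pairs = comb(n, 2)             # n(n-1)/2 two-column pairs
    multi = 2**n - n - 1           # every combination of 2..n columns
    print(n, pairs, multi)

Even at 250 columns the pair count is only ~31,000; it's the larger
combinations that are hopeless to enumerate.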
> 2) It isn't even clear what data you're exactly looking for. Certainly
> "correlation" is just shorthand here and isn't what you're actually
> looking for.
Actually, I'd think that a correlation number estimate (0 = completely
uncorrelated, 1 = completely correlated) would be sufficient to improve
row count estimation significantly, without incurring the vast overhead of
histogram x histogram manipulation.
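As a strawman for how a single number could get used (my own sketch, not
a claim about what the clause-selectivity code should actually look like):

# Fold one 0..1 correlation estimate into a two-clause selectivity.
# c = 0.0 reproduces today's independence assumption; c = 1.0 treats
# the clauses as redundant.
def combined_selectivity(s1, s2, c):
    independent = s1 * s2
    fully_correlated = min(s1, s2)
    return (1.0 - c) * independent + c * fully_correlated

# Two clauses that each pass 1% of the rows:
#   c = 0.0  ->  0.0001   (what we estimate now)
#   c = 0.9  ->  ~0.009

One scalar per column pair is cheap to store compared to any kind of
cross-column histogram.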
> Those are all valid points, but pretty much orthogonal to what I'm on
> about today. The problems I'm thinking about are seen even when the
> planner's rowcount estimates are dead on (as indeed they mostly were
> in Philippe's example).
OK, I was afraid they were interdependent. You would know better than me.
> Well, it'll still exist as a GUC for the same reasons the other ones are
> GUCs. But I think the main reasons people have needed to twiddle it are
> (1) their database is completely RAM-resident (where RPC
> *should* logically be 1.0), or
> (2) they're trying to compensate for the overestimation of
> nestloop indexscans.
> The changes I'm proposing should help with (2).
Right. What I'm saying is that (1) should be derived from
effective_cache_size and DBSIZE, not by setting an additional GUC.
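Something even as crude as this would do for a first cut (the names and
the linear blend are mine, strictly a sketch):

# Derive an effective random_page_cost from how much of the database
# plausibly fits in cache, instead of asking the DBA to hand-tune it.
def effective_random_page_cost(effective_cache_size, db_size,
                               base_random_page_cost=4.0):
    cached_fraction = min(1.0, effective_cache_size / db_size)
    # Fully cached behaves like sequential I/O (1.0); fully uncached
    # falls back to the usual disk-based default.
    return (cached_fraction * 1.0
            + (1.0 - cached_fraction) * base_random_page_cost)

A completely RAM-resident database then comes out at 1.0 automatically,
which is case (1) above.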
> > (1) the frequency with which that index is accessed, modified by
> > (2) whether the index is a fraction of available ram, or larger than ram;
> > e) the probability that the relevant table pages are cached in memory,
> > determined by the same two factors.
>
> These would all be nice things to know, but I'm afraid it's pie in the
> sky. We have no reasonable way to get those numbers. (And if we could
> get them, there would be another set of problems, namely plan
> instability: the planner's choices would become very difficult to
> reproduce.)
Hmmm ... I think those numbers are possible to infer, at least on a gross
level. Whether the cost of calculating them outweighs the benefit of
having them is another question, resolvable only through some
experimentation.
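For example, a gross-level guess from (1) and (2) above could be as
simple as this (again, just a sketch; the blend and the use of the stats
counters are my assumptions):

# Rough probability that an index's pages are cached, from how hot the
# index is and how it compares to available RAM.
def index_cached_probability(index_size, effective_cache_size,
                             index_scan_fraction):
    # At best a cache-sized slice of a too-big index stays resident.
    fits = min(1.0, effective_cache_size / index_size)
    return fits * index_scan_fraction

# index_scan_fraction could be taken from the pg_stat_user_indexes
# counters (this index's scans relative to all scans on its table).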
> a good place to start. The immediate problem is to get an
> infrastructure in place that gives us some numbers to apply equations
> to.
No arguments, of course.
--
--Josh
Josh Berkus
PostgreSQL @ Sun
San Francisco