From: | pgsql(at)mohawksoft(dot)com |
---|---|
To: | "Greg Stark" <gsstark(at)mit(dot)edu> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Query optimizer 8.0.1 (and 8.0) |
Date: | 2005-02-08 14:33:26 |
Message-ID: | 16625.24.91.171.78.1107873206.squirrel@mail.mohawksoft.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
> pgsql(at)mohawksoft(dot)com writes:
>
>> The basic problem with a fixed sample is that is assumes a normal
>> distribution.
>
> That's sort of true, but not in the way you think it is.
>
[snip]
Greg, I think you have an excellent ability to articulate stats, but I
think that the view that this is like election polling is incorrect.
Election polling assumes a very simple outcome: Some standard ditribution
of a limited number options. I don't think it applies to this.
>
>> When you look at a sine wave on an oscilloscope, you can see it clear as
>> day. When you look at music on the scope, you know there are many waves
>> there, but it is difficult to make heads or tails of it. (use xmms or
>> winamp to see for yourself) The waves change in frequency, amplitude,
>> and
>> duration over a very large scale. That's why you use a spectrum analyzer
>> to go from time domain to frequency domain. In frequency domain, you can
>> see the trends better.
>
> That's not a bad analogy to many problems where you're measuring data that
> has
> non-randomness in it but that are not visible in the domain that the
> statistics that are being analyzed. This seems to happen a lot with
> geographic
> data, for instance.
EXACTLY!!!
>
> If you find that increasing the stats targets improves things then this
> isn't true. If you find that it doesn't then what's really needed is a
> cleverer set of statistics to look for.
I will be the first one to say that increasing the samples is not perfect,
but it is a methodology that will help without major changes in postgres.
Simply increasing the samples to a percentage of the estimated number of
rows (with some upper and lower limits of course) will increase the
accuracy of the "n_distinct" and "correlation" settings (at least a little
bit), and that will make a huge impact with very little work.
If we want to discuss improved statatistics, then we should include a
standard deviation and a sliding window deviation, or something like that.
Hell, maybe even FFT.
The basic problem, I think, is that the sampling mechanism is more like an
oscilloscope looking for large trends instead of a spectrum analyzer
looking for the smaller ones.
We have to be able to tell the planner that adjacent values are less
random even though, as a whole, they are seemingly random.
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2005-02-08 15:25:26 | Re: float4 regression test failed on linux parisc |
Previous Message | Jim Buttafuoco | 2005-02-08 14:14:35 | Fw: Re: float4 regression test failed on linux parisc |