Quick Links

Re: Query optimizer 8.0.1 (and 8.0)

From:	pgsql(at)mohawksoft(dot)com
To:	"Greg Stark" <gsstark(at)mit(dot)edu>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Query optimizer 8.0.1 (and 8.0)
Date:	2005-02-08 14:33:26
Message-ID:	16625.24.91.171.78.1107873206.squirrel@mail.mohawksoft.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

> pgsql(at)mohawksoft(dot)com writes:
>
>> The basic problem with a fixed sample is that is assumes a normal
>> distribution.
>
> That's sort of true, but not in the way you think it is.
>
[snip]

Greg, I think you have an excellent ability to articulate stats, but I
think that the view that this is like election polling is incorrect.

Election polling assumes a very simple outcome: Some standard ditribution
of a limited number options. I don't think it applies to this.

>
>> When you look at a sine wave on an oscilloscope, you can see it clear as
>> day. When you look at music on the scope, you know there are many waves
>> there, but it is difficult to make heads or tails of it. (use xmms or
>> winamp to see for yourself) The waves change in frequency, amplitude,
>> and
>> duration over a very large scale. That's why you use a spectrum analyzer
>> to go from time domain to frequency domain. In frequency domain, you can
>> see the trends better.
>
> That's not a bad analogy to many problems where you're measuring data that
> has
> non-randomness in it but that are not visible in the domain that the
> statistics that are being analyzed. This seems to happen a lot with
> geographic
> data, for instance.

EXACTLY!!!

>
> If you find that increasing the stats targets improves things then this
> isn't true. If you find that it doesn't then what's really needed is a
> cleverer set of statistics to look for.

I will be the first one to say that increasing the samples is not perfect,
but it is a methodology that will help without major changes in postgres.
Simply increasing the samples to a percentage of the estimated number of
rows (with some upper and lower limits of course) will increase the
accuracy of the "n_distinct" and "correlation" settings (at least a little
bit), and that will make a huge impact with very little work.

If we want to discuss improved statatistics, then we should include a
standard deviation and a sliding window deviation, or something like that.
Hell, maybe even FFT.

The basic problem, I think, is that the sampling mechanism is more like an
oscilloscope looking for large trends instead of a spectrum analyzer
looking for the smaller ones.

We have to be able to tell the planner that adjacent values are less
random even though, as a whole, they are seemingly random.

In response to

Re: Query optimizer 8.0.1 (and 8.0) at 2005-02-08 02:02:54 from Greg Stark

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2005-02-08 15:25:26	Re: float4 regression test failed on linux parisc
Previous Message	Jim Buttafuoco	2005-02-08 14:14:35	Fw: Re: float4 regression test failed on linux parisc