Re: Query optimizer 8.0.1 (and 8.0)

From: pgsql(at)mohawksoft(dot)com
To: "Mark Kirkwood" <markir(at)coretech(dot)co(dot)nz>
Cc: pgsql(at)mohawksoft(dot)com, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Ron Mayer" <rm_pg(at)cheapcomplexdevices(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Query optimizer 8.0.1 (and 8.0)
Date: 2005-02-07 22:36:27
Message-ID: 16397.24.91.171.78.1107815787.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> Maybe I am missing something - ISTM that you can increase your
> statistics target for those larger tables to obtain a larger (i.e.
> better) sample.

No one is arguing that you can't manually do things, but I am not the
first to notice this. I saw the query planner doing something completely
stupid and set off to discover why.

Think about the person using PostgreSQL for the first time. He/she does
not know about this stuff. Even if they've read the FAQs and the manual
cover to cover, it will take them some time to figure out it all works
together. PostgreSQL is a big system, and this is exactly why MySQL gets
better marks from newbes.

In this case, the behavior observed could be changed by altering the
sample size for a table. I submit that an arbitrary fixed sample size is
not a good base for the analyzer, but that the sample size should be based
on the size of the table or some calculation of its deviation.

There is no reason why old stats can't be used to create more accurate
stats. Using succesive analyze operations, we could create better
statistics for the planner. We can increase the sample size based on the
table size. We could, I suppose, also calculate some sort of deviation
statistic so that "n_distinct" can be calculated better with a smaller
sample set.

The basic problem, though, is that PostgreSQL performed incorrectly on a
simple query after indexes were created and analyze performed. Yes, it can
be corrected, that's what led me to my conclusions, but shouldn't we try
to devise a better system in the future to improve PostgreSQL so it does
not need this sort of tuning?

>
> regards
>
> Mark
>
> pgsql(at)mohawksoft(dot)com wrote:
>>>pgsql(at)mohawksoft(dot)com writes:
>> Any and all random sampling assumes a degree of uniform distribution.
>> This
>> is the basis of the model. It assumes that chunks of the whole will be
>> representative of the whole (to some degree). This works when normal
>> variations are more or less distributed uniformly. As variations and
>> trends becomes less uniformly distributed, more samples are required to
>> characterize it.
>>
>> Douglas Adams had a great device called the "Total Perspective Vortex"
>> which infered the whole of the universe from a piece of fairy cake. It
>> was
>> a subtle play on the absurd notion that a very small sample could lead
>> to
>> an understanding of an infinitly larger whole.
>>
>> On a very basic level, why bother sampling the whole table at all? Why
>> not
>> check one block and infer all information from that? Because we know
>> that
>> isn't enough data. In a table of 4.6 million rows, can you say with any
>> mathmatical certainty that a sample of 100 points can be, in any way,
>> representative?
>>
>> Another problem with random sampling is trend analysis. Often times
>> there
>> are minor trends in data. Ron pointed out the lastname firstname trend.
>> Although there seems to be no correlation between firstnames in the
>> table,
>> there are clearly groups or clusters of ordered data that is an ordering
>> that is missed by too small a sample.
>>
>> I understand why you chose the Vitter algorithm, because it provides a
>> basically sound methodology for sampling without knowledge of the size
>> of
>> the whole, but I think we can do better. I would suggest using the
>> current
>> algorithm the first time through, then adjust the number of samples [n]
>> based on the previous estimate of the size of the table [N]. Each
>> successive ANALYZE will become more accurate. The Vitter algorithm is
>> still useful as [N] will always be an estimate.
>>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Stark 2005-02-07 22:45:23 Re: Query optimizer 8.0.1 (and 8.0)
Previous Message Alvaro Herrera 2005-02-07 22:31:33 Re: Query optimizer 8.0.1 (and 8.0)