Re: serious under-estimation of n_distinct for clustered distributions

From: Stefan Andreatta <s(dot)andreatta(at)synedra(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: serious under-estimation of n_distinct for clustered distributions
Date: 2013-01-14 07:35:31
Message-ID: 50F3B543.8020205@synedra.com
Lists: pgsql-performance

A status update on this problem:

1.) Workarounds (setting n_distinct manually) are tested and - as far as
workarounds go - OK. A sketch of the DDL follows after point 2.

2.) Source of the problem and possible solution:

The source of these troubles is the sampling method employed in
src/backend/commands/analyze.c. Judging from Tom Lane's comment on the
original implementation in 2004, it was never claimed to be perfect.
Does anybody see a chance to improve that part? Should this discussion
be taken elsewhere? Is there any input from my side that could help?
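
For reference, the workaround from point 1.) looks roughly like this
(table and column names are placeholders; see the PostgreSQL
documentation on the per-column n_distinct storage parameter for the
authoritative details):

    -- Override the planner's distinct estimate for one column.
    -- A positive value is an absolute number of distinct values;
    -- a negative value between -1 and 0 is minus a fraction of the
    -- row count, e.g. -0.05 means "about 5% of the rows are distinct".
    ALTER TABLE my_table ALTER COLUMN my_column SET (n_distinct = 50000);

    -- The override only takes effect at the next ANALYZE.
    ANALYZE my_table;

The override can be removed again with
ALTER TABLE my_table ALTER COLUMN my_column RESET (n_distinct).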

btw: I do find this problem to be very frequent in our databases. And
considering the commonplace conditions leading to it, I would expect
many systems to be affected. But searching the forums and the web I
hardly found any references to it - which amazes me to no end.
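
To check whether a particular column is affected, the stored estimate
can be compared with the real value (names are placeholders;
count(DISTINCT ...) is exact but slow on big tables):

    SELECT n_distinct
      FROM pg_stats
     WHERE schemaname = 'public'
       AND tablename  = 'my_table'
       AND attname    = 'my_column';

    SELECT count(DISTINCT my_column) FROM my_table;

Note that pg_stats.n_distinct is reported as a negative number when the
statistic is stored as a fraction of the row count rather than as an
absolute count.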

Best Regards,
Stefan

On 12/30/2012 07:02 PM, Stefan Andreatta wrote:
> On 12/29/2012 10:57 PM, Peter Geoghegan wrote:
>> On 29 December 2012 20:57, Stefan Andreatta <s(dot)andreatta(at)synedra(dot)com>
>> wrote:
>>> Now, the 2005 discussion goes into great detail on the advantages
>>> and disadvantages of this algorithm, particularly when using small
>>> sample sizes, and several alternatives are discussed. I do not know
>>> whether anything has been changed after that, but I know that the
>>> very distinct problem, which I will focus on here, still persists.
>>
>> It's a really hard problem to solve satisfactorily. It's a problem
>> that has been studied in much detail. Yes, the algorithm used is still
>> the same. See the comments within src/backend/commands/analyze.c (IBM
>> Research Report RJ 10025 is referenced there).
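
For readers without the source at hand: if I read the comments in
analyze.c correctly, the estimator taken from that report is the Haas &
Stokes "Duj1" estimator,

    n_distinct ~= n*d / (n - f1 + f1*n/N)

where n is the number of sampled rows, N the total number of rows in
the table, d the number of distinct values found in the sample, and f1
the number of values that occur exactly once in the sample. The
estimate is very sensitive to f1: as f1 shrinks, the estimate collapses
towards d, the number of distinct values actually seen in the sample -
and block-based sampling of clustered data is exactly what shrinks f1.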
>
> Thanks a lot for this information! I looked through the code a bit.
> The Haas & Stokes formula is fine. The problem really lies with the
> two-phase random selection procedure:
>
> Starting from line 1039, there is a comment:
> * As of May 2004 we use a new two-stage method: Stage one selects up
> * to targrows random blocks (or all blocks, if there aren't so many).
> * Stage two scans these blocks and uses the Vitter algorithm to create
> * a random sample of targrows rows (or less, if there are less in the
> * sample of blocks). The two stages are executed simultaneously: each
> * block is processed as soon as stage one returns its number and while
> * the rows are read stage two controls which ones are to be inserted
> * into the sample.
> *
> * Although every row has an equal chance of ending up in the final
> * sample, this sampling method is not perfect: not every possible
> * sample has an equal chance of being selected. For large relations
> * the number of different blocks represented by the sample tends to be
> * too small. We can live with that for now. Improvements are welcome.
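
As an aside, the "Vitter algorithm" mentioned for stage two is a
reservoir-sampling technique. analyze.c uses Vitter's optimized
variant; the following Python sketch only shows the basic reservoir
idea (Algorithm R), i.e. how a fixed-size uniform sample can be drawn
in a single pass without knowing the number of rows in advance:

    import random

    def reservoir_sample(rows, k):
        """Return k items chosen uniformly at random from an iterable."""
        sample = []
        for t, row in enumerate(rows):
            if t < k:
                # Fill the reservoir with the first k rows.
                sample.append(row)
            else:
                # The (t+1)-th row replaces a random reservoir slot
                # with probability k/(t+1).
                j = random.randint(0, t)
                if j < k:
                    sample[j] = row
        return sample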
>
>
> Now the problem with clustered data is that the probability of
> sampling a value twice is much higher when the same page is repeatedly
> sampled. As stage one takes a random sample of pages, and stage two
> samples rows from these pages, the probability of visiting the same
> page twice (or more often) is much higher than if random rows were
> selected from the whole table. Hence we get a lot more repeated values
> for clustered data and end up with the severe under-estimation we can
> see in those cases.
>
> Probabilities do my brain in, as usual, but I tested the procedure on
> my test data with a simple Python script. There is absolutely nothing
> wrong with the implementation; it seems to be a purely statistical
> problem.
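
This is not the script referred to above, but a self-contained Python
sketch along the same lines. All sizes are invented, and the layout is
the extreme case where every distinct value fills exactly one page:

    import random
    from collections import Counter

    ROWS_PER_PAGE  = 100
    N_PAGES        = 200_000
    N_ROWS         = N_PAGES * ROWS_PER_PAGE   # 20 million rows
    ROWS_PER_VALUE = ROWS_PER_PAGE             # extreme clustering: each
                                               # value fills one whole page
    SAMPLE_ROWS    = 30_000                    # 300 x default_statistics_target

    def value_of(row):
        # Clustered layout: consecutive rows share the same value.
        return row // ROWS_PER_VALUE

    def haas_stokes(sample, total_rows):
        # n*d / (n - f1 + f1*n/N), the estimator used in analyze.c
        n = len(sample)
        counts = Counter(sample)
        d = len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)
        return n * d / (n - f1 + f1 * n / total_rows)

    # A true random sample of rows, drawn from the whole table.
    row_sample = [value_of(r)
                  for r in random.sample(range(N_ROWS), SAMPLE_ROWS)]

    # Two-stage style: pick SAMPLE_ROWS random pages, then draw
    # SAMPLE_ROWS rows uniformly from just those pages.
    pages = random.sample(range(N_PAGES), SAMPLE_ROWS)
    picks = random.sample(range(SAMPLE_ROWS * ROWS_PER_PAGE), SAMPLE_ROWS)
    block_sample = [value_of(pages[i // ROWS_PER_PAGE] * ROWS_PER_PAGE
                             + i % ROWS_PER_PAGE)
                    for i in picks]

    print("true n_distinct:         ", N_ROWS // ROWS_PER_VALUE)
    print("estimate, row sampling:  ", round(haas_stokes(row_sample, N_ROWS)))
    print("estimate, block sampling:", round(haas_stokes(block_sample, N_ROWS)))

With these numbers the row-sample estimate should come out near the
true 200,000, while the block-based estimate should collapse to
somewhere around the sample size of 30,000 - the kind of severe
under-estimation discussed in this thread.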
>
> Not everything may be hopeless though ;-) The problem could
> theoretically be avoided if random rows were selected from the whole
> table. Again, that may not be feasible - the two-phase approach was
> probably not implemented for nothing.
>
> Another possible solution would be to avoid much (though not all) of
> the resampling in phase two. For that - in theory - every page visited
> would have to get a lower weight, so that revisiting it is no more
> likely than if rows were selected from the whole table. That does not
> sound easy or elegant to implement. But perhaps there is some clever
> algorithm - unfortunately I do not know of one.
>
>
>> The general advice here is:
>>
>> 1) Increase default_statistics_target for the column.
>>
>> 2) If that doesn't help, consider using the following DDL:
>>
>> alter table foo alter column bar set ( n_distinct = 5.0);
>
> Yes, I will probably have to live with that for now - I will come back
> to these workarounds with one or two questions.
>
> Thanks again & Regards,
> Stefan
>
>
