From: Tomas Vondra <tv@fuzzy.cz>
To: pgsql-hackers@postgresql.org
Subject: Re: proposal : cross-column stats
Date: 2010-12-24 13:50:39
Message-ID: 4D14A52F.8040302@fuzzy.cz
Lists: pgsql-hackers
On 24.12.2010 04:41, Florian Pflug wrote:
> The filter size could be derived from the table's statistics target, or
> be otherwise user-definable. We could also auto-resize once it gets too
> full. But still, that all seems awfully complex :-(
Using the statistics target is a good idea, I think. We could use it to
determine the error rate of the filter. Something like

error rate = 10 - (statistics_target - 100) / 100

which gives

1% for statistics target 1000
10% for statistics target 100
or maybe something like this (where the error rate grows faster for
smaller statistics target values)

error rate = 0.91 + 91000 / (statistics_target^2)

which gives about

1% for statistics target 1000
10% for statistics target 100
37% for statistics target 50
But I guess a 10% error rate is about the most we can tolerate, so it
does not make much sense to use lower statistics target values.
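For a quick sanity check, here is a sketch of the two candidate mappings,
together with the textbook Bloom filter sizing formula
m = -n * ln(p) / (ln 2)^2. The sizing formula is the standard one, not
something from this thread, and the function names are made up for
illustration:

```python
import math

def error_rate_linear(target):
    # linear mapping: 10% at statistics target 100, 1% at 1000
    return 10 - (target - 100) / 100.0

def error_rate_quadratic(target):
    # grows faster for small targets: ~10% at 100, ~1% at 1000, ~37% at 50
    return 0.91 + 91000.0 / target ** 2

def bloom_bits(ndistinct, error_pct):
    # textbook Bloom filter sizing: m = -n * ln(p) / (ln 2)^2
    p = error_pct / 100.0
    return int(math.ceil(-ndistinct * math.log(p) / math.log(2) ** 2))

for t in (100, 1000):
    print(t, error_rate_linear(t), round(error_rate_quadratic(t), 1))

# bits needed for 1M distinct values at a 10% error rate
# (roughly 4.8 million bits, i.e. ~0.6 MB)
print(bloom_bits(1_000_000, 10.0))
```

Even at the coarse 10% error rate, a million distinct values already cost
over half a megabyte of filter, which is why the error rate matters.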
>> Another possibility is to collect the data from just a small portion
>> of a table and then use the result to estimate the number of distinct
>> values for the whole table. But I'm not sure we can do this reliably,
>> I see many traps in this.
> This is how it works currently. The problem with this approach is that
> it gives you very little guarantees about how precise the result will be.
> Extrapolating works very well for things like MCVs and histograms, because
> there you're by definition interested mostly in values which occur often -
> and thus with a high probability in the relative few rows you sample. For
> the number of distinct values, however, this isn't true - if ndistinct
> is an order of magnitude smaller than the number of rows, relatively few
> rows can account for a large percentage of the distinct values...
That basically means we need to sample a large portion of the table :-(
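The effect is easy to quantify. Assume, for illustration, a table with D
equally frequent distinct values and a uniform sample of s rows drawn
with replacement; the expected number of distinct values seen is the
standard occupancy result D * (1 - (1 - 1/D)^s). This model is mine, not
from the thread, but it shows how slowly the sample catches up:

```python
def expected_distinct_in_sample(ndistinct, sample_size):
    # occupancy formula: a given value is missed by one sampled row with
    # probability (1 - 1/D), hence by all s rows with that probability
    # raised to the s-th power (sampling with replacement)
    return ndistinct * (1 - (1 - 1.0 / ndistinct) ** sample_size)

# 10M-row table, ndistinct an order of magnitude smaller than row count
nrows = 10_000_000
ndistinct = 1_000_000
for pct in (1, 5, 20):
    s = nrows * pct // 100
    seen = expected_distinct_in_sample(ndistinct, s)
    print(f"{pct:3d}% sample -> {seen / ndistinct:.0%} of distinct values seen")
```

A 1% sample sees under 10% of the distinct values, and even a 20% sample
sees only about 86%, which is exactly the "large portion" problem above.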
> Another idea would be to obtain the ndistinct values from an index somehow.
> Postgres cannot currently scan an index in physical order, only in logical
> order, due to locking considerations. But since we'd only be interested in
> an estimate, maybe a scan in physical block order would work for ndistinct
> estimates? Just a wild idea, mind you, I haven't checked at all if that'd
> be even remotely feasible.
I was thinking about that too, and I think we could do this using the
pageinspect contrib module. Sure, there might be a problem with bloated
indexes.
And relying on this actually means a multi-column index on all the
columns is required. Individual indexes are not enough, as we need the
number of distinct combinations too.
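A toy illustration of why a physical-order scan could still work: every
leaf page of a btree is internally sorted, so summing per-page distinct
counts over-counts only values that straddle a page boundary, which gives
tight bounds regardless of the order in which pages are visited. The
sketch below is hypothetical (a real version would read leaf tuples via
pageinspect's bt_page_items()); for the multi-column case the keys would
be tuples of column values:

```python
def ndistinct_bounds(pages):
    # pages: one list of keys per leaf page, each internally sorted;
    # the order in which the pages are visited does not matter
    per_page = [len(set(p)) for p in pages if p]
    if not per_page:
        return 0, 0
    upper = sum(per_page)
    # in a sorted index each page boundary is straddled by at most one
    # value, so at most (npages - 1) values are double-counted
    lower = upper - (len(per_page) - 1)
    return lower, upper

# keys 1..10 over three "leaf pages"; 4 and 8 straddle page boundaries
pages = [[1, 2, 3, 4], [4, 5, 6, 7, 8], [8, 9, 10]]
print(ndistinct_bounds(pages))  # (10, 12), true ndistinct is 10
```

Bloat widens the gap between the bounds (more pages, same values), which
matches the concern about bloated indexes above.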
regards
Tomas