From: | "Sergey E(dot) Koposov" <math(at)sai(dot)msu(dot)ru> |
---|---|
To: | Peter Geoghegan <pg(at)heroku(dot)com> |
Cc: | Jim Nasby <jim(at)nasby(dot)net>, Greg Stark <stark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: ANALYZE sampling is too good |
Date: | 2013-12-11 01:27:04 |
Message-ID: | alpine.LRH.2.00.1312110506150.19468@lnfm1.sai.msu.ru |
Lists: | pgsql-hackers |
For what it's worth.
I'll quote the first line of the abstract of Chaudhuri et al. on block
sampling:
"Block-level sampling is far more efficient than true uniform-random
sampling over a large database, but prone to significant errors if used
to create database statistics."
After briefly glancing through the paper, my opinion is that it works
because, after building one version of the statistics, they cross-validate,
see how well it does, and then collect more data if the cross-validation
error is large (for example, because the data is clustered). Without this
bit, as far as I can see, a simple block-based sampler is bound to make
catastrophic mistakes, depending on the distribution.
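To make the cross-validation idea concrete, here is a rough Python sketch. It is not the algorithm from the paper, only an illustration of the loop described above: build an equi-depth histogram from one block sample, check it against a second, disjoint block sample, and double the sample if the buckets disagree too much. The function names, the 10-bucket histogram, the 0.05 tolerance and the toy table of 400 blocks of 50 rows are all assumptions made for the example.

```python
import random
from bisect import bisect_right


def equi_depth_bounds(values, buckets):
    """Return buckets-1 boundaries that split the sorted sample into equal parts."""
    s = sorted(values)
    return [s[len(s) * i // buckets] for i in range(1, buckets)]


def max_bucket_error(bounds, values):
    """Largest gap between a bucket's observed share and the ideal 1/buckets."""
    buckets = len(bounds) + 1
    counts = [0] * buckets
    for v in values:
        counts[bisect_right(bounds, v)] += 1
    return max(abs(c / len(values) - 1 / buckets) for c in counts)


def blocks_needed(blocks, rng, buckets=10, tolerance=0.05, start=4):
    """Double the block sample until a held-out block sample agrees with the
    histogram, or until the table is too small to double again."""
    n = start
    while True:
        picked = rng.sample(range(len(blocks)), 2 * n)  # two disjoint halves
        build = [v for i in picked[:n] for v in blocks[i]]
        check = [v for i in picked[n:] for v in blocks[i]]
        err = max_bucket_error(equi_depth_bounds(build, buckets), check)
        if err <= tolerance or 4 * n > len(blocks):
            return n, err
        n *= 2


rng = random.Random(0)
rows = list(range(20000))

# Clustered table: every block holds 50 consecutive values.
clustered = [rows[i:i + 50] for i in range(0, len(rows), 50)]

# Same values, but spread randomly over the blocks.
shuffled_rows = rows[:]
rng.shuffle(shuffled_rows)
shuffled = [shuffled_rows[i:i + 50] for i in range(0, len(rows), 50)]

n_clustered, _ = blocks_needed(clustered, rng)
n_shuffled, _ = blocks_needed(shuffled, rng)
print("blocks per sample, clustered:", n_clustered)
print("blocks per sample, shuffled: ", n_shuffled)
```

On the shuffled table a block is as good as a random set of rows, so the check should pass early. On the clustered table each block carries roughly one row's worth of information, so the loop should keep doubling. A sampler without the check would simply stop at the first, small sample in both cases.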
Also, just another point about targets (e.g. X%) for estimating stuff from
the samples (as was discussed in the thread). Basically, there is a point
in talking about sampling a fixed target (5%) of the data ONLY if you fix
the actual distribution of the data in your table and decide what
statistic you are trying to find, e.g. average, std. dev., a 90th
percentile, ndistinct, or a histogram, and so forth. There won't be a
general answer, as the percentages will be distribution-dependent and
statistic-dependent.
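A small Python illustration of that point, under made-up assumptions: two synthetic columns of 100,000 rows, a uniform 5% row sample, and a deliberately naive ndistinct estimate (the number of distinct values seen in the sample, which is not what ANALYZE computes). The column names and the `sample_errors` helper are hypothetical.

```python
import random
from statistics import fmean


def relative_error(estimate, truth):
    return abs(estimate - truth) / abs(truth)


def sample_errors(column, fraction, rng):
    """Relative error of the mean and of a naive ndistinct from one row sample."""
    sample = rng.sample(column, int(len(column) * fraction))
    return (
        relative_error(fmean(sample), fmean(column)),
        relative_error(len(set(sample)), len(set(column))),
    )


rng = random.Random(0)
few_values = [rng.randrange(10) for _ in range(100_000)]       # 10 distinct values
many_values = [rng.randrange(50_000) for _ in range(100_000)]  # tens of thousands

for name, column in (("few_values", few_values), ("many_values", many_values)):
    mean_err, ndistinct_err = sample_errors(column, 0.05, rng)
    print(f"{name}: mean error {mean_err:.1%}, ndistinct error {ndistinct_err:.1%}")
```

The same 5% sample should be plenty for the mean of either column and for ndistinct on the low-cardinality column, while the naive ndistinct on the high-cardinality column is badly off. Whether "5%" is enough depends on which statistic and which distribution you are asking about.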
Cheers,
Sergey
PS I'm not a statistician, but I use statistics a lot
*******************************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/