From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_statistics and sample size WAS: Overhauling GUCS
Date: 2008-06-10 15:34:51
Message-ID: 200806100834.51471.josh@agliodbs.com
Lists: pgsql-hackers
Greg,
> The analogous case in our situation is not having 300 million distinct
> values, since we're not gathering info on specific values, only the
> buckets. We need, for example, 600 samples *for each bucket*. Each bucket
> is chosen to have the same number of samples in it. So that means that we
> always need the same number of samples for a given number of buckets.
I think that's plausible. The issue is that in advance of the sampling we
don't know how many buckets there *are*. So we first need a proportional
sample to determine the number of buckets, and then we need to retain a
histogram sample sized proportionally to that number of buckets. I'd like
to see someone with a PhD in this area weigh in, though.
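To make the bucket argument above concrete, here is a minimal sketch (not
PostgreSQL's actual ANALYZE code) of building an equal-frequency histogram
from a sample. The bucket count B and the "~600 samples per bucket" figure
are taken from the discussion above; everything else (names, data) is
illustrative only:

```python
import random

def equal_frequency_buckets(sample, num_buckets):
    """Split a sample into equal-frequency buckets.

    Returns num_buckets + 1 boundary values; each bucket then holds
    roughly the same number of sampled rows, so the boundaries
    approximate the column's quantiles.
    """
    s = sorted(sample)
    n = len(s)
    # Boundary i sits at the i-th quantile of the sorted sample.
    return [s[min((i * n) // num_buckets, n - 1)]
            for i in range(num_buckets + 1)]

# Hypothetical numbers: with B buckets and ~600 samples needed per
# bucket (per the argument above), the total sample size scales as
# 600 * B, independent of the table's row count.
B = 10
sample_size = 600 * B
data = [random.gauss(0.0, 1.0) for _ in range(sample_size)]
bounds = equal_frequency_buckets(data, B)
```

The point of the sketch is only that the required sample size is driven by
the number of buckets, not the size of the table; deciding B itself still
requires the preliminary proportional sample described above.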
> Really? Could you send references? The paper I read surveyed previous work
> and found that you needed to scan up to 50% of the table to get good
> results. 50-250% is considerably looser than what I recall it considering
> "good" results so these aren't entirely inconsistent but I thought previous
> results were much worse than that.
Actually, based on several years of selling performance-tuning services, I
found that as long as row estimates were correct within a factor of 3 (33% to
300%), the correct plan was generally chosen.
There are papers on block-based sampling which were already cited on -hackers;
I'll hunt through the archives later.
--
Josh Berkus
PostgreSQL @ Sun
San Francisco