| From: | Peter Geoghegan <peter(at)2ndquadrant(dot)com> | 
|---|---|
| To: | Stefan Andreatta <s(dot)andreatta(at)synedra(dot)com> | 
| Cc: | pgsql-performance(at)postgresql(dot)org | 
| Subject: | Re: serious under-estimation of n_distinct for clustered distributions | 
| Date: | 2012-12-29 21:57:04 | 
| Message-ID: | CAEYLb_VDRSye1zieFcuKH2ZnCaht11O5DZX8-cWRs9CKs0M-cg@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-performance | 
On 29 December 2012 20:57, Stefan Andreatta <s(dot)andreatta(at)synedra(dot)com> wrote:
> Now, the 2005 discussion goes into great detail on the advantages and
> disadvantages of this algorithm, particularly when using small sample sizes,
> and several alternatives are discussed. I do not know whether anything has
> been changed after that, but I know that the very distinct problem, which I
> will focus on here, still persists.
It's a really hard problem to solve satisfactorily. It's a problem
that has been studied in much detail. Yes, the algorithm used is still
the same. See the comments within src/backend/commands/analyze.c (IBM
Research Report RJ 10025 is referenced there).
The general advice here is:
1) Increase default_statistics_target for the column.
2) If that doesn't help, consider using the following DDL:
alter table foo alter column bar set ( n_distinct = 5.0);
-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Stefan Andreatta | 2012-12-30 18:02:44 | Re: serious under-estimation of n_distinct for clustered distributions | 
| Previous Message | Stefan Andreatta | 2012-12-29 20:57:04 | serious under-estimation of n_distinct for clustered distributions |