Quick Links

Re: serious under-estimation of n_distinct for clustered distributions

From:	Stefan Andreatta <s(dot)andreatta(at)synedra(dot)com>
To:	Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: serious under-estimation of n_distinct for clustered distributions
Date:	2013-01-04 05:14:34
Message-ID:	50E6653A.4060900@synedra.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

On 12/29/2012 10:57 PM, Peter Geoghegan wrote:
> On 29 December 2012 20:57, Stefan Andreatta <s(dot)andreatta(at)synedra(dot)com> wrote:
...

> The general advice here is:
>
> 1) Increase default_statistics_target for the column.

I tried that, but to get good estimates under these circumstances, I
need to set the statistics_target so high that the whole table gets
analyzed. As this problem matters most for all of our large tables, I
would have to set default_statistics_target to something like 100000 -
that's a bit scary for production systems with tables of appr. 100GB, I
find.

> 2) If that doesn't help, consider using the following DDL:
>
> alter table foo alter column bar set ( n_distinct = 5.0);
>

Yes, that's probably best - even if it means quite some maintenance
work. I do it like that:

ALTER TABLE test_1 ALTER COLUMN clustered_random_2000k SET (n_distinct
= -0.05);

btw: Postgres will never set relative n_distinct values for anything
larger than -0.1. If I determine (or know) it to be a constant but lower
fraction, could it be a problem to explicitly set this value to between
-0.1 and 0?

To activate that setting, however, an ANALYZE has to be run. That was
not clear to me from the documentation:

ANALYZE verbose test_1;

To check column options and statistics values:

SELECT pg_class.relname AS table_name,
pg_attribute.attname AS column_name, pg_attribute.attoptions
FROM pg_attribute
JOIN pg_class ON pg_attribute.attrelid = pg_class.oid
WHERE pg_attribute.attnum > 0
AND pg_class.relname = 'test_1'
AND pg_attribute.attname = 'clustered_random_2000k';

SELECT tablename AS table_name, attname AS column_name,
null_frac, avg_width, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'test_1' and attname = 'clustered_random_2000k';

And finally, we can undo the whole thing, if necessary:

ALTER TABLE test_1 ALTER COLUMN clustered_random_2000k RESET (n_distinct);
ANALYZE VERBOSE test_1;

Regards,
Stefan

In response to

Re: serious under-estimation of n_distinct for clustered distributions at 2012-12-29 21:57:04 from Peter Geoghegan

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Scott Marlowe	2013-01-04 07:42:47	Re: SMP on a heavy loaded database
Previous Message	Stefan Andreatta	2013-01-04 04:33:09	Re: Simple join doesn't use index