Re: [SQL] [PERFORM] 7.3.1 index use / performance

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-performance(at)postgresql(dot)org, pgsql-general(at)postgresql(dot)org, pgsql-sql(at)postgresql(dot)org
Subject: Re: [SQL] [PERFORM] 7.3.1 index use / performance
Date: 2003-01-08 17:32:33
Message-ID: 4828.1042047153@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-performance pgsql-sql

Just to close off the thread, here is the end-result of investigating
Achilleus Mantzios' problem.

------- Forwarded Message

Date: Wed, 08 Jan 2003 11:54:36 -0500
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Achilleus Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
Subject: Re: [SQL] [PERFORM] 7.3.1 index use / performance

I believe I see what's going on. You have a number of silly outlier
values in the report_date column --- quite a few instances of '10007-06-09'
for example. Depending on whether ANALYZE's random sample happens to
include one of these, the histogram generated by ANALYZE might look like
this (it took about half a dozen tries with ANALYZE to get this result):

dynacom=# analyze noon;
ANALYZE
dynacom=# select histogram_bounds from pg_stats where attname = 'report_date';
histogram_bounds
-----------------------------------------------------------------------------------------------------------------------------
{1969-06-26,1994-09-24,1996-04-05,1997-07-21,1998-08-27,1999-03-13,1999-11-11,2000-08-18,2001-04-18,2002-01-04,10007-06-09}
(1 row)

in which case we get this:

dynacom=# EXPLAIN select * from noon where
dynacom-# report_date between '2002-01-07' and '2003-01-07';
QUERY PLAN
---------------------------------------------------------------------------------------------
Index Scan using noonf_date on noon (cost=0.00..4.08 rows=1 width=1975)
Index Cond: ((report_date >= '2002-01-07'::date) AND (report_date <= '2003-01-07'::date))
(2 rows)

Seeing this histogram, the planner assumes that one-tenth of the table
is uniformly distributed between 2002-01-04 and 10007-06-09, which leads
it to the conclusion that the range between 2002-01-07 and 2003-01-07
probably contains only about one row, which causes it to prefer a scan
on report_date rather than on v_code.

The reason the problem comes and goes is that any given ANALYZE run
might or might not happen across one of the outliers. When it doesn't,
you get a histogram that leads to reasonably accurate estimates.

There are a couple of things you could do about this. One is to
increase the statistics target for report_date (see ALTER TABLE SET
STATISTICS) so that a finer-grained histogram is generated for the
report_date column. The other thing, which is more work but probably
the best answer in the long run, is to fix the outliers, which I imagine
must be incorrect entries.

You could perhaps put a constraint on report_date to prevent bogus
entries from sneaking in in future.

It looks like increasing the stats target would be worth doing also,
if you make many queries using ranges of report_date.

regards, tom lane

------- End of Forwarded Message

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2003-01-08 17:42:47 Re: Strange Error
Previous Message Gavin M. Roy 2003-01-08 17:21:15 Re: Get PostgreSQL work with Kylix 3 ?

Browse pgsql-performance by date

  From Date Subject
Next Message Achilleus Mantzios 2003-01-08 17:53:54 Re: [SQL] [PERFORM] 7.3.1 index use / performance
Previous Message enediel 2003-01-08 17:18:09 postgresql in cluster of servers

Browse pgsql-sql by date

  From Date Subject
Next Message Zengfa Gao 2003-01-08 17:43:31 PgSQL 7.3: /opt/pgsql/bin/postgres: Execute permission denied.
Previous Message Bruce Momjian 2003-01-08 17:26:48 Re: What benefits can I expect from schemas ?