strange estimate for number of rows

From: Andrew Sullivan <andrew(at)libertyrms(dot)info>
To: pgsql-performance(at)postgresql(dot)org
Subject: strange estimate for number of rows
Date: 2003-11-13 18:14:15
Message-ID: 20031113181415.GB25546@libertyrms.info
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi all,

I've one here that I cannot fathom. Any suggestions?

We have a table, call it tablename, where we're selecting by a range
of dates and an identifier. (This is redacted, obviously):

\d tablename

Column | Type | Modifiers
--------------------+--------------------------+--------------------
id | integer | not null
transaction_date | timestamp with time zone | not null
product_id | integer | not null
Indexes:
"trans_posted_trans_date_idx" btree (transaction_date, product_id)

The statistics on transaction_date and product_id are set to 1000.
Everything is all analysed nicely. But I'm getting a poor plan,
because of an estimate that the number of rows to be returned is
about double how many actually are:

explain analyse select * from transactions_posted where
transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and
product_id = 2;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on transactions_posted (cost=0.00..376630.33 rows=700923
width=91) (actual time=8422.253..36176.078 rows=316029 loops=1)
Filter: ((transaction_date >= '2003-09-01 00:00:00-04'::timestamp
with time zone) AND (transaction_date < '2003-10-01
00:00:00-04'::timestamp with time zone) AND (product_id = 2))
Total runtime: 36357.630 ms
(3 rows)

SET enable_seqscan = off;

explain analyse select * from transactions_posted where
transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and
product_id = 2;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using trans_posted_trans_date_idx on transactions_posted
(cost=0.00..1088862.56 rows=700923 width=91) (actual
time=35.214..14816.257 rows=316029 loops=1)
Index Cond: ((transaction_date >= '2003-09-01
00:00:00-04'::timestamp with time zone) AND (transaction_date <
'2003-10-01 00:00:00-04'::timestamp with time zone) AND (product_id =
2))
Total runtime: 15009.816 ms
(3 rows)

SELECT attname,null_frac,avg_width,n_distinct,correlation FROM
pg_stats where tablename = 'transactions_posted' AND attname in
('transaction_date','product_id');
attname | null_frac | avg_width | n_distinct | correlation
------------------+-----------+-----------+------------+-------------
product_id | 0 | 4 | 2 | 0.200956
transaction_date | 0 | 8 | -0.200791 | 0.289248

Any ideas? I'm loathe to recommend cluster, since the data will not
stay clustered.

A

--
----
Andrew Sullivan 204-4141 Yonge Street
Afilias Canada Toronto, Ontario Canada
<andrew(at)libertyrms(dot)info> M2P 2A8
+1 416 646 3304 x110

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Arthur Ward 2003-11-13 18:28:15 Union+group by planner estimates way off?
Previous Message Robert Treat 2003-11-13 17:19:42 Re: IN surpasses NOT EXISTS in 7.4RC2 ??