Quick Links

Re: Optimizer produces wildly different row count estimate depending on casts

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Shelby Cain <alyandon(at)yahoo(dot)com>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Optimizer produces wildly different row count estimate depending on casts
Date:	2004-03-08 22:14:10
Message-ID:	16813.1078784050@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Shelby Cain <alyandon(at)yahoo(dot)com> writes:
> Consider the following queries:

> #1) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > '07-FEB-2004' and
> txn_date <= current_date group by txn_code;

> #2) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > current_date - 30 and
> txn_date <= current_date group by txn_code;

> For the queries above, the optimizer correctly
> identified that using the index is a big gain over a
> sequential scan and the estimated vs actual row counts
> are within striking distance of one another.

The optimizer has no idea about the selectivity of the clauses involving
current_date, since it hasn't got a constant value to compare to the
statistical histogram. (In the case of expressions involving
current_date and similar functions, it might be reasonable to compute
the current value and use that as an estimate, but I'm unsure of the
conditions under which that's safe. In any case there is no such code
at present.) So for query #1 we have a hopefully-pretty-good estimate
for "txn_date > '07-FEB-2004'", and then we knock that down by an
arbitrary percentage because we don't have a clue about "txn_date <=
current_date"; which is why the estimate is too small. In query #2 we
are, plain and simply, guessing. The optimizer can however see that
this is a range constraint on txn_date, and the default guess in such
cases is chosen to favor an indexscan.

> #3) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > (current_date -
> 30)::timestamp and txn_date <= current_date group by
> txn_code;

The first clause is not indexable (before 7.5) because the effective
meaning is txn_date::timestamp > something, and your index is not on
txn_date::timestamp. Indexing using only the second clause would be
a loser and the planner won't do it. What's more, this is no longer
obviously a range limitation, because the constrained variables aren't
the same. So the estimate is pretty stupid --- it assumes the two WHERE
clauses are independent.

> #4) explain analyze select txn_code, count(*) from
> inventory_txns where txn_date > (current_date -
> 30)::timestamp and txn_date <= current_date::timestamp
> group by txn_code;

Neither clause is indexable, but at least the planner can see that it's
a range constraint on txn_date::timestamp, so you get a smaller
guesstimate.

regards, tom lane

In response to

Optimizer produces wildly different row count estimate depending on casts at 2004-03-08 21:30:04 from Shelby Cain

Responses

Re: Optimizer produces wildly different row count estimate depending on casts at 2004-03-08 23:57:50 from Shelby Cain

Browse pgsql-general by date

	From	Date	Subject
Next Message	Shelby Cain	2004-03-08 23:57:50	Re: Optimizer produces wildly different row count estimate depending on casts
Previous Message	Joe Conway	2004-03-08 21:47:52	Re: creating sequential timestamp