Re: Setting Statistics on Functional Indexes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: sthomas(at)optionshouse(dot)com
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Setting Statistics on Functional Indexes
Date: 2012-10-26 21:08:18
Message-ID: 3040.1351285698@sss.pgh.pa.us
Lists: pgsql-performance

Shaun Thomas <sthomas(at)optionshouse(dot)com> writes:
> But I just noticed the lag in your response. :) It turns out, even
> though I was substituting 2012-10-24 or 2012-10-25, what I really meant
> was current_date. That does make all the difference, actually.

Ah. [ pokes at that for a while... ] OK, this has nothing to do with
functional indexes, and everything to do with the edge-case behavior of
scalarltsel. What you've got is a histogram whose last entry
(corresponding to the highest observed value of the date) is
current_date, and the question is what we should assume when estimating
how many rows have a value >= that. The answer of course is "one, plus
any duplicates" ... but we don't know how many duplicates there are,
and what we do know is that it's not a particularly large number because the
value isn't present in the most-common-values stats. So the code there
assumes there aren't any dups.
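
You can see this directly in pg_stats, if you're curious.  (Table and
column names below are made up for illustration; substitute your own.)

    SELECT histogram_bounds, most_common_vals
      FROM pg_stats
     WHERE tablename = 'sys_order' AND attname = 'order_date';

    -- if current_date is the last histogram bound and absent from
    -- most_common_vals, the planner treats it as essentially unique:
    EXPLAIN SELECT * FROM sys_order WHERE order_date >= current_date;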

Once you have enough histogram resolution for current_date to show up
as the next-to-last as well as the last histogram entry, then of course
the estimate gets a lot better, since we can now tell that there's at
least one histogram bin's worth of duplicates.
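
Something like this (hypothetical names again) is the usual way to get
there:

    ALTER TABLE sys_order ALTER COLUMN order_date SET STATISTICS 1000;
    ANALYZE sys_order;

That raises the per-column target from the default 100, so ANALYZE
keeps a correspondingly finer-grained histogram.  (Stats for an
expression index show up in pg_stats under the index's name instead.)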

Interestingly, this is a case where the get_actual_variable_range patch
(commit 40608e7f, which appeared in 9.0) makes the results worse.
Before that, there was a (very arbitrary) lower bound on what we'd
believe as the selectivity of a >= condition, but now, when we know the
actual upper limit of the variable, we don't clamp the result that way.
I think the clamp must have been saving you in your previous version,
because it more-or-less-accidentally accounted for the end value not
being unique.

So the bottom line is that this is a case where you need a lot of
resolution in the histogram. I'm not sure there's anything good
we can do to avoid that. I spent a bit of time thinking about whether
we could use n_distinct to get some idea of how many duplicates there
might be for the endpoint value, but n_distinct is unreliable enough
that I can't develop a lot of faith in such a thing. Or we could just
arbitrarily assume some fraction-of-a-histogram-bin's worth of
duplicates, but that would make the results worse for some people.
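
For reference, the n_distinct estimate in question is also visible in
pg_stats (same made-up names as above); negative values mean a fraction
of the row count rather than an absolute count:

    SELECT n_distinct
      FROM pg_stats
     WHERE tablename = 'sys_order' AND attname = 'order_date';

A value of -0.05, for example, means ANALYZE thinks about 5% of the
rows are distinct.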

regards, tom lane
