From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [v9.2] make_greater_string() does not return a string in some cases |
Date: | 2011-09-22 04:24:04 |
Message-ID: | 8755.1316665444@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs pgsql-hackers |
Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I'm a bit perplexed as to why we can't find a non-stochastic way of doing this.
Because the behavior of common collation algorithms is so wacky as to
approach stochasticity. In particular, as soon as your string contains
a mix of letter and non-letter characters, "dictionary" algorithms tend
to behave really strangely. We went around on this quite a few times
before settling on the current approach; we kept thinking exactly what
you're thinking, namely that we ought to be able to predict more than
nothing about what the collation algorithm would consider larger.
An example of the weirdness is that in en_US collation, we have
postgres=# select '/root/' < '/root0';
?column?
----------
t
(1 row)
postgres=# select '/root/t' < '/root0';
?column?
----------
f
(1 row)
It was cases like this that forced us to give up on using non-C-locale
indexes for LIKE optimization, because the derived greater-than and
less-than conditions have to be 100% correct for that. What we're still
using make_greater_string for in non-C locales is to generate estimates
about the selectivity of LIKE prefix conditions; for that, some amount
of error is tolerable and so we just live with effects like the above.
(We have to deal with the locale's sort order, whatever the heck it is,
because that's what the pg_statistic histogram is computed in.)
Now, having said that, I'm starting to wonder again why it's worth our
trouble to fool with encoding-specific incrementers. The exactness of
the estimates seems unlikely to be improved very much by doing this.
>>> One random idea I have is - instead of generating > and < clauses,
>>> could we define a "prefix match" operator - i.e. a ### b iff substr(a,
>>> 1, length(b)) = b? We'd need to do something about the selectivity,
>>> but I don't see why that would be a problem.
>> The problem is that you'd need to make that a btree-indexable operator.
> Well, right. Without that, there's not much point. But do you think
> that's prohibitively difficult?
The problem is that you'd just be shifting all these same issues into
the btree index machinery, which is not any better equipped to cope with
them, and would not be a good place to be adding overhead.
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | pratikchirania | 2011-09-22 04:47:35 | Re: Timezone issues with Postrres |
Previous Message | Robert Haas | 2011-09-22 03:04:02 | Re: [v9.2] make_greater_string() does not return a string in some cases |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2011-09-22 04:51:35 | Re: EXPLAIN and nfiltered, take two |
Previous Message | David E. Wheeler | 2011-09-22 03:18:26 | Re: citext operator precedence fix |