From: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
---|---|
To: | Peter Geoghegan <pg(at)bowt(dot)ie> |
Cc: | Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: insensitive collations |
Date: | 2019-02-21 08:36:12 |
Message-ID: | b6b28f54-a667-87c2-870f-c9f07dd796ee@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 2019-02-21 03:17, Peter Geoghegan wrote:
> I wonder if it would be better to break this into distinct commits?
I thought about that. Especially the planner/executor changes could be
done separately, sort of as a way to address the thread
"ExecBuildGroupingEqual versus collations". But I'm not sure if they
would have good test coverage on their own. I can work on this if
people think this would be useful.
> * Why is support for non-deterministic comparisons limited to the ICU
> provider? If that's the only case that could possibly be affected,
> then why did we ever add the varstrcmp() tie-breaker call to strcmp()
> in the first place? The behavior described in the commit message of
> bugfix commit 656beff5 describes a case where Hungarian text caused
> index corruption by being strcoll()-wise equal but not bitwise equal.
> Besides, I think that you can vendor your own case insensitive
> collation with glibc, since it's based on UCA.
The original test case (described here:
https://www.postgresql.org/message-id/27064.1134753128%40sss.pgh.pa.us)
no longer works, so it was probably fixed on the glibc side. The git
log of the hu_HU locale definition shows that it has been "fixed" in
major ways several times over the years, so that's plausible.
I tried reproducing some more practical scenarios involving combining
diacritics, but glibc apparently doesn't believe strings in different
normal forms are equal. At that point I gave up because this doesn't
seem worthwhile to support.
Moreover, I think allowing this would require a "trusted" strxfrm(),
which is currently disabled.
>> + return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo,
>> + DEFAULT_COLLATION_OID,
>> + oldvalue, newvalue));
>> }
>
> The existing restriction on ICU collations (that they cannot be
> database-wide) side-steps the issue.
>
> * Can said restriction somehow be lifted? That seems like it would be
> a lot cleaner.
Lift what restriction? That ICU collations cannot be database-wide?
Sure that would be nice, but it's a separate project. Even then, I'm
not sure that we would allow a database-wide collation to be
nondeterministic. That would for example disallow the use of LIKE,
which would be weird. In any case, the above issue can be addressed
then. I think it's not worth complicating this right now.
> * Why have you disable this optimization?:
>
>> /* Fast pre-check for equality, as discussed in varstr_cmp() */
>> - if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
>> + if ((!sss->locale || sss->locale->deterministic) &&
>> + len1 == len2 && memcmp(a1p, a2p, len1) == 0)
>
> I don't see why this is necessary. A non-deterministic collation
> cannot indicate that bitwise identical strings are non-equal.
Right, I went too far there.
> * Perhaps you should add a "Tip" referencing the feature to the
> contrib/citext documentation.
Good idea.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | mitani | 2019-02-21 08:44:00 | [PATCH] Allow transaction to partition table using FDW |
Previous Message | Jamison, Kirk | 2019-02-21 08:20:55 | RE: Timeout parameters |