From: | "Daniel Verite" <daniel(at)manitou-mail(dot)org> |
---|---|
To: | "Peter Eisentraut" <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
Cc: | "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Unicode normalization SQL functions |
Date: | 2020-01-28 09:48:45 |
Message-ID: | 623fa07e-348f-4273-afa4-7110ad43ca57@manitou-mail.org |
Lists: | pgsql-hackers |
Peter Eisentraut wrote:
> Here is an updated patch set that now also implements the "quick check"
> algorithm from UTR #15 for making IS NORMALIZED very fast in many cases,
> which I had mentioned earlier in the thread.
I found a bug in unicode_is_normalized_quickcheck() which is
triggered when the last codepoint of the string is at or above
U+10000, that is, outside the BMP. On encountering it, the code does:
+ if (is_supplementary_codepoint(ch))
+ p++;
When ch is the last codepoint, this makes p point to
the terminating zero, and the subsequent p++ done by
the for loop then steps past it, so the loop misses its exit
condition and reads beyond the end of the string.
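The effect can be reproduced safely with an index instead of a raw pointer. The following is only a minimal sketch, not the patch's code: walk_end_index() and is_supplementary() are hypothetical names, uint32_t stands in for pg_wchar, and the explicit capacity exists solely so that the simulation itself never reads out of bounds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's is_supplementary_codepoint(). */
static bool is_supplementary(uint32_t ch)
{
    return ch >= 0x10000;
}

/*
 * Walk a zero-terminated codepoint array the way the loop in question does,
 * optionally with the extra advance after a supplementary codepoint.
 * Returns the index at which the walk stopped. "cap" is the number of
 * array elements and bounds the walk so this sketch cannot over-read.
 */
static size_t walk_end_index(const uint32_t *buf, size_t cap, bool extra_skip)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < cap && buf[i] != 0; i++)
    {
        if (extra_skip && is_supplementary(buf[i]))
            i++;            /* the extra advance under discussion */
    }
    return i;
}
```

For the array {0x61, 0x1F600, 0, 0x62, 0x63, 0}, where the elements after the first zero model whatever memory happens to follow the string, the walk stops at index 2 without the extra advance and at index 5 with it, having stepped over the real terminator.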
But anyway, what's the reason for skipping the codepoint
following a codepoint outside of the BMP?
I've figured that it comes from porting the Java code in UAX#15:
    public int quickCheck(String source) {
        short lastCanonicalClass = 0;
        int result = YES;
        for (int i = 0; i < source.length(); ++i) {
            int ch = source.codepointAt(i);
            if (Character.isSupplementaryCodePoint(ch)) ++i;
            short canonicalClass = getCanonicalClass(ch);
            if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
                return NO;
            }
            int check = isAllowed(ch);
            if (check == NO) return NO;
            if (check == MAYBE) result = MAYBE;
            lastCanonicalClass = canonicalClass;
        }
        return result;
    }
source.length() is the length in UTF-16 code units, in which a surrogate
pair counts as 2. This would be why it does
    if (Character.isSupplementaryCodePoint(ch)) ++i;
which is meant to skip the second UTF-16 code unit of the pair.
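To make the "counts for 2" point concrete, here is a small sketch of how one codepoint maps to UTF-16 code units. utf16_encode_sketch() is a hypothetical helper, not something from the patch or from Java, and it assumes the input is a valid Unicode scalar value (no range or surrogate checks).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one codepoint as UTF-16 and return the number of code units
 * written: 1 for a BMP codepoint, 2 (a surrogate pair) for a
 * supplementary one. Assumes cp is a valid scalar value.
 */
static int utf16_encode_sketch(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp < 0x10000)
    {
        out[0] = (uint16_t) cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t) (0xD800 + (cp >> 10));      /* high surrogate */
    out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));    /* low surrogate */
    return 2;
}
```

U+1F600 comes out as the pair D83D DE00, so a Java loop indexing by code unit has to advance twice to get past it, whereas an array of 32-bit codepoints holds it in a single element.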
As this does not apply to the 32-bit pg_wchar, I think the two lines above
in the C implementation should just be removed.
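With those two lines gone, the loop would look roughly like the sketch below. This is an illustration under stated assumptions, not the patch itself: codepoint_t stands in for pg_wchar, and toy_canonical_class() and toy_is_allowed() are hypothetical lookups covering only a handful of codepoints, where the real code would consult the generated Unicode tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t codepoint_t;   /* hypothetical stand-in for pg_wchar */

enum qc_result { QC_NO, QC_YES, QC_MAYBE };

/* Toy canonical combining class lookup (tiny subset, for illustration). */
static uint8_t toy_canonical_class(codepoint_t ch)
{
    switch (ch)
    {
        case 0x0301: return 230;    /* COMBINING ACUTE ACCENT */
        case 0x0323: return 220;    /* COMBINING DOT BELOW */
        default:     return 0;
    }
}

/* Toy NFC quick-check property lookup (tiny subset, for illustration). */
static enum qc_result toy_is_allowed(codepoint_t ch)
{
    switch (ch)
    {
        case 0x0340: return QC_NO;      /* COMBINING GRAVE TONE MARK */
        case 0x0301:
        case 0x0323: return QC_MAYBE;
        default:     return QC_YES;
    }
}

/* Quick check over a zero-terminated array, one element per codepoint. */
static enum qc_result quick_check_sketch(const codepoint_t *input)
{
    uint8_t last_class = 0;
    enum qc_result result = QC_YES;

    assert(input != NULL);
    /* No extra advance: every element is already a whole codepoint. */
    for (const codepoint_t *p = input; *p; p++)
    {
        uint8_t cls = toy_canonical_class(*p);
        enum qc_result check;

        if (last_class > cls && cls != 0)
            return QC_NO;
        check = toy_is_allowed(*p);
        if (check == QC_NO)
            return QC_NO;
        if (check == QC_MAYBE)
            result = QC_MAYBE;
        last_class = cls;
    }
    return result;
}
```

In this form a string ending in U+1F600 stops exactly on the terminator, and a codepoint that follows a supplementary one is still examined rather than skipped.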
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite