From: John Naylor <john(dot)naylor(at)2ndquadrant(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject: Re: Unicode normalization SQL functions
Date: 2020-03-26 17:41:43
Message-ID: CACPNZCsiE6qN=yc-LBNLU5b9Hy06yagOXwsWGTJTvpXcrMHNTw@mail.gmail.com
Lists: pgsql-hackers
On Thu, Mar 26, 2020 at 3:26 PM Peter Eisentraut
<peter(dot)eisentraut(at)2ndquadrant(dot)com> wrote:
> I have figured this out. New patch is attached.
>
> First, I have added #ifndef FRONTEND, as mentioned above, so libpq isn't
> bloated. Second, I have changed the lookup structure to a bitfield, so
> each entry is only 32 bits instead of 64. Third, I have dropped the
> quickcheck tables for the NFD and NFKD forms. Those are by far the
> biggest tables, and you still get okay performance if you do the
> normalization check the long way, since we don't need the recomposition
> step on those cases, which is by far the slowest part. The main use
> case of all of this, I expect, is to check for NFC normalization, so
> it's okay if the other variants are not optimized to the same extent.
Reading the link cited in the patch
(http://www.unicode.org/reports/tr15/#Detecting_Normalization_Forms):

"The data for the implementation of the isAllowed() call can be
accessed in memory with a hash table or a trie (see Section 14,
Implementation Notes); the latter will be the fastest."
We don't have a trie implementation in Postgres, but we do have a
perfect hash implementation. Using that would bring the tables back up
to 64 bits per entry, but lookups would likely be noticeably faster
than binary search. Since v4 has left out the biggest tables entirely,
I think this is worth a look for the smaller tables that remain.
In the attached v5, when building the hash tables, we sort the code
points by NO/MAYBE, and store the index of the beginning of the NO
block:
MMMNNNNNNNNN
~~~^
That way we can tell a NO from a MAYBE by testing the result of the hash lookup.
Regression tests pass, but I haven't measured performance yet. I had
to fiddle with the hash seeds a bit to get the larger table to build
successfully.
Also, if we go with v4, I noticed the following test is present twice:
+SELECT "normalize"('abc', 'def'); -- run-time error
--
John Naylor https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment: v5-Add-SQL-functions-for-Unicode-normalization.patch (application/x-patch, 193.9 KB)