From: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
---|---|
To: | Daniel Verite <daniel(at)manitou-mail(dot)org> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Unicode normalization SQL functions |
Date: | 2020-01-09 09:20:14 |
Message-ID: | 2309023a-6f69-f049-70e5-3c70b4fb9672@2ndquadrant.com |
Lists: | pgsql-hackers |
On 2020-01-06 17:00, Daniel Verite wrote:
> Peter Eisentraut wrote:
>
>> Also, there is a way to optimize the "is normalized" test for common
>> cases, described in UTR #15. For that we'll need an additional data
>> file from Unicode. In order to simplify that, I would like my patch
>> "Add support for automatically updating Unicode derived files"
>> integrated first.
>
> Would that explain that the NFC/NFKC normalization and "is normalized"
> check seem abnormally slow with the current patch, or should
> it be regarded independently of the other patch?
That's unrelated.
> For instance, testing 10000 short ASCII strings:
>
> postgres=# select count(*) from (select md5(i::text) as t from
> generate_series(1,10000) as i) s where t is nfc normalized ;
> count
> -------
> 10000
> (1 row)
>
> Time: 2573,859 ms (00:02,574)
>
> By comparison, the NFD/NFKD case is faster by two orders of magnitude:
>
> postgres=# select count(*) from (select md5(i::text) as t from
> generate_series(1,10000) as i) s where t is nfd normalized ;
> count
> -------
> 10000
> (1 row)
>
> Time: 29,962 ms
>
> Although NFC/NFKC has a recomposition step that NFD/NFKD
> doesn't have, such a difference is surprising.
This is very likely because the recomposition step calls
recompose_code(), which does a sequential scan of UnicodeDecompMain for
each character. To optimize that, we should probably build a bespoke
reverse mapping table that can be searched more efficiently.
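
As a rough illustration of the reverse-mapping idea (a sketch only, not code from the patch): the struct decomp_entry, the six-entry sample table, and the function names recompose_linear() and recompose_indexed() below are all hypothetical stand-ins, and the real UnicodeDecompMain layout is different. The sketch contrasts a sequential scan with a lookup in a copy of the table sorted by the (first, second) pair, which can then be searched with bsearch().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical, simplified decomposition entry: one composed code point
 * and the two code points it canonically decomposes to.
 */
typedef struct
{
	uint32_t	composed;
	uint32_t	first;
	uint32_t	second;
} decomp_entry;

/* Tiny illustrative sample, ordered by composed code point. */
static const decomp_entry decomp_table[] = {
	{0x00C0, 0x0041, 0x0300},	/* A + combining grave */
	{0x00C1, 0x0041, 0x0301},	/* A + combining acute */
	{0x00C9, 0x0045, 0x0301},	/* E + combining acute */
	{0x00E0, 0x0061, 0x0300},	/* a + combining grave */
	{0x00E9, 0x0065, 0x0301},	/* e + combining acute */
	{0x00F1, 0x006E, 0x0303},	/* n + combining tilde */
};

#define DECOMP_COUNT (sizeof(decomp_table) / sizeof(decomp_table[0]))

/* Baseline: sequential scan over the whole table, O(n) per lookup. */
static int
recompose_linear(uint32_t first, uint32_t second, uint32_t *result)
{
	for (size_t i = 0; i < DECOMP_COUNT; i++)
	{
		if (decomp_table[i].first == first &&
			decomp_table[i].second == second)
		{
			*result = decomp_table[i].composed;
			return 1;
		}
	}
	return 0;
}

/* Reverse table: same entries, but sorted by (first, second). */
static decomp_entry reverse_table[DECOMP_COUNT];
static int	reverse_ready = 0;

/* Order entries by the pair (first, second); composed is ignored. */
static int
compare_pair(const void *a, const void *b)
{
	const decomp_entry *x = a;
	const decomp_entry *y = b;

	if (x->first != y->first)
		return x->first < y->first ? -1 : 1;
	if (x->second != y->second)
		return x->second < y->second ? -1 : 1;
	return 0;
}

/* Built lazily here; a real implementation could generate it at build time. */
static void
build_reverse_table(void)
{
	for (size_t i = 0; i < DECOMP_COUNT; i++)
		reverse_table[i] = decomp_table[i];
	qsort(reverse_table, DECOMP_COUNT, sizeof(decomp_entry), compare_pair);
	reverse_ready = 1;
}

/* Binary search in the reverse table, O(log n) per lookup. */
static int
recompose_indexed(uint32_t first, uint32_t second, uint32_t *result)
{
	decomp_entry key = {0, first, second};
	const decomp_entry *hit;

	assert(result != NULL);
	if (!reverse_ready)
		build_reverse_table();
	hit = bsearch(&key, reverse_table, DECOMP_COUNT,
				  sizeof(decomp_entry), compare_pair);
	if (hit == NULL)
		return 0;
	*result = hit->composed;
	return 1;
}
```

Both functions return 1 and store the composed code point when the pair has a composition, and return 0 otherwise. With a table the size of the real one, the per-lookup cost drops from linear to logarithmic; a hash-based lookup would be another option.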
> I've tried an alternative implementation based on ICU's
> unorm2_isNormalized() /unorm2_normalize() functions (which I'm
> currently adding to the icu_ext extension to be exposed in SQL).
> With these, the 4 normal forms are in the 20ms ballpark with the above
> test case, without a clear difference between composed and decomposed
> forms.
That's good feedback.
> Independently of the performance, I've compared the results
> of the ICU implementation vs this patch on large series of strings
> with all normal forms and could not find any difference.
And that too.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services