Speed up ICU case conversion by using ucasemap_utf8To*()

From: Andreas Karlsson <andreas(at)proxel(dot)se>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Speed up ICU case conversion by using ucasemap_utf8To*()
Date: 2024-12-20 05:20:38
Message-ID: 167986ff-afcf-4542-94c6-61ee8474e138@proxel.se
Lists: pgsql-hackers

Hi,

Jeff pointed out to me that the case conversion functions in ICU have
UTF-8-specific versions, which means we can call those directly when the
database encoding is UTF-8 and skip converting to and from UChar.

Since most people today run their databases in UTF-8, I think this
optimization is worth it. When measuring on short to medium-length
strings I got a 15-20% speedup. It is still slower than glibc in my
benchmarks, but the gap is smaller now.

SELECT count(upper) FROM (SELECT upper(('Kålhuvud ' || i) COLLATE
"sv-SE-x-icu") FROM generate_series(1, 1000000) i);

master:  ~540 ms
patched: ~460 ms
glibc:   ~410 ms

I have also attached a cleanup patch for the non-UTF-8 code paths. I
thought about doing the same for the new UTF-8 code paths, but it turned
out to be a bit messy because ucasemap_utf8ToUpper() and
ucasemap_utf8ToLower() have a different function signature than
ucasemap_utf8ToTitle().

Andreas

Attachment Content-Type Size
v1-0001-Use-optimized-versions-of-ICU-case-conversion-for.patch text/x-patch 6.7 KB
v1-0002-Reduce-code-duplication-in-ICU-case-mapping-code.patch text/x-patch 3.9 KB
