From: | Alexander Borisov <lex(dot)borisov(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Optimization for lower(), upper(), casefold() functions. |
Date: | 2025-01-29 20:23:45 |
Message-ID: | 7cac7e66-9a3b-4e3f-a997-42aa0c401f80@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi, hackers!
I propose to consider a simple optimization for Unicode case tables.
The main changes affect the generate-unicode_case_table.pl file.
Because of the modified approach of record search by table we
managed to:
1. Removed storing Unicode codepoints (unsigned int) in all tables.
2. Reduce the main table from 3003 to 1575 (duplicates removed) records.
3. Replace pointer (essentially uint64_t) with uin8_t in the main table.
4. Reduced the time to find a record in the table.
5. Reduce the size of the final object file.
The approach is generally as follows:
Group Unicode codepoints into ranges in which the difference between
neighboring elements does not exceed the specified limit.
For example, if there are numbers 1, 2, 3, 5, 6 and limit = 1, then
there is a difference of 2 between 3 and 5, which is greater than 1,
so there will be ranges 1-3 and 5-6.
Then we form a table (let's call it an index table) by combining the
obtained ranges. The table contains uint16_t index to the main table.
Then from the previously obtained diapasons we form a function
(get_case()) to get the index to the main table. The function, in fact,
contains only IF/ELSE IF constructs imitating binary search.
Because we are not directly accessing the main table with data, we can
exclude duplicates from it, and there are almost half of them.
Also, because get_case() contains all the information about Unicode
ranges, we don't need to store Unicode codepoints in the main table.
Also because of this approach some checks were removed, which allowed
to increase performance even with fast path (codepoints < 0x80).
casefold() test.
* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)
ASCII:
Repeated characters (700kb) in the range from 0x20 to 0x7E.
Patch: tps = 282.457745
Without: tps = 263.749652
Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch: tps = 82.399637
Without: tps = 48.291034
Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch: tps = 120.703471
Without: tps = 92.423490
* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)
ASCII:
Repeated characters (700kb) in the range from 0x20 to 0x7E.
Patch: tps = 172.291972
Without: tps = 111.592281
Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch: tps = 36.487650
Without: tps = 22.537515
Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch: tps = 55.190635
Without: tps = 45.493104
--
SberTech
Alexander Borisov
Attachment | Content-Type | Size |
---|---|---|
0001-Optimization-for-lower-upper-casefold-functions.patch | text/plain | 657.3 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Alexander Borisov | 2025-01-29 20:52:59 | Re: Optimization for lower(), upper(), casefold() functions. |
Previous Message | Corey Huinker | 2025-01-29 20:04:40 | Re: Extended Statistics set/restore/clear functions. |