Optimization for lower(), upper(), casefold() functions.

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Optimization for lower(), upper(), casefold() functions.
Date: 2025-01-29 20:23:45
Message-ID: 7cac7e66-9a3b-4e3f-a997-42aa0c401f80@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi, hackers!

I propose to consider a simple optimization for Unicode case tables.
The main changes affect the generate-unicode_case_table.pl file.

Because of the modified approach of record search by table we
managed to:
1. Removed storing Unicode codepoints (unsigned int) in all tables.
2. Reduce the main table from 3003 to 1575 (duplicates removed) records.
3. Replace pointer (essentially uint64_t) with uin8_t in the main table.
4. Reduced the time to find a record in the table.
5. Reduce the size of the final object file.

The approach is generally as follows:
Group Unicode codepoints into ranges in which the difference between
neighboring elements does not exceed the specified limit.
For example, if there are numbers 1, 2, 3, 5, 6 and limit = 1, then
there is a difference of 2 between 3 and 5, which is greater than 1,
so there will be ranges 1-3 and 5-6.

Then we form a table (let's call it an index table) by combining the
obtained ranges. The table contains uint16_t index to the main table.

Then from the previously obtained diapasons we form a function
(get_case()) to get the index to the main table. The function, in fact,
contains only IF/ELSE IF constructs imitating binary search.

Because we are not directly accessing the main table with data, we can
exclude duplicates from it, and there are almost half of them.
Also, because get_case() contains all the information about Unicode
ranges, we don't need to store Unicode codepoints in the main table.
Also because of this approach some checks were removed, which allowed
to increase performance even with fast path (codepoints < 0x80).

casefold() test.

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

ASCII:
Repeated characters (700kb) in the range from 0x20 to 0x7E.
Patch: tps = 282.457745
Without: tps = 263.749652

Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch: tps = 82.399637
Without: tps = 48.291034

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch: tps = 120.703471
Without: tps = 92.423490

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

ASCII:
Repeated characters (700kb) in the range from 0x20 to 0x7E.
Patch: tps = 172.291972
Without: tps = 111.592281

Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch: tps = 36.487650
Without: tps = 22.537515

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch: tps = 55.190635
Without: tps = 45.493104

--
SberTech

Alexander Borisov

Attachment Content-Type Size
0001-Optimization-for-lower-upper-casefold-functions.patch text/plain 657.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Borisov 2025-01-29 20:52:59 Re: Optimization for lower(), upper(), casefold() functions.
Previous Message Corey Huinker 2025-01-29 20:04:40 Re: Extended Statistics set/restore/clear functions.