Re: Optimization for lower(), upper(), casefold() functions.

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-02-04 20:19:57
Message-ID: 340f2451-cd76-487b-a1e4-0f0a2fe91fba@gmail.com
Lists: pgsql-hackers

31.01.2025 16:13, Alexander Borisov wrote:
> 31.01.2025 01:43, Heikki Linnakangas wrote:

..

>
> Thanks, after the weekend I'll send an updated patch that takes into
> account the comments/advice.

I've run many different experiments, and in every case the result is
within the margin of error of the v2 patch result.

The v3 patch contains the following changes:
1. Removed the stored Unicode codepoints (unsigned int) from all tables.
2. Reduced the main table from 3003 to 1677 records (duplicates removed).
3. Replaced pointers (essentially uint64_t) with uint8_t in the main table.
4. Partitioned the main table into separate tables by case type.
5. Reduced the time needed to find a record in the table.
6. Reduced the size of the final object file.
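
To illustrate items 1, 3 and 4, here is a minimal sketch of the general
idea. It is not the code from the v3 patch: all names, ranges and table
contents below are hypothetical and cover only three Cyrillic letters.
The codepoint is implied by its position in the index array, so it is
not stored; each index entry is a one-byte slot number rather than a
pointer; and each case type has its own small table of values.

```c
#include <stdint.h>

/* Hypothetical tables for illustration only, not taken from the patch. */
#define RANGE_LEN 3u

/* Per-case-type value tables; slot 0 is reserved for "no mapping". */
static const uint32_t lower_values[] = {0, 0x0430, 0x0431, 0x0432};
static const uint32_t upper_values[] = {0, 0x0410, 0x0411, 0x0412};

/*
 * One byte per covered codepoint (a pointer would typically take 8 bytes
 * on a 64-bit platform). The codepoint itself is not stored: entry i
 * belongs to codepoint (first + i).
 */
static const uint8_t lower_index[RANGE_LEN] = {1, 2, 3}; /* U+0410..U+0412 */
static const uint8_t upper_index[RANGE_LEN] = {1, 2, 3}; /* U+0430..U+0432 */

/* Look up cp in one case-type table; return cp unchanged if unmapped. */
static uint32_t
sketch_lookup(uint32_t cp, uint32_t first,
              const uint8_t *index, const uint32_t *values)
{
    uint8_t slot;

    if (cp < first || cp - first >= RANGE_LEN)
        return cp;              /* outside the range this sketch covers */

    slot = index[cp - first];
    return slot != 0 ? values[slot] : cp;
}

static uint32_t
sketch_tolower(uint32_t cp)
{
    return sketch_lookup(cp, 0x0410, lower_index, lower_values);
}

static uint32_t
sketch_toupper(uint32_t cp)
{
    return sketch_lookup(cp, 0x0430, upper_index, upper_values);
}
```

With these toy tables, sketch_tolower(0x0410) yields 0x0430, and any
codepoint outside the covered ranges is returned unchanged.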

Trying different, denser ways of packing the data led to more
complicated code, but the results remained essentially the same.

Of course, the main goals have been accomplished:
Increased processing speed.
Reduced the size of the tables and, consequently, the size of the object file.

casefold() test.

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

ASCII:
Repeated characters (700 KB) in the range from 0x20 to 0x7E.
Patch: tps = 278.449809
Without: tps = 266.526168

Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch: tps = 86.740680
Without: tps = 49.373695

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch: tps = 102.221092
Without: tps = 92.477798

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

ASCII:
Repeated characters (700 KB) in the range from 0x20 to 0x7E.
Patch: tps = 146.712371
Without: tps = 120.794307

Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch: tps = 44.499567
Without: tps = 24.237999

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch: tps = 54.354833
Without: tps = 46.556531
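
For a quick read of the tps figures above, the relative gain of each
run can be computed as (patched / unpatched - 1) * 100. The helper
below is only a convenience sketch, not part of the patch, and its
name is made up for this example:

```c
#include <stdio.h>

/*
 * Relative throughput gain, in percent, of the patched build over the
 * unpatched one. Hypothetical helper, not part of the patch.
 */
static double
speedup_percent(double patched_tps, double baseline_tps)
{
    return (patched_tps / baseline_tps - 1.0) * 100.0;
}

/* Print one line per benchmark run: label and gain in percent. */
static void
report_gain(FILE *out, const char *label,
            double patched_tps, double baseline_tps)
{
    fprintf(out, "%-24s %+.1f%%\n", label,
            speedup_percent(patched_tps, baseline_tps));
}
```

For example, report_gain(stdout, "Cyrillic (macOS)", 86.740680,
49.373695) prints the gain for the Cyrillic run on macOS, which is
roughly 76%; the same run on Ubuntu comes out at roughly 84%.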

--
Alexander Borisov

Attachment Content-Type Size
v3-0001-Optimization-for-lower-upper-casefold-functions.patch text/plain 704.8 KB
