Re: Optimization for lower(), upper(), casefold() functions.

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-02-11 20:08:33
Message-ID: 1e3f4f6f-9b12-4109-b0d2-a4c52b5064f8@gmail.com

On 06.02.2025 22:08, Jeff Davis wrote:
> On Thu, 2025-02-06 at 18:39 +0300, Alexander Borisov wrote:
>> Since I started to improve Unicode Case, I used the same approach,
>> essentially a binary search, only not by individual values, but by
>> ranges.
>
> I considered it a 4th approach because of the generated branches in
> case_index(). Case_index() is essentially a part of the data structure
> -- you can't navigate the table without it.
>
> By the way, could we just represent those ranges as another tiny table
> instead of representing them as branches?
>
> In other words, a table something like:
>
> {{.start=0x0000, .end=0x0588, .offset=0},
> {.start=0x10A0, .end=0x1100, .offset=0x10A0 - 0x0588},
> ...
> {.start=0x1E900, .end=0x1E944, .offset=0x1E900 - 0x11D3}}
>
> How would that perform?

I tried the approach via a range table. The result was worse than
without the table: with the branches generated in the function, the
result is better.

Patch v3 — range binary search via generated branches.
Patch v4 — range binary search via a table.
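
For reference, a minimal sketch of the two variants, built from the
example ranges quoted above. All names are hypothetical and this is
not the patch's actual code; in the real generated case_index() the
branches are emitted in binary-search order rather than the linear
chain shown here:

#include <stdint.h>

typedef struct
{
    uint32_t start;     /* first code point in the range */
    uint32_t end;       /* last code point + 1 */
    uint32_t offset;    /* code point minus table index */
} pg_case_range;

static const pg_case_range case_ranges[] = {
    {0x0000, 0x0588, 0},
    {0x10A0, 0x1100, 0x10A0 - 0x0588},
    /* ... */
    {0x1E900, 0x1E944, 0x1E900 - 0x11D3},
};

/* v4-style: binary search over the range table */
static int
case_index_table(uint32_t cp)
{
    int     lo = 0;
    int     hi = (int) (sizeof(case_ranges) / sizeof(case_ranges[0])) - 1;

    while (lo <= hi)
    {
        int     mid = (lo + hi) / 2;

        if (cp < case_ranges[mid].start)
            hi = mid - 1;
        else if (cp >= case_ranges[mid].end)
            lo = mid + 1;
        else
            return (int) (cp - case_ranges[mid].offset);
    }
    return -1;          /* no entry for this code point */
}

/* v3-style: the same ranges emitted as branches at generation time */
static int
case_index_branches(uint32_t cp)
{
    if (cp < 0x0588)
        return (int) cp;
    if (cp >= 0x10A0 && cp < 0x1100)
        return (int) (cp - (0x10A0 - 0x0588));
    /* ... more generated branches ... */
    if (cp >= 0x1E900 && cp < 0x1E944)
        return (int) (cp - (0x1E900 - 0x11D3));
    return -1;
}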

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding the surrogates 0xD800..0xDFFF), run on both machines.

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

Patch v3: tps = 102.816759
Patch v4: tps = 99.365996
Without: tps = 92.267398

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

Patch v3: tps = 54.278283
Patch v4: tps = 49.895410
Without: tps = 46.212763

>
>> I plan to use the same approach for Unicode Normalization. Acting
>> sequentially and suggesting improvements.
>> Unfortunately I'm new here and I can't say why one place uses one
>> approach and another another - I just don't know.
>
> We have 3 or 4 different solutions (naive binary search, range binary
> search, perfect hashing, and radix tree) to 3 or 4 similar problems
> (normalization, character properties, case mapping, and encoding
> conversion). We should consolidate these approaches somehow.
>
> IIUC, out of the solutions, it seems that either your modified binary
> search or the radix tree are best because they are both fast and both
> allow compact tables.
>
> My questions are: is there a clear winner between radix tree and the
> range binary search for all four problems? If not, what are the trade-
> offs?

Comparing the radix tree and range binary search.

I took the BIG5 encoding to test the two approaches, in both
directions (BIG5 to UTF-8 and UTF-8 to BIG5), because its table is
large and its data is fragmented.

I changed the big5_to_utf8() and utf8_to_big5() functions in
src/backend/utils/mb/conversion_procs/utf8_and_big5/utf8_and_big5.c.

The SQL used for testing was:
SET bytea_output = 'hex';
SELECT CONVERT('\x...'::bytea, 'BIG5', 'UTF8');
and
SELECT CONVERT('\x...'::bytea, 'UTF8', 'BIG5');

All data from BIG5.txt was used for the test; the data was repeated
to create a 1MB query.

Attached is v1-0001-Comparison-two-algorithms-for-Encoding.patch.
This patch is just an experiment, not production-quality code; it
exists only to compare the two algorithms.

The bottom line of the comparison:

BIG5 to UTF-8 table size:
Radix tree: 17088
Range binary search: 22685

UTF-8 to BIG5 table size:
Radix tree: 22839
Range binary search: 24875

Benchmarks:

"used LocalToUtf" — the existing function was used unchanged, with
the range-binary-search lookup function passed to it.

"used LocalToUtfRange" — a modified function was used instead.
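
Concretely, the first variant keeps the existing driver and supplies
the lookup through LocalToUtf()'s conversion-callback argument. A
rough sketch with hypothetical helper and table names (not the
experiment's actual code; range_index() would be the same binary
search as in the case_index_table() sketch above, and the tables
would be generated from BIG5.txt):

/*
 * Returns the packed UTF-8 byte sequence for a BIG5 code, or 0 when
 * there is no mapping.  Passed to LocalToUtf() as its conversion
 * callback, so the driver itself stays unchanged.
 */
static uint32_t
big5_to_utf8_range(uint32_t code)
{
    int     idx = range_index(big5_ranges,
                              sizeof(big5_ranges) / sizeof(big5_ranges[0]),
                              code);

    return (idx >= 0) ? big5_to_utf8_values[idx] : 0;
}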

BIG5 to UTF-8:

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

Patch (used LocalToUtf): tps = 165.447580
Patch (used LocalToUtfRange): tps = 168.514145
Without: tps = 160.552316

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

Patch (used LocalToUtf): tps = 80.527047
Patch (used LocalToUtfRange): tps = 81.626596
Without: tps = 77.545849

UTF-8 to BIG5:

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

Patch: tps = 203.916336
Without: tps = 198.472985

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

Patch: tps = 96.896461
Without: tps = 94.470491

It should be noted that the BIG5 to UTF-8 table can be reduced to
13k records simply by reducing the allowed number of empty records.
But then the number of branches needed to find a value in the table
grows many times over, and performance drops to radix tree levels
(or even a bit lower).
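
The knob being turned there is essentially the generator's rule for
merging adjacent ranges. A sketch of the idea, with a hypothetical
name and threshold (reusing the pg_case_range type from the sketch
above):

#include <stdbool.h>

/*
 * Generator-side sketch: merge two adjacent ranges when the gap
 * between them is small enough, paying for the merge with empty
 * entries in the value table.  Raising the threshold means fewer
 * ranges (fewer branches) but a larger table; lowering it shrinks
 * the table while multiplying the branches.
 */
#define MAX_EMPTY_GAP 64        /* hypothetical tuning constant */

static bool
should_merge(const pg_case_range *a, const pg_case_range *b)
{
    return (b->start - a->end) <= MAX_EMPTY_GAP;
}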

What's the result?

I would use range binary search for Unicode case mapping and
normalization. The algorithm shows good results, and the table can
be tuned (grown or shrunk) by adjusting how many empty values are
allowed.

Also, I have a strong feeling that the encoding conversions could be
optimized and improved. I am interested in this direction and would
like to try my hand at improving them.

But I can't take it all on at once; it requires a systematic
approach. I suggest starting with Unicode case mapping, then Unicode
normalization, and then experimenting with and improving the
encodings.

--
Alexander Borisov

Attachments:
  v1-0001-Comparison-two-algorithms-for-Encoding.patch (text/plain, 1013.8 KB)
  v4-0001-Optimization-for-lower-upper-casefold-functions.patch (text/plain, 706.2 KB)
