From: | Alexander Borisov <lex(dot)borisov(at)gmail(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Optimization for lower(), upper(), casefold() functions. |
Date: | 2025-02-06 20:16:01 |
Message-ID: | 65fc398c-864a-47f7-88d1-d8be2219d14a@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
06.02.2025 22:08, Jeff Davis пишет:
> On Thu, 2025-02-06 at 18:39 +0300, Alexander Borisov wrote:
>> Since I started to improve Unicode Case, I used the same approach,
>> essentially a binary search, only not by individual values, but by
>> ranges.
>
> I considered it a 4th approach because of the generated branches in
> case_index(). Case_index() is essentially a part of the data structure
> -- you can't navigate the table without it.
>
> By the way, could we just represent those ranges as another tiny table
> instead of representing them as branches?
>
> In other words, a table something like:
>
> {{.start=0x0000, .end=0x0588, .offset=0},
> {.start=0x10A0, .end=0x1100, .offset=0x10A0 - 0x0588},
> ...
> {.start=0x1E900, .end=0x1E944, .offset=0x1E900 - 0x11D3}}
>
> How would that perform?
I'll try this approach, but it seems like the only hope here is compiler
optimizations.
>> I plan to use the same approach for Unicode Normalization. Acting
>> sequentially and suggesting improvements.
>> Unfortunately I'm new here and I can't say why one place uses one
>> approach and another another - I just don't know.
>
> We have 3 or 4 different solutions (naive binary search, range binary
> search, perfect hashing, and radix tree) to 3 or 4 similar problems
> (normalization, character properties, case mapping, and encoding
> conversion). We should consolidate these approaches somehow.
>
> IIUC, out of the solutions, it seems that either your modified binary
> search or the radix tree are best because they are both fast and both
> allow compact tables.
>
> My questions are: is there a clear winner between radix tree and the
> range binary search for all four problems? If not, what are the trade-
> offs?
I think it is already clear that there is only one thing left - to test
radix tree and range binary in how they will behave in Encoding
and Unicode Case.
That is, implement both approaches in both cases.
I'm thinking of taking big5 (big table there) to utf-8 for testing.
The main thing is to understand how to test it correctly (pgbench).
Okay, I'll do that.
But it's worth remembering that there is no such thing as a silver
bullet.
--
Alexander Borisov
From | Date | Subject | |
---|---|---|---|
Next Message | Masahiko Sawada | 2025-02-06 20:47:52 | Re: Conflict detection for update_deleted in logical replication |
Previous Message | Marcos Pegoraro | 2025-02-06 19:08:24 | Re: Better visualization of default values |