From: | Alexander Borisov <lex(dot)borisov(at)gmail(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Optimization for lower(), upper(), casefold() functions. |
Date: | 2025-01-31 13:13:36 |
Message-ID: | 12956299-1496-4455-b3ab-299cc0a87d7c@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
31.01.2025 01:43, Heikki Linnakangas пишет:
Hi Heikki,
> Did you consider using a radix tree? We use that method in src/backend/
> utils/mb/Unicode/convutils.pm. I'm not sure if that's better or worse
> than what's proposed here, but it would seem like a more standard
> technique at least. Or if this is clearly better, then maybe we should
> switch to this technique in convutils.pm too. A perfect hash would be
> another alternative, we use that in src/common/unicode_norm.c.
I looked at the radix tree implementation, and according to number of
branches and mathematical operations I think radix tree will not be
faster than the proposed approach.
About the perfect hash.
The problem with the perfect hash is that it requires a Unicode
codepoint to be stored for matching.
Originally I started to optimize Unicode Normalization Form in Postgres.
But I decided to “practice” on a case so as not to scare anyone with
a big patch at once. Actually, I want to do Unicode in Postgres,
optimizations and improvements.
> Did you check that these optimizations still win with Unicode version 16
> (https://www.postgresql.org/message-id/146349e4-4687-4321-91af-
> f235572490a8(at)eisentraut(dot)org)? We haven't updated to that yet, but sooner
> or later we will.
Yes, everything works just as well with Unicode version 16 data.
> The way you're defining 'pg_unicode_case_index' as a function in the
> header file won't work. It needs to be a static inline function if it's
> in the header. Or put it in a .c file.
I agree, it needs to be moved to a .c file.
> Some ideas on how to squeeze this further:
>
> - Instead of having one table that contains Lower/Title/Upper/Fold for
> every character, it might be better to have four separate tables. I
> think that would be more cache-friendly: you typically run one of the
> functions for many different characters in a loop, rather than all of
> the functions for the same character. You could deduplicate between the
> tables too: for many ranges of characters, Title=Upper and Lower=Fold.
I'll try to experiment with that. Theoretically the performance should
increase.
> - The characters are stored as 4-byte integers, but the high byte is
> always 0. Could squeeze those out. Not sure if that would be a win if it
> makes the accesses unaligned, but you could benchmark that.
> Alternatively, use that empty byte to store the 'special_case' index,
> instead of having a separate field for it.
I thought about it, but it seems that it will make the code much more
complicated, and we won't gain much.
>
> - Many characters that have a special case only need the special case
> for some of the functions, not all. If you stored the special_case
> separately for each function (as the high byte in the 'simplemap' field
> perhaps, like I suggested on previous point), you could avoid having
> those "dummy" special cases.
That's a good idea. Then the main table will be reduced by uint8*n.
Thanks, after the weekend I'll send an updated patch that takes into
account the comments/advice.
--
SberTech
Alexander Borisov
From | Date | Subject | |
---|---|---|---|
Next Message | Maxim Orlov | 2025-01-31 13:29:15 | Re: Convert macros to static inline functions |
Previous Message | Ashutosh Bapat | 2025-01-31 12:59:11 | Re: NOT ENFORCED constraint feature |