From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter(at)eisentraut(dot)org> |
Subject: | Re: Collation & ctype method table, and extension hooks |
Date: | 2025-01-15 20:42:46 |
Message-ID: | 679b689354bc4cd394bac850c7a03454d3412c0b.camel@j-davis.com |
Lists: | pgsql-hackers |
On Thu, 2025-01-09 at 16:19 -0800, Jeff Davis wrote:
> On Mon, 2024-12-02 at 23:58 -0800, Jeff Davis wrote:
> > On Mon, 2024-12-02 at 16:39 +0100, Andreas Karlsson wrote:
> > > I feel your first patch in the series is something you can just
> > > commit.
> >
> > Done.
> >
> > I combined your patches and mine into the attached v10 series.
>
> Here's v12 after committing a few of the earlier patches.
I collected some performance numbers for a worst case on UTF8. This is
where each row is a million characters wide and each character is
greater than MAX_SIMPLE_CHAR (U+07FF):
  create table wide (t text);
  insert into wide
    select repeat('カ', 1048576)
    from generate_series(1,1000) g;
  select 1 from wide where t ~ '([[:punct:]]|[[:lower:]])'
    collate "the_collation";
results:
                  master   patched
  C                 3736      3589
  pg_c_utf8        19500     23404
  en_US            10251     12396
  en-US-x-icu      10264     11963
And a separate test for ILIKE on en_US.iso885915 where each character
is beyond the ASCII range and needs to be lowercased using the
optimization for single-byte encodings in Generic_Text_IC_like:
  create table sb (t text);
  insert into sb
    select repeat('É', 1048576)
    from generate_series(1, 3000) g;
  select 1 from sb where t ilike '%á%';
results:
                  master   patched
  C                 2900      2812
  en_US             2203      3702
  en-US-x-icu      17483     18123
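For reading the two tables, the relative change per row is (patched - master) / master. A minimal C sketch of that arithmetic follows; pct_change is a hypothetical helper name used only for illustration and is not part of the patch series.

```c
#include <assert.h>

/*
 * Relative change from master to patched, in percent.
 * Positive means the patched build was slower.
 * pct_change is a hypothetical name used for illustration only.
 */
static double
pct_change(double master, double patched)
{
    assert(master > 0.0);   /* a timing of zero would make the ratio meaningless */
    return (patched - master) / master * 100.0;
}
```

With the figures above, pct_change(2203, 3702) for en_US in the ILIKE test is roughly 68, and pct_change(19500, 23404) for pg_c_utf8 in the regex test is roughly 20. The C rows come out slightly negative in both tests.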
The numbers from both tests show a slowdown everywhere except the C
collation. The worst one is probably tolower() for libc in LATIN9 (the
en_US row of the ILIKE test), which appears to be heavily optimized,
and the extra indirection for a method call slows things down quite a
bit.
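To make the "extra indirection" concrete, here is a rough C sketch contrasting a direct tolower() call with the same call routed through a method table. This is not code from the patch series: demo_ctype_methods, demo_locale, lower_direct and lower_via_methods are all hypothetical names, and the real structs in the patches may look quite different.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical method table; NOT the struct used in the patch series. */
typedef struct demo_ctype_methods
{
    int         (*char_tolower) (int c);
} demo_ctype_methods;

/* Hypothetical locale object that carries a pointer to its methods. */
typedef struct demo_locale
{
    const demo_ctype_methods *ctype;
} demo_locale;

/* Provider implementation that simply forwards to libc. */
static int
demo_libc_char_tolower(int c)
{
    return tolower(c);
}

static const demo_ctype_methods demo_libc_methods = {
    demo_libc_char_tolower
};

/* Direct call: the call site names tolower() itself. */
static void
lower_direct(unsigned char *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        s[i] = (unsigned char) tolower(s[i]);
}

/*
 * Method call: each character goes through a function pointer loaded
 * from the method table, then through the wrapper, then into libc.
 */
static void
lower_via_methods(const demo_locale *loc, unsigned char *s, size_t n)
{
    assert(loc != NULL && loc->ctype != NULL);
    for (size_t i = 0; i < n; i++)
        s[i] = (unsigned char) loc->ctype->char_tolower(s[i]);
}
```

Both functions produce the same output; the second adds an indirect call per character, which is the kind of per-character overhead that would be most visible when the underlying tolower() is very cheap.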
This is a bit unfortunate because the method table feels like the right
code organization. Having special cases at the call sites (aside from
ctype_is_c) is not great. Are the above numbers bad enough that we need
to give up on this method-ization approach? Or should we say that the
above cases don't represent reality, and a moderate regression there is
OK?
Or perhaps someone has an idea how to mitigate the regression? I could
imagine another cache of character properties, like an extensible
pg_char_properties. I'm not sure if the extra complexity is worth it,
though.
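One possible shape for such a cache, purely as an assumption for discussion: for single-byte encodings, fill a 256-entry table once through the method pointer and then do plain array lookups in the hot loop. The sketch below is not the proposed pg_char_properties design; cache_tolower_fn, sb_lower_cache and the two functions are hypothetical names, and it covers only lowercasing, not general character properties.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical signature for a provider's per-character lowercase method. */
typedef int (*cache_tolower_fn) (int c);

/* Hypothetical per-locale cache for single-byte encodings. */
typedef struct sb_lower_cache
{
    unsigned char lower[256];
} sb_lower_cache;

/* Pay the indirect call 256 times up front instead of once per input byte. */
static void
sb_lower_cache_init(sb_lower_cache *cache, cache_tolower_fn fn)
{
    assert(cache != NULL && fn != NULL);
    for (int c = 0; c < 256; c++)
        cache->lower[c] = (unsigned char) fn(c);
}

/* Hot loop: a table lookup per byte, no function call at all. */
static void
sb_lower_cached(const sb_lower_cache *cache, unsigned char *s, size_t n)
{
    assert(cache != NULL);
    for (size_t i = 0; i < n; i++)
        s[i] = cache->lower[s[i]];
}
```

The table would have to be built per locale and kept alongside it, which is part of the extra complexity in question; multibyte encodings would need a different structure.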
Regards,
Jeff Davis