Re: Collation & ctype method table, and extension hooks

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter(at)eisentraut(dot)org>
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2025-01-15 20:42:46
Message-ID: 679b689354bc4cd394bac850c7a03454d3412c0b.camel@j-davis.com
Lists: pgsql-hackers

On Thu, 2025-01-09 at 16:19 -0800, Jeff Davis wrote:
> On Mon, 2024-12-02 at 23:58 -0800, Jeff Davis wrote:
> > On Mon, 2024-12-02 at 16:39 +0100, Andreas Karlsson wrote:
> > > I feel your first patch in the series is something you can just
> > > commit.
> >
> > Done.
> >
> > I combined your patches and mine into the attached v10 series.
>
> Here's v12 after committing a few of the earlier patches.

I collected some performance numbers for a worst case on UTF8. This is
where each row is a million characters wide and every character is
greater than MAX_SIMPLE_CHAR (U+07FF):

create table wide (t text);
insert into wide
select repeat('カ', 1048576)
from generate_series(1,1000) g;

select 1 from wide where t ~ '([[:punct:]]|[[:lower:]])'
collate "the_collation";

results:

               master   patched
C                3736      3589
pg_c_utf8       19500     23404
en_US           10251     12396
en-US-x-icu     10264     11963

And a separate test for ILIKE on en_US.iso885915 where each character
is beyond the ASCII range and needs to be lowercased using the
optimization for single-byte encodings in Generic_Text_IC_like:

create table sb (t text);
insert into sb
select repeat('É', 1048576)
from generate_series(1, 3000) g;

select 1 from sb where t ilike '%á%';

results:

               master   patched
C                2900      2812
en_US            2203      3702
en-US-x-icu     17483     18123

The numbers from both tests show a slowdown for everything except the C
collation. The worst one is probably tolower() for libc in LATIN9
(en_US in the second test, 2203 -> 3702), which appears to be heavily
optimized, so the extra indirection for a method call slows things down
quite a bit.
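
To make the indirection concrete, here is a minimal sketch of the two call
shapes being compared. It is not the code from the patch series: the struct
ctype_methods_sketch, its char_tolower member, and the lower_* functions are
hypothetical names made up for illustration. The only real API used is libc's
tolower(). The sketch shows the shape of the extra call per character and does
not measure anything.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical method table; names are illustrative, not the patch's. */
typedef struct ctype_methods_sketch
{
    int         (*char_tolower) (int c);
} ctype_methods_sketch;

/* Provider-specific implementation that just forwards to libc. */
static int
libc_tolower_sketch(int c)
{
    return tolower(c);
}

const ctype_methods_sketch libc_methods_sketch = {
    libc_tolower_sketch
};

/* Direct shape: the call site invokes libc's tolower() itself. */
void
lower_direct(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char) tolower(buf[i]);
}

/*
 * Method-table shape: every character goes through a function pointer
 * first, which the compiler generally cannot inline.
 */
void
lower_via_methods(const ctype_methods_sketch *methods,
                  unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char) methods->char_tolower(buf[i]);
}
```

Both functions produce the same output for the same locale; the difference is
only the extra indirect call per character in the second one.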

This is a bit unfortunate because the method table feels like the right
code organization. Having special cases at the call sites (aside from
ctype_is_c) is not great. Are the above numbers bad enough that we need
to give up on this method-ization approach? Or should we say that the
above cases don't represent reality, and a moderate regression there is
OK?

Or perhaps someone has an idea how to mitigate the regression? I could
imagine another cache of character properties, like an extensible
pg_char_properties. I'm not sure if the extra complexity is worth it,
though.
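
As a rough illustration of what such a cache might look like for the
single-byte case, here is a sketch. Everything in it is an assumption:
sb_case_cache_sketch and its functions are hypothetical names, and it covers
only lowercasing for a 256-value encoding, not a general character-properties
cache. The idea is to pay the method call once per byte value when the cache
is built, so the per-character path becomes an array load.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical per-locale cache for a single-byte encoding. */
typedef struct sb_case_cache_sketch
{
    bool            valid;
    unsigned char   lower[256];
} sb_case_cache_sketch;

/*
 * Fill the cache by calling the (possibly indirect) lowercase method
 * once for each of the 256 possible byte values.
 */
void
sb_case_cache_build(sb_case_cache_sketch *cache, int (*char_tolower) (int))
{
    for (int c = 0; c < 256; c++)
        cache->lower[c] = (unsigned char) char_tolower(c);
    cache->valid = true;
}

/* Hot path: a table lookup with no function call per character. */
static inline unsigned char
sb_case_cache_lower(const sb_case_cache_sketch *cache, unsigned char c)
{
    return cache->lower[c];
}
```

A cache like this would have to be built per locale and kept alongside it,
which is part of the extra complexity in question.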

Regards,
Jeff Davis
