Re: Collation & ctype method table, and extension hooks

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Collation & ctype method table, and extension hooks
Date: 2024-11-19 21:32:47
Message-ID: 78a1b434ff40510dc5aaabe986299a09f4da90cf.camel@j-davis.com
Lists: pgsql-hackers

On Fri, 2024-11-01 at 14:08 +0100, Andreas Karlsson wrote:
> I think adding such a small file would make life easier for people new
> to the collation part of the code base. It would be a nice symmetry
> between collation providers and where code for them can be found.

Done.

> For me combining them would make the intention of the code easier to
> understand since aren't the casemap functions just a set of
> "ctype_methods"?

Done.

There is a bit of weirdness in libc because:

* Single byte encodings use the single-byte isupper(), toupper(), etc.
* UTF8 encoding uses wide character iswupper(), towupper(), etc.
* Non-UTF8 multibyte encodings use isupper() for pattern matching but
  towupper() for case mapping

That weirdness existed before, but it's a bit more obvious what's
happening now.
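
To make the three cases above concrete, here is a minimal sketch of the dispatch. It is not code from the patchset: the enc_class enum and the sketch_isupper()/sketch_toupper() names are hypothetical, and the real code works through pg_locale_t and the ctype method table rather than a bare enum.

```c
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical encoding classes; not PostgreSQL's actual types. */
typedef enum
{
    ENC_SINGLE_BYTE,
    ENC_UTF8,
    ENC_OTHER_MULTIBYTE
} enc_class;

/* Pattern matching: only UTF8 takes the wide-character path. */
static bool
sketch_isupper(enc_class enc, wint_t c)
{
    if (enc == ENC_UTF8)
        return iswupper(c) != 0;

    /*
     * Single-byte and non-UTF8 multibyte encodings both use the
     * byte-oriented test, which is only defined for values that fit in
     * an unsigned char.
     */
    return c <= (wint_t) UCHAR_MAX && isupper((unsigned char) c) != 0;
}

/* Case mapping: only single-byte encodings take the byte-oriented path. */
static wint_t
sketch_toupper(enc_class enc, wint_t c)
{
    if (enc == ENC_SINGLE_BYTE)
    {
        if (c <= (wint_t) UCHAR_MAX)
            return (wint_t) toupper((unsigned char) c);
        return c;
    }

    /* UTF8 and non-UTF8 multibyte encodings both use towupper(). */
    return towupper(c);
}
```

The asymmetry is visible in the two conditions: ENC_OTHER_MULTIBYTE falls on the isupper() side for pattern matching but on the towupper() side for case mapping.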

> > > This commit makes me tempted to handle the ctype_is_c logic for
> > > character classes also in callbacks and remove the if in functions
> > > like pg_wc_ispunct(). But this is something that would need to be
> > > benchmarked.

I like this idea, but it can be a follow-up.

Attached new patchset.

I also tried some performance tests again. I used smalltext (a table of
10M ~30-character strings) and bigtext (a table of 32768 rows, each
containing the 100KiB source of
https://en.wikipedia.org/wiki/Diacritic). I then ran the following
regex on each:

  select count(*) from thetable
    where t ~
      '[[:digit:]][[:space:]][[:punct:]][[:alpha:]][[:lower:]][[:upper:]]';

for "C", "en_US", and "en-US-x-icu". The timings for smalltext were
indistinguishable between master and the patched version. The timings
for bigtext were pretty noisy so it's hard to tell if there was a
regression or not, but I saw some evidence in the profile that
char_properties has a cost (~1%). I'm not sure if that's a significant
concern or not.

Which API do you think is the right one? Individual functions testing
individual properties, or something like char_properties() that can
test several at once?
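
For anyone following along, the two API shapes in question could look roughly like this. This is a sketch under assumptions, not the v8 code: the PROP_* bits, the struct names, and the (c, mask) signature of char_properties are all made up for illustration, and the implementations just wrap the plain <ctype.h> functions (so c must be representable as unsigned char).

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical property bits; the names are illustrative only. */
#define PROP_DIGIT (1 << 0)
#define PROP_ALPHA (1 << 1)
#define PROP_LOWER (1 << 2)
#define PROP_UPPER (1 << 3)

/* Shape A: one callback per property. */
typedef struct
{
    bool        (*isdigit_cb) (int c);
    bool        (*isalpha_cb) (int c);
    bool        (*islower_cb) (int c);
    bool        (*isupper_cb) (int c);
} per_property_methods;

/* Shape B: one callback that tests every property named in 'mask'. */
typedef struct
{
    int         (*char_properties) (int c, int mask);
} combined_methods;

static bool a_isdigit(int c) { return isdigit(c) != 0; }
static bool a_isalpha(int c) { return isalpha(c) != 0; }
static bool a_islower(int c) { return islower(c) != 0; }
static bool a_isupper(int c) { return isupper(c) != 0; }

/* Returns the subset of 'mask' whose properties hold for 'c'. */
static int
b_char_properties(int c, int mask)
{
    int         result = 0;

    if ((mask & PROP_DIGIT) && isdigit(c))
        result |= PROP_DIGIT;
    if ((mask & PROP_ALPHA) && isalpha(c))
        result |= PROP_ALPHA;
    if ((mask & PROP_LOWER) && islower(c))
        result |= PROP_LOWER;
    if ((mask & PROP_UPPER) && isupper(c))
        result |= PROP_UPPER;
    return result;
}

static const per_property_methods shape_a = {
    a_isdigit, a_isalpha, a_islower, a_isupper
};

static const combined_methods shape_b = {b_char_properties};
```

With shape A a caller pays one indirect call per property but each callback is trivial; with shape B a caller can ask for several properties in one call, at the price of the mask checks inside the callback.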

Regards,
Jeff Davis

Attachment                                                        Content-Type  Size
v8-0001-Perform-provider-specific-initialization-code-in-.patch  text/x-patch  18.5 KB
v8-0002-Control-collation-behavior-with-a-method-table.patch     text/x-patch  18.9 KB
v8-0003-Control-ctype-behavior-internally-with-a-method-t.patch  text/x-patch  63.2 KB
v8-0004-Remove-provider-field-from-pg_locale_t.patch             text/x-patch  4.8 KB
v8-0005-Make-provider-data-in-pg_locale_t-an-opaque-point.patch  text/x-patch  21.5 KB
v8-0006-Don-t-include-ICU-headers-in-pg_locale.h.patch           text/x-patch  3.7 KB
v8-0007-Introduce-hooks-for-creating-custom-pg_locale_t.patch    text/x-patch  6.5 KB
