From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Collation & ctype method table, and extension hooks |
Date: | 2024-11-19 21:32:47 |
Message-ID: | 78a1b434ff40510dc5aaabe986299a09f4da90cf.camel@j-davis.com |
Lists: | pgsql-hackers |
On Fri, 2024-11-01 at 14:08 +0100, Andreas Karlsson wrote:
> I think adding such a small file would make life easier for people new
> to the collation part of the code base. It would be a nice symmetry
> between collation providers and where code for them can be found.
Done.
> For me combining them would make the intention of the code easier to
> understand since aren't the casemap functions just a set of
> "ctype_methods"?
Done.
There is a bit of weirdness in libc because:
* Single-byte encodings use the single-byte isupper(), toupper(), etc.
* UTF8 encoding uses the wide-character iswupper(), towupper(), etc.
* Non-UTF8 multibyte encodings use isupper() for pattern matching but
  towupper() for case mapping.
That weirdness existed before, but it's a bit more obvious what's
happening now.
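To make the three cases concrete, here is a minimal sketch of that dispatch. It is not the code from the patch: the sk_enc_kind enum and the sk_isupper()/sk_toupper() names are made up for illustration, and it calls the plain libc functions under the current locale rather than any locale-specific variants.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical encoding categories, named only for this sketch. */
typedef enum
{
    SK_ENC_SINGLE_BYTE,
    SK_ENC_UTF8,
    SK_ENC_OTHER_MULTIBYTE
} sk_enc_kind;

/* Classification, as used for pattern matching. */
bool
sk_isupper(sk_enc_kind kind, wint_t wc)
{
    switch (kind)
    {
        case SK_ENC_UTF8:
            /* wide-character classification */
            return iswupper(wc) != 0;
        case SK_ENC_SINGLE_BYTE:
        case SK_ENC_OTHER_MULTIBYTE:
            /* single-byte classification, only for byte-sized values */
            return wc <= (wint_t) UCHAR_MAX &&
                isupper((unsigned char) wc) != 0;
    }
    assert(!"unknown sk_enc_kind");
    return false;
}

/* Case mapping: note that the non-UTF8 multibyte case switches sides. */
wint_t
sk_toupper(sk_enc_kind kind, wint_t wc)
{
    switch (kind)
    {
        case SK_ENC_SINGLE_BYTE:
            if (wc <= (wint_t) UCHAR_MAX)
                return (wint_t) toupper((unsigned char) wc);
            return wc;
        case SK_ENC_UTF8:
        case SK_ENC_OTHER_MULTIBYTE:
            /* wide-character case mapping */
            return towupper(wc);
    }
    assert(!"unknown sk_enc_kind");
    return wc;
}
```

The asymmetry is visible in the SK_ENC_OTHER_MULTIBYTE case, which sits with the single-byte branch in sk_isupper() but with the wide-character branch in sk_toupper().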
> > > This commit makes me tempted to handle the ctype_is_c logic for
> > > character classes also in callbacks and remove the if in functions
> > > like pg_wc_ispunct(). But this is something that would need to be
> > > benchmarked.
I like this idea, but it can be a follow-up.
Attached new patchset.
I also tried some performance tests again. I used smalltext (a table of
10M ~30-character strings) and bigtext (a table of 32768 rows, each
containing the 100KiB source of https://en.wikipedia.org/wiki/Diacritic).
I then ran the following regex on each:
  select count(*) from thetable
    where t ~
    '[[:digit:]][[:space:]][[:punct:]][[:alpha:]][[:lower:]][[:upper:]]';
for "C", "en_US", and "en-US-x-icu". The timings for smalltext were
indistinguishable between master and the patched version. The timings
for bigtext were pretty noisy so it's hard to tell if there was a
regression or not, but I saw some evidence in the profile that
char_properties has a cost (~1%). I'm not sure if that's a significant
concern or not.
Which API do you think is the right one? Individual functions testing
individual properties, or something like char_properties() that can
test several at once?
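For comparison, the two API shapes could look roughly like the sketch below. This is only an illustration of the question: the SK_PROP_* bits, the sk_ctype_methods struct, and the (wc, mask) signature for char_properties are assumptions made here, not the definitions in the attached patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <wctype.h>

/* Hypothetical property bits, named only for this sketch. */
#define SK_PROP_DIGIT (1 << 0)
#define SK_PROP_ALPHA (1 << 1)
#define SK_PROP_LOWER (1 << 2)
#define SK_PROP_UPPER (1 << 3)
#define SK_PROP_SPACE (1 << 4)
#define SK_PROP_PUNCT (1 << 5)

/* Style A: one function per property. */
bool sk_wc_isdigit(wint_t wc) { return iswdigit(wc) != 0; }
bool sk_wc_isalpha(wint_t wc) { return iswalpha(wc) != 0; }

/* Style B: one function that tests every property requested in mask
 * and returns the subset of those bits that hold for wc. */
int
sk_char_properties(wint_t wc, int mask)
{
    int result = 0;

    if ((mask & SK_PROP_DIGIT) && iswdigit(wc))
        result |= SK_PROP_DIGIT;
    if ((mask & SK_PROP_ALPHA) && iswalpha(wc))
        result |= SK_PROP_ALPHA;
    if ((mask & SK_PROP_LOWER) && iswlower(wc))
        result |= SK_PROP_LOWER;
    if ((mask & SK_PROP_UPPER) && iswupper(wc))
        result |= SK_PROP_UPPER;
    if ((mask & SK_PROP_SPACE) && iswspace(wc))
        result |= SK_PROP_SPACE;
    if ((mask & SK_PROP_PUNCT) && iswpunct(wc))
        result |= SK_PROP_PUNCT;
    return result;
}

/* A method table could expose either style, or both as shown here. */
typedef struct sk_ctype_methods
{
    bool (*wc_isdigit) (wint_t wc);
    bool (*wc_isalpha) (wint_t wc);
    int  (*char_properties) (wint_t wc, int mask);
} sk_ctype_methods;

const sk_ctype_methods sk_libc_methods = {
    sk_wc_isdigit,
    sk_wc_isalpha,
    sk_char_properties
};
```

In this sketch, style B answers several questions with one indirect call but checks the mask bit by bit even when the caller wants a single property, while style A needs one table slot and one call per property. Which of the two is cheaper in practice would have to be measured.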
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
v8-0001-Perform-provider-specific-initialization-code-in-.patch | text/x-patch | 18.5 KB |
v8-0002-Control-collation-behavior-with-a-method-table.patch | text/x-patch | 18.9 KB |
v8-0003-Control-ctype-behavior-internally-with-a-method-t.patch | text/x-patch | 63.2 KB |
v8-0004-Remove-provider-field-from-pg_locale_t.patch | text/x-patch | 4.8 KB |
v8-0005-Make-provider-data-in-pg_locale_t-an-opaque-point.patch | text/x-patch | 21.5 KB |
v8-0006-Don-t-include-ICU-headers-in-pg_locale.h.patch | text/x-patch | 3.7 KB |
v8-0007-Introduce-hooks-for-creating-custom-pg_locale_t.patch | text/x-patch | 6.5 KB |