From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2023-12-13 13:28:30
Message-ID: b3e445d3958c4d71bd3027f2ba6423c36b90353d.camel@j-davis.com
Lists: pgsql-hackers
On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
> My biggest concern is around maintenance. Every year Unicode is
> assigning new characters to existing code points, and those existing
> code points can of course already be stored in old databases before
> libs are updated.
Is the concern only about unassigned code points?
I already committed a function "unicode_assigned()" to test whether a
string contains only assigned code points, which can be used in a
CHECK() constraint. I also posted[5] an idea about a per-database
option that could reject the storage of any unassigned code point,
which would make it easier for users highly concerned about
compatibility.
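For illustration only, the semantics of such a check can be sketched in Python with the standard unicodedata module. The committed unicode_assigned() is a C function inside the server; this is just a Python analogy of what "contains only assigned code points" means:

```python
import unicodedata

def unicode_assigned(s: str) -> bool:
    """Return True if every code point in s is assigned in the Unicode
    version this runtime ships. Unassigned code points have general
    category 'Cn'."""
    return all(unicodedata.category(ch) != 'Cn' for ch in s)

print(unicode_assigned("héllo"))      # True: all assigned
print(unicode_assigned("a\u0378b"))   # False: U+0378 is unassigned
```

In SQL terms, a constraint like CHECK (unicode_assigned(col)) would then reject rows containing code points the current tables do not know about.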
> And we may end up with
> something like the timezone database where we need to periodically
> add a more current ruleset - albeit alongside as a new version in
> this case.
There's a build target "update-unicode" which is run to pull in new
Unicode data files and parse them into static C arrays (we already do
this for the Unicode normalization tables). So I agree that the tables
should be updated but I don't understand why that's a problem.
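As a hedged miniature of what such a generation step does: UnicodeData.txt is a semicolon-separated file whose first field is the code point and whose third field is the general category, and a script can turn those records into a static C array. The sample records and the C type name pg_unicode_category below are illustrative, not the actual update-unicode output:

```python
# Two real records from UnicodeData.txt stand in for the full file.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""

def parse_records(text):
    # Field 0 is the code point (hex), field 2 the general category.
    for line in text.splitlines():
        if line.strip():
            fields = line.split(';')
            yield int(fields[0], 16), fields[2]

def emit_c_array(records):
    # Emit a static table; the struct/type name here is hypothetical.
    entries = ",\n".join(f'\t{{0x{cp:04X}, "{cat}"}}' for cp, cat in records)
    return ("static const pg_unicode_category unicode_categories[] = {\n"
            + entries + "\n};\n")

print(emit_c_array(parse_records(SAMPLE)))
```

Re-running such a script against newer data files is all "update-unicode" amounts to; the regenerated arrays then ship with the next major release.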
> If I'm reading the Unicode 15 update correctly, PostgreSQL regex
> expressions with [[:digit:]] will not correctly identify Kaktovik or
> Nag Mundari or Kawi digits without that update to character type
> specs.
Yeah, if we are behind in the Unicode version, then results won't be
the most up-to-date. But ICU or libc could also be behind in the
Unicode version.
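The same dependence is easy to observe in any Unicode-aware runtime: a digit class only recognizes what its built-in tables know about, analogous to Postgres's [[:digit:]] depending on the tables it was built with. A Python sketch (not the server's regex engine):

```python
import re
import unicodedata

def is_digit_char(ch: str) -> bool:
    # Both checks follow the Unicode version baked into this runtime.
    # 'Nd' is the decimal-digit general category; Python's \d matches it.
    return (unicodedata.category(ch) == 'Nd'
            and re.fullmatch(r'\d', ch) is not None)

print(is_digit_char('7'))            # an ASCII digit: recognized everywhere
# NAG MUNDARI DIGIT ZERO (U+1E4F0, added in Unicode 15.0) is recognized
# only if this runtime ships Unicode 15.0 or newer tables:
print(is_digit_char('\U0001E4F0'))
```

So the question is not whether results can lag Unicode, but which component's lag you accept: the built-in tables', ICU's, or libc's.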
> But let's remember that people like to build indexes on character
> classification functions like upper/lower, for case insensitive
> searching.
UPPER()/LOWER() are based on case mapping, not character
classification.
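The distinction is worth making concrete; a Python sketch of the two different questions these facilities answer:

```python
import unicodedata

# Case mapping answers "what is the lowercase (or uppercase) form of
# this character?" -- the operation UPPER()/LOWER() perform.
print('A'.lower())                # 'a'

# Character classification answers "what kind of character is this?" --
# the operation regex classes like [[:digit:]] or [[:alpha:]] perform.
print(unicodedata.category('A'))  # 'Lu' (uppercase letter)
print(unicodedata.category('5'))  # 'Nd' (decimal digit)
```

An index on LOWER(col) therefore depends on the case-mapping tables, not on the classification data that [[:digit:]] consults.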
I intend to introduce a SQL-level CASEFOLD() function that would obey
Unicode casefolding rules, which have very strong compatibility
guarantees[6] (essentially, if you are only using assigned code points,
you are fine).
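Python's str.casefold() implements the same Unicode case folding, so it can sketch how comparisons through the proposed CASEFOLD() would behave (the SQL function itself does not exist yet; this is only an analogy):

```python
def casefold_equal(a: str, b: str) -> bool:
    # Unicode case folding is a one-way mapping designed for caseless
    # matching, with strong stability guarantees for assigned code
    # points; unlike lowercasing, it also folds e.g. ß -> ss.
    return a.casefold() == b.casefold()

print(casefold_equal('Straße', 'STRASSE'))  # True
print('ß'.lower())                          # 'ß' -- lower() leaves it as-is
```

An index built on such a folded form relies on those stability guarantees rather than on locale-specific lowercasing behavior.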
> It's another case where the index will be corrupted if
> someone happened to store Latin Glottal vowels in their database and
> then we update libs to the latest character type rules.
I don't agree with this characterization at all.
(a) It's not "another case". Corruption of an index on LOWER() can
happen today. My proposal makes the situation better, not worse.
(b) These aren't libraries, I am proposing built-in Unicode tables
that only get updated in a new major PG version.
(c) It likely affects only a small number of indexes, and an
administrator can more easily guess which ones might be affected and
rebuild just those.
(d) It's not a problem if you stick to assigned code points.
> So even with something as basic as character type, if we're going to
> do it right, we still need to either version it or definitively
> decide that we're not going to ever support newly added Unicode
> characters like Latin Glottals.
If, by "version it", you mean "update the data tables in new Postgres
versions", then I agree. If you mean that one PG version would need to
support many versions of Unicode, I don't agree.
Regards,
Jeff Davis
[5] https://postgr.es/m/c5e9dac884332824e0797937518da0b8766c1238.camel@j-davis.com
[6] https://www.unicode.org/policies/stability_policy.html#Case_Folding