pgsql: Add simple codepoint redirections to unaccent.rules.

From: Thomas Munro <tmunro(at)postgresql(dot)org>
To: pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: pgsql: Add simple codepoint redirections to unaccent.rules.
Date: 2024-07-05 03:26:58
Message-ID: E1sPZb6-000Mjd-Mp@gemulon.postgresql.org
Lists: pgsql-committers

Add simple codepoint redirections to unaccent.rules.

Previously we searched for codepoints where the Unicode data file
listed an equivalent combining character sequence that added accents.
Some codepoints redirect to a single other codepoint, instead of doing
any combining. We can follow those references recursively to get the
answer.

Per bug report #18362, which reported missing Ancient Greek characters.
Specifically, precomposed characters with oxia (from the polytonic
accent system used for old Greek) just point to precomposed characters
with tonos (from the monotonic accent system for modern Greek), and we
have to follow the extra hop to find out that they are composed with
an acute accent.
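
The extra hop can be illustrated with Python's standard unicodedata
module. This is only a minimal sketch of the idea, not the code in
generate_unaccent_rules.py: the helper names parse_decomposition and
base_letter are hypothetical, and the sketch follows every
single-codepoint decomposition without the filtering a real rule
generator would likely need.

```python
import unicodedata


def parse_decomposition(ch):
    """Split the decomposition field of `ch` into (tag, [characters]).

    The field looks like "03B1 0301" or "<font> 0041"; it is empty
    when the character has no decomposition.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith("<"):
        tag = fields[0]
        fields = fields[1:]
    return tag, [chr(int(f, 16)) for f in fields]


def base_letter(ch):
    """Hypothetical helper: strip accents by following decompositions."""
    _tag, parts = parse_decomposition(ch)
    if not parts:
        # No decomposition: this is already a plain character.
        return ch
    if len(parts) == 1:
        # Simple redirection to a single other codepoint: follow it.
        return base_letter(parts[0])
    if all(unicodedata.combining(p) != 0 for p in parts[1:]):
        # Base character followed only by combining marks: keep the base.
        return base_letter(parts[0])
    # Anything else is left untouched in this sketch.
    return ch


# U+1F71 (alpha with oxia) redirects to U+03AC (alpha with tonos),
# which in turn decomposes to U+03B1 plus U+0301 (combining acute).
print(unicodedata.decomposition("\u1f71"))  # 03AC
print(unicodedata.decomposition("\u03ac"))  # 03B1 0301
print(base_letter("\u1f71") == "\u03b1")    # True
```

Without the single-codepoint branch, U+1F71 would be skipped, because
its own entry lists no combining sequence at all.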

Besides those, the new rule also:

* pulls in a lot of 'Mathematical Alphanumeric Symbols', which are
copies of the Latin and Greek alphabets and numbers rendered
in different typefaces, and

* corrects a single mathematical letter that previously came from the
CLDR transliteration file but is now extracted from the main Unicode
database file; the latter is clearly right and the former is wrong
(reported to CLDR).

Reported-by: Cees van Zeeland <cees(dot)van(dot)zeeland(at)freedom(dot)nl>
Reviewed-by: Robert Haas <robertmhaas(at)gmail(dot)com>
Reviewed-by: Peter Eisentraut <peter(at)eisentraut(dot)org>
Reviewed-by: Michael Paquier <michael(at)paquier(dot)xyz>
Discussion: https://postgr.es/m/18362-be6d0cfe122b6354%40postgresql.org

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/18501841bcb4e693b9f1e9da2b2fb524c78940d8

Modified Files
--------------
contrib/unaccent/expected/unaccent.out      |    2 +-
contrib/unaccent/generate_unaccent_rules.py |   19 +-
contrib/unaccent/unaccent.rules             | 1013 ++++++++++++++++++++++++++-
3 files changed, 1025 insertions(+), 9 deletions(-)
