From: | "Peter J(dot) Holzer" <hjp-pgsql(at)hjp(dot)at> |
---|---|
To: | pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Re: support for DIN SPEC 91379 encoding |
Date: | 2022-03-27 23:34:16 |
Message-ID: | 20220327233416.l2zncgxa4rz3zvlx@hjp.at |
Lists: | pgsql-general |
On 2022-03-27 14:06:25 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> > On 2022-Mar-27, Ralf Schuchardt wrote:
> >> linked here https://www.xoev.de/downloads-2316#StringLatin it is said,
> >> that the spec is a strict subset of unicode (E.1.6), and it is also
> >> mentioned in E.1.4, that in UTF-8 all unicode characters can be
> >> encoded. Therefore UTF-8 can be used to encode all DIN SPEC 91379
> >> characters.
>
> > So the remaining question is whether DIN SPEC 91379 requires an
> > implementation to support character U+0000. If it does, then PostgreSQL
> > is not conformant, because that character is the only one in Unicode
> > that we don't support. If U+0000 is not required, then PostgreSQL is
> > okay.
>
> Hmm ... UTF8 as defined in RFC3629/STD63 [1] does not allow "all unicode
> characters to be encoded". It disallows surrogate pairs (U+D800--U+DFFF)
> and code points above U+10FFFF.
From section 2.4 Code Points and Characters of the Unicode Standard,
Version 14.0 - Core Specification:
| In the Unicode Standard, the codespace consists of the integers from 0
| to 10FFFF₁₆, comprising 1,114,112 code points available for
| assigning the repertoire of abstract characters.
So there are no characters above U+10FFFF.
Also,
| Not all assigned code points represent abstract characters; only
| Graphic, Format, Control and Private-use do. Surrogates and
| Noncharacters are assigned code points but are not assigned to
| abstract characters.
So Surrogates aren't characters either.
UTF-8 can indeed be used to encode "all unicode characters".
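
For anyone who wants to check this, here is a rough sketch in Python (not PostgreSQL code; the helper names are made up for illustration). It assumes that Python's built-in "utf-8" codec follows RFC 3629 in refusing surrogates, walks the whole codespace, and collects the code points that cannot be encoded:

```python
# Hypothetical helpers for illustration only; nothing here is PostgreSQL's code.
MAX_CODE_POINT = 0x10FFFF  # upper end of the Unicode codespace


def is_surrogate(cp: int) -> bool:
    """True if cp lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def utf8_encodable(cp: int) -> bool:
    """True if Python's strict UTF-8 codec accepts the code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Number of code points in the codespace (0 through 10FFFF inclusive).
codespace_size = MAX_CODE_POINT + 1

# Every code point that the codec refuses to encode.
unencodable = [cp for cp in range(codespace_size) if not utf8_encodable(cp)]

# Check that the refused code points are exactly the surrogates.
only_surrogates_fail = all(is_surrogate(cp) for cp in unencodable)

print(codespace_size, len(unencodable), only_surrogates_fail)
```

Under that assumption, the only code points rejected are the surrogates, which are not characters in the first place. Note that U+0000 encodes without complaint here; that restriction belongs to PostgreSQL, not to UTF-8.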
> We follow that spec, so depending on what DIN 91379 *actually* says,
> we might have additional reasons not to be in compliance. I don't
> read German unfortunately.
It defines a minimal character set that IT systems which process personal
and company names in the EU must accept. Basically Latin, Greek and
Cyrillic letters, digits, and some symbols and punctuation.
hp
--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp(at)hjp(dot)at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"