From: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | james(at)360data(dot)ca |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: How well does PostgreSQL 9.6.1 support unicode? |
Date: | 2016-12-21 07:56:37 |
Message-ID: | 20161221.165637.246733544.horiguchi.kyotaro@lab.ntt.co.jp |
Lists: | pgsql-general |
Hello,
At Tue, 20 Dec 2016 16:41:51 -0800, James Zhou <james(at)360data(dot)ca> wrote in <CAGuREpPHJmoHe_5+P25UCosRvqQpbhPF_0LGFbJ+xYgUKndydg(at)mail(dot)gmail(dot)com>
> Unicode has evolved from version 1.0 with 7,161 characters released in 1991
> to version 9.0 with 128,172 characters released in June 2016. My questions
> are
> - which version of Unicode is supported by PostgreSQL 9.6.1?
> - what does "supported" exactly mean? simply store it? comparison? sorting?
> substring? etc.
...
> /* characters from BMP, 0000 - FFFF */
> insert into unicode(id, string) values(1, U&'\0041'); -- 'A'
...
> insert into unicode(id, string) values(5, U&'\6211\4EEC'); -- a string of two Chinese characters
These shouldn't be a problem.
> /* Below are unicode characters with code points beyond FFFF, aka planes 1 - F */
> insert into unicode(id, string) values(100, U&'\1F478'); -- a mojo character, https://unicodelookup.com/#0x1f478/1
https://www.postgresql.org/docs/9.6/static/sql-syntax-lexical.html
> Unicode characters can be specified in escaped form by writing a
> backslash followed by the four-digit hexadecimal code point
> number or alternatively a backslash followed by a plus sign
> followed by a six-digit hexadecimal code point number.
So this is parsed as U+1F47 followed by the character '8', as you
have seen. It should be written as follows; the '+' is needed
just after the backslash.
insert into unicode(id, string) values(100, U&'\+01F478');
The six-digit form accepts code points up to U+10FFFF, so the
whole Unicode code space is usable.
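A minimal sketch of the difference between the two escape forms, assuming a database with encoding=UTF8 and standard_conforming_strings = on (the default). The column aliases are only for illustration, and the values in the comments are what I would expect rather than output copied from a session:

```sql
-- Four-digit form: only \1F47 is taken as the escape, '8' is a literal character.
SELECT char_length(U&'\1F478')  AS chars,       -- 2 (U+1F47 and '8')
       octet_length(U&'\1F478') AS bytes,       -- 4 (3 bytes + 1 byte in UTF-8)
       ascii(U&'\1F478')        AS first_code;  -- 8007 = 0x1F47

-- Six-digit form: \+01F478 is one escape for the single code point U+1F478.
SELECT char_length(U&'\+01F478')  AS chars,       -- 1
       octet_length(U&'\+01F478') AS bytes,       -- 4 (one 4-byte UTF-8 sequence)
       ascii(U&'\+01F478')        AS first_code;  -- 128120 = 0x1F478
```

Both strings occupy four bytes, so the byte length alone does not reveal the mistake; the character length and the first code point do.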
> Observations
>
> - BMP characters (id <= 10)
> - they are stored and fetched correctly.
> - their lengths in char are correct, although some of them take 3
> bytes (id = 4, 6, 7)
> - *But their sorting order seems to be undefined. Can anyone comment
> the sorting rules?*
> - Non-BMP characters (id >= 100)
> - they take 2 - 4 bytes.
> - Their lengths in character are not correct
> - they are not retrieved correctly, judged by the their fetched ascii
> value (column 5 in the table above)
> - substring is not correct
>
> Specifically, the lack of support for emojo characters 0x1F478, 0x1F479 is
> causing a problem in my application.
Using the '+' form of the escape should resolve this problem.
> My conclusion:
> - PostgreSQL 9.6.1 only supports a subset of unicode characters in BMP. Is
> there any documents defining which subset is fully supported?
A PostgreSQL database with encoding=UTF8 simply accepts the whole
range of Unicode code points, regardless of whether a character
is defined for a given code point.
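As a rough illustration, assuming a UTF8 database: U+1F47 (the code point your four-digit escape accidentally produced) is, as far as I know, not assigned to any character in Unicode, yet the server takes it without complaint. The same goes for the last code point, U+10FFFF.

```sql
-- An unassigned code point is still accepted and counted as one character.
SELECT char_length(U&'\1F47')  AS chars,
       octet_length(U&'\1F47') AS bytes;

-- The upper end of the code space, written with the six-digit form.
SELECT char_length(U&'\+10FFFF')  AS chars,
       octet_length(U&'\+10FFFF') AS bytes;
```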
> Are any configuration I can change so that more unicode characters are
> supported?
As for sorting, that is described in Tom's mail.
--
Kyotaro Horiguchi
NTT Open Source Software Center