From: | Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Update list of combining characters |
Date: | 2019-06-04 20:58:46 |
Message-ID: | bbb19114-af1e-513b-08a9-61272794bd5c@2ndquadrant.com |
Lists: | pgsql-hackers |
In src/backend/utils/mb/wchar.c, function ucs_wcwidth(), there is a list
of Unicode combining characters, so that those can be ignored for
computing the display length of a Unicode string. It seems to me that
that list is either outdated or plain incorrect.
For example, the list starts with
{0x0300, 0x034E}, {0x0360, 0x0362}, {0x0483, 0x0486},
Let's look at the characters around the first "gap":
(https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt)
034C;COMBINING ALMOST EQUAL TO ABOVE;Mn;230;NSM;;;;;N;;;;;
034D;COMBINING LEFT RIGHT ARROW BELOW;Mn;220;NSM;;;;;N;;;;;
034E;COMBINING UPWARDS ARROW BELOW;Mn;220;NSM;;;;;N;;;;;
034F;COMBINING GRAPHEME JOINER;Mn;0;NSM;;;;;N;;;;;
0350;COMBINING RIGHT ARROWHEAD ABOVE;Mn;230;NSM;;;;;N;;;;;
0351;COMBINING LEFT HALF RING ABOVE;Mn;230;NSM;;;;;N;;;;;
These are all in the "Mn" (nonspacing mark) category, so they should all
be treated the same here. Indeed, psql doesn't compute the width of some
of them correctly:
postgres=> select u&'|oo\034Coo|';
+----------+
| ?column? |
+----------+
| |oXoo|   |
+----------+
postgres=> select u&'|oo\0350oo|';
+----------+
| ?column? |
+----------+
| |oXoo|  |
+----------+
(I have replaced the combined character with X above so that the mail
client rendering doesn't add another layer of uncertainty to this issue.
The point is that the box is off in the second example.)
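To illustrate why the second example comes out misaligned, here is a
minimal sketch of an interval-table lookup in the style of ucs_wcwidth().
It is not the code from wchar.c: the table holds only the three ranges
quoted above, the names COMBINING, is_combining, and display_width are
made up for illustration, and anything not in the table is simply counted
as one column wide.

```python
from bisect import bisect_right

# Only the first three ranges quoted from the existing list; the real
# table in wchar.c is much longer.
COMBINING = [(0x0300, 0x034E), (0x0360, 0x0362), (0x0483, 0x0486)]
_STARTS = [first for first, _ in COMBINING]


def is_combining(cp: int) -> bool:
    """Binary-search the sorted, non-overlapping ranges for code point cp."""
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and COMBINING[i][0] <= cp <= COMBINING[i][1]


def display_width(s: str) -> int:
    """Simplified assumption: combining characters count 0, all others 1."""
    return sum(0 if is_combining(ord(ch)) else 1 for ch in s)


print(display_width("|oo\u034Coo|"))  # U+034C lies inside the first range
print(display_width("|oo\u0350oo|"))  # U+0350 falls into the gap
```

With this table the first string measures 6 columns and the second 7,
although both occupy 6 columns on screen, which matches the one-column
shift in the second box.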
AFAICT, these Unicode definitions haven't changed since that list was
put in originally around 2006, so I wonder what's going on there.
I have written a script that recomputes that list from the current
Unicode data. Patch and script are attached. With the patch, the cases
above all render correctly. (This should eventually get better build
system integration.)
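The attached script isn't reproduced here, but the general idea can be
sketched roughly as follows. This is not gen-combining.pl: it assumes
that "combining" means general category Mn only, that the input lines are
sorted by code point as in UnicodeData.txt, and it reads them from a list
rather than from the file. The function name combining_ranges is invented
for illustration.

```python
def combining_ranges(lines, categories=("Mn",)):
    """Collapse code points in the given categories into inclusive ranges."""
    ranges = []
    for line in lines:
        fields = line.strip().split(";")
        # Field 0 is the code point, field 2 the general category.
        if len(fields) < 3 or fields[2] not in categories:
            continue
        cp = int(fields[0], 16)
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp        # extend the current run
        else:
            ranges.append([cp, cp])   # start a new run
    return [(first, last) for first, last in ranges]


# The six UnicodeData.txt lines quoted earlier in this mail.
sample = [
    "034C;COMBINING ALMOST EQUAL TO ABOVE;Mn;230;NSM;;;;;N;;;;;",
    "034D;COMBINING LEFT RIGHT ARROW BELOW;Mn;220;NSM;;;;;N;;;;;",
    "034E;COMBINING UPWARDS ARROW BELOW;Mn;220;NSM;;;;;N;;;;;",
    "034F;COMBINING GRAPHEME JOINER;Mn;0;NSM;;;;;N;;;;;",
    "0350;COMBINING RIGHT ARROWHEAD ABOVE;Mn;230;NSM;;;;;N;;;;;",
    "0351;COMBINING LEFT HALF RING ABOVE;Mn;230;NSM;;;;;N;;;;;",
]

for first, last in combining_ranges(sample):
    print("{0x%04X, 0x%04X}," % (first, last))
```

For the six sample lines this yields a single range covering 0x034C
through 0x0351, with no gap after 0x034E.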
Thoughts?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
gen-combining.pl | text/x-perl-script | 923 bytes |
0001-Update-list-of-combining-characters.patch | text/plain | 6.1 KB |