Re: Unicode grapheme clusters

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unicode grapheme clusters
Date: 2023-01-19 13:44:57
Message-ID: CAFj8pRDow1__QV9YbLp5DSCdeGR87hhAPphBE1NX7TWkRhFZ-Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 19, 2023 at 1:20, Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> Just my luck, I had to dig into a two-"character" emoji that came to me
> as part of a Google Calendar entry --- here it is:
>
> 👩🏼‍⚕️🩺
>
>                               libc
>     Unicode   UTF8            len
>     U+1F469   f0 9f 91 a9     2     woman
>     U+1F3FC   f0 9f 8f bc     2     emoji modifier fitzpatrick type-3 (skin tone)
>     U+200D    e2 80 8d        0     zero width joiner (ZWJ)
>     U+2695    e2 9a 95        1     staff with snake
>     U+FE0F    ef b8 8f        0     variation selector-16 (VS16) (previous character as emoji)
>     U+1FA7A   f0 9f a9 ba     2     stethoscope
>
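
The table above can be reproduced outside libc. This is only a minimal sketch in Python (not anything Postgres does); the helper name describe() is made up for illustration, and the Unicode names it prints come from Python's unicodedata module, so they may differ from the informal descriptions in the table:

```python
import unicodedata

# The emoji sequence from the message, written with escapes so it survives copy/paste.
s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

def describe(text):
    """Return (code point, UTF-8 bytes in hex, Unicode name) for each code point."""
    rows = []
    for ch in text:
        utf8 = ch.encode("utf-8")
        # The fallback name covers code points unknown to this Python's Unicode data.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", utf8.hex(" "), name))
    return rows

for cp, utf8, name in describe(s):
    print(f"{cp:<8} {utf8:<12} {name}")
```

The libc display width column is deliberately left out, because it depends on the platform's wcwidth() and locale.
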
> Now, in Debian 11 character apps like vi, I see:
>
> a woman(2) - a black box(2) - a staff with snake(1) - a stethoscope(2)
>
> Display widths are in parentheses. I also see '<200d>' in blue.
>
> In current Firefox, I see a woman with a stethoscope around her neck,
> and then a stethoscope. Copying the Unicode string above into a browser
> URL bar should show you the same thing, though it might be too small to
> see.
>
> For those looking for details on how these should be handled, see this
> for an explanation of grapheme clusters that use things like skin tone
> modifiers and zero-width joiners:
>
> https://tonsky.me/blog/emoji/
>
> These comments explain the confusion of the term character:
>
>
> https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme
>
> and I think this comment summarizes it well:
>
>
> https://github.com/kovidgoyal/kitty/issues/3998#issuecomment-914807237
>
>     This is by design. wcwidth() is utterly broken. Any terminal or
>     terminal application that uses it is also utterly broken. Forget
>     about emoji wcwidth() doesn't even work with combining characters,
>     zero width joiners, flags, and a whole bunch of other things.
>
> I decided to see how Postgres, without ICU, handles it:
>
>     show lc_ctype;
>       lc_ctype
>     -------------
>      en_US.UTF-8
>
>     select octet_length('👩🏼‍⚕️🩺');
>      octet_length
>     --------------
>                21
>
>     select character_length('👩🏼‍⚕️🩺');
>      character_length
>     ------------------
>                     6
>
> The octet_length() is verified as correct by counting the UTF8 bytes
> above. I think character_length() is correct if we consider the number
> of Unicode characters, display and non-display.
>
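
As a cross-check outside the database, the same two numbers fall out of any language whose strings are sequences of code points. A small sketch in Python, where the variable names simply mirror the SQL functions and are not a Postgres API:

```python
# The emoji sequence from the message, written with escapes.
s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"

# Number of bytes in the UTF-8 encoding, comparable to octet_length().
octet_length = len(s.encode("utf-8"))

# Number of Unicode code points, comparable to character_length() in a UTF8 database.
character_length = len(s)

print(octet_length, character_length)  # prints: 21 6
```
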
> I then started looking at how Postgres computes and uses _display_
> width. The display width, when processed properly as Firefox does, is 4
> (two double-wide displayed characters). Based on the libc display
> lengths above, which match the incorrect rendering in Debian 11, it
> would be 7.
>
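
The gap between 7 and 4 can be shown with a rough sketch in Python. Everything here is an assumption for illustration: the widths are hard-coded from the libc column of the table in the message, simple_clusters() is a deliberately simplified grouping rule (it is not the Unicode grapheme cluster algorithm and handles only the cases in this string), and the guess that each cluster occupies two columns only holds for emoji like these:

```python
ZWJ = "\u200D"   # zero width joiner
VS16 = "\uFE0F"  # variation selector-16

# Per-code-point widths copied from the libc column of the table in the message.
LIBC_WIDTH = {0x1F469: 2, 0x1F3FC: 2, 0x200D: 0, 0x2695: 1, 0xFE0F: 0, 0x1FA7A: 2}

def is_extender(ch):
    """Code points that attach to the preceding one in this simplified model."""
    return ch in (ZWJ, VS16) or 0x1F3FB <= ord(ch) <= 0x1F3FF  # skin tone modifiers

def simple_clusters(text):
    """Group code points into clusters using the simplified rule above."""
    clusters = []
    join_next = False  # True right after a ZWJ: the next code point joins the cluster
    for ch in text:
        if clusters and (join_next or is_extender(ch)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = (ch == ZWJ)
    return clusters

def per_code_point_width(text):
    """Sum of per-code-point widths, which is what a wcwidth()-style loop reports."""
    return sum(LIBC_WIDTH[ord(ch)] for ch in text)

def cluster_width(text, columns_per_cluster=2):
    """Width if every cluster is drawn as one double-wide glyph (an assumption)."""
    return columns_per_cluster * len(simple_clusters(text))

s = "\U0001F469\U0001F3FC\u200D\u2695\uFE0F\U0001FA7A"
print(per_code_point_width(s), cluster_width(s))  # prints: 7 4
```

A real implementation would need the full segmentation rules and width data from the Unicode standard, and would still depend on what the terminal actually renders.
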
> libpq has PQdsplen(), which calls pg_encoding_dsplen(), which then calls
> the per-encoding width function stored in pg_wchar_table.dsplen --- for
> UTF8, the function is pg_utf_dsplen().
>
> There is no SQL API for display length, but PQdsplen() can be exercised
> on a whole string by calling pg_wcswidth() from the gdb debugger:
>
> pg_wcswidth(const char *pwcs, size_t len, int encoding)
> UTF8 encoding == 6
>
> (gdb) print (int)pg_wcswidth("abcd", 4, 6)
> $8 = 4
> (gdb) print (int)pg_wcswidth("👩🏼‍⚕️🩺", 21, 6)
> $9 = 7
>
> Here is the psql output:
>
>     SELECT octet_length('👩🏼‍⚕️🩺'), '👩🏼‍⚕️🩺',
>            character_length('👩🏼‍⚕️🩺');
>      octet_length | ?column? | character_length
>     --------------+----------+------------------
>                21 | 👩🏼‍⚕️🩺 |                6
>
> More often called from psql are pg_wcssize() and pg_wcsformat(), which
> also call PQdsplen().
>
> I think the question is whether we want to report a string width that
> assumes the display doesn't understand the more complex UTF8
> controls/"characters" listed above.
>
> tsearch has p_isspecial(), which calls pg_dsplen(), which also uses
> pg_wchar_table.dsplen. p_isspecial() also has a small table of what it
> calls "strange_letter".
>
> Here is a report about Unicode variation selector and combining
> characters from May, 2022:
>
>
> https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp
>
> Is this something people want improved?
>

Surely it should be fixed. Unfortunately, none of the terminals that I can
use support it. So at this moment it may be premature to fix it, because
the visual form will still be broken.

Regards

Pavel

> --
> Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
> EDB https://enterprisedb.com
>
> Embrace your flaws. They make you human, rather than perfect,
> which you will never be.
>
>
>
