
Re: badly calculated width of emoji in psql

From:	John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To:	Jacob Champion <pchampion(at)vmware(dot)com>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "pavel(dot)stehule(at)gmail(dot)com" <pavel(dot)stehule(at)gmail(dot)com>, "laurenz(dot)albe(at)cybertec(dot)at" <laurenz(dot)albe(at)cybertec(dot)at>, "peter(dot)eisentraut(at)enterprisedb(dot)com" <peter(dot)eisentraut(at)enterprisedb(dot)com>, "michael(at)paquier(dot)xyz" <michael(at)paquier(dot)xyz>, "horikyota(dot)ntt(at)gmail(dot)com" <horikyota(dot)ntt(at)gmail(dot)com>
Subject:	Re: badly calculated width of emoji in psql
Date:	2021-08-25 20:15:34
Message-ID:	CAFBsxsH5ejH4-1xaTLpSK8vWoK1m6fA1JBtTM6jmBsLfmDki1g@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Aug 24, 2021 at 1:50 PM Jacob Champion <pchampion(at)vmware(dot)com> wrote:
>
> Does there need to be any sanity check for overlapping ranges between
> the combining and fullwidth sets? The Unicode data on a dev's machine
> would have to be broken somehow for that to happen, but it could
> potentially go undetected for a while if it did.

It turns out I should have done that to begin with. In the Unicode data, it
apparently happens that a character can be both combining and wide, and
that will cause ranges to overlap in my scheme:

302A..302D;W # Mn [4] IDEOGRAPHIC LEVEL TONE MARK..IDEOGRAPHIC ENTERING TONE MARK

{0x3000, 0x303E, 2},
{0x302A, 0x302D, 0},

3099..309A;W # Mn [2] COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK..COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

{0x3099, 0x309A, 0},
{0x3099, 0x30FF, 2},
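For illustration, here is a minimal sketch of the kind of sanity check Jacob asked about, assuming the combining and wide ranges are kept in two separate tables, each sorted by start code point and free of internal overlaps. The type and function names (range_sketch, tables_overlap_sketch) are hypothetical and not taken from any patch in this thread; the table contents are just the four ranges quoted above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range type, for illustration only. */
typedef struct
{
	uint32_t	first;
	uint32_t	last;
} range_sketch;

/*
 * Return true if any range in a[] shares a code point with any range
 * in b[].  Assumes each table is sorted by 'first' and has no overlaps
 * within itself, so a single merge-style pass is enough.
 */
static bool
tables_overlap_sketch(const range_sketch *a, size_t na,
					  const range_sketch *b, size_t nb)
{
	size_t		i = 0;
	size_t		j = 0;

	while (i < na && j < nb)
	{
		if (a[i].last < b[j].first)
			i++;				/* a[i] lies entirely before b[j] */
		else if (b[j].last < a[i].first)
			j++;				/* b[j] lies entirely before a[i] */
		else
			return true;		/* the two ranges intersect */
	}
	return false;
}

/* The ranges quoted above, split by property. */
static const range_sketch combining_ranges_sketch[] = {
	{0x302A, 0x302D},
	{0x3099, 0x309A},
};

static const range_sketch wide_ranges_sketch[] = {
	{0x3000, 0x303E},
	{0x3099, 0x30FF},
};
```

With these tables the check reports an overlap, which is the condition that would otherwise have gone unnoticed in a single merged table.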

Going by the above, Jacob's patch from July 21 happened to be correct only
because the combining character search ran first.

It seems the logical thing to do is revert my 0001 and 0002 and go back to
something much closer to Jacob's patch, plus a big comment explaining that
the order in which the searches happen matters.
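To make the ordering point concrete, here is a simplified sketch of a two-table lookup. It is not the code from Jacob's patch or from PostgreSQL; the names (cp_range_sketch, in_table_sketch, width_sketch) are hypothetical, the tables hold only the ranges quoted above, and special cases such as control characters are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range type, for illustration only. */
typedef struct
{
	uint32_t	first;
	uint32_t	last;
} cp_range_sketch;

static const cp_range_sketch combining_sketch[] = {
	{0x302A, 0x302D},
	{0x3099, 0x309A},
};

static const cp_range_sketch wide_sketch[] = {
	{0x3000, 0x303E},
	{0x3099, 0x30FF},
};

/* Binary search in a table sorted by 'first' with no internal overlaps. */
static bool
in_table_sketch(uint32_t ucs, const cp_range_sketch *table, size_t n)
{
	size_t		lo = 0;
	size_t		hi = n;

	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (ucs < table[mid].first)
			hi = mid;
		else if (ucs > table[mid].last)
			lo = mid + 1;
		else
			return true;
	}
	return false;
}

static int
width_sketch(uint32_t ucs)
{
	/*
	 * The order of these searches matters: some code points (for
	 * example U+302A..U+302D) are both combining and wide.  Checking
	 * the combining table first gives them width 0.
	 */
	if (in_table_sketch(ucs, combining_sketch,
						sizeof(combining_sketch) / sizeof(combining_sketch[0])))
		return 0;

	if (in_table_sketch(ucs, wide_sketch,
						sizeof(wide_sketch) / sizeof(wide_sketch[0])))
		return 2;

	return 1;
}
```

Swapping the two searches would make U+302A and U+3099 come out as width 2, which is the silent failure mode the comment is meant to guard against.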

EastAsianWidth.txt does include the combining property "Mn" in the comment
field, as in the lines quoted above, so it's tempting to just read that (and
then we would only need to read one file for these properties). However, it
seems risky to rely on comments, since their presence and format are probably
less stable than the data format.
--
John Naylor
EDB: http://www.enterprisedb.com

In response to

Re: badly calculated width of emoji in psql at 2021-08-24 17:50:50 from Jacob Champion

Responses

Re: badly calculated width of emoji in psql at 2021-08-26 15:12:58 from John Naylor
Re: badly calculated width of emoji in psql at 2021-08-26 15:25:22 from Jacob Champion
