From: | Larry Rosenman <ler(at)lerctr(dot)org> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>, Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>, "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, pgsql-hackers-owner(at)postgresql(dot)org |
Subject: | Re: Mac OS: invalid byte sequence for encoding "UTF8" |
Date: | 2016-02-10 22:39:11 |
Message-ID: | d94fdeb7997353bf0ba6906679a89d0c@thebighonker.lerctr.org |
Lists: | pgsql-hackers |
On 2016-02-10 16:19, Tom Lane wrote:
> I wrote:
>> Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru> writes:
>>> I think this is not a bug. It is normal behavior. On Mac OS, sscanf()
>>> with the %s format reads the string one byte at a time. The letter 'х'
>>> is 2 bytes long, and sscanf() splits it into two invalid characters.
>
>> That argument might be convincing if OSX behaved that way for all
>> multibyte characters, but it doesn't seem to be doing that. Why is
>> only 'х' affected?
>
> I looked into the OS X sources, and found that indeed you are right:
> *scanf processes the input a byte at a time, and applies isspace() to
> each byte separately, even when the locale is such that that's a clearly
> insane thing to do. Since this code was derived from FreeBSD, FreeBSD
> has or once had the same issue. (A look at the freebsd project on github
> says it still does, assuming that's the authoritative repo.) Not sure
> about other BSDen.
>
> I also verified that in UTF8-based locales, isspace() thinks that 0x85
> and 0xA0, and no other high-bit-set values, are spaces. Not sure exactly
> why it thinks that, but that explains why 'х' fails when adjacent code
> points don't.
>
> So apparently the coding rule we have to adopt is "don't use *scanf()
> on data that might contain multibyte characters". (There might be
> corner cases where it'd work all right for conversion specifiers other
> than %s, but probably you might as well just use strtol and friends in
> such cases.)
> Ugh.
>
> regards, tom lane
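To make the failure concrete: 'х' (U+0445) is encoded in UTF-8 as the two bytes 0xD1 0x85, and 0x85 is one of the two high-bit-set values that isspace() reportedly treats as a space in these locales, so a byte-at-a-time %s scan stops in the middle of the character. One way to follow the "don't use *scanf()" rule is to split tokens on ASCII whitespace only, without consulting the locale. The following is a minimal sketch under that assumption; is_ascii_space() and next_token() are hypothetical helper names for illustration, not existing PostgreSQL functions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * True only for the six ASCII whitespace bytes. The locale is never
 * consulted, so bytes >= 0x80 (such as the 0x85 in "х" = 0xD1 0x85)
 * are never treated as separators.
 */
static int
is_ascii_space(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

/*
 * Copy the next ASCII-whitespace-delimited token from *src into dst
 * (at most dstlen - 1 bytes plus a terminating NUL) and advance *src
 * past the token. Bytes that do not fit in dst are skipped, so an
 * overlong token is truncated. Returns the number of bytes stored,
 * or 0 if no token is left.
 */
static size_t
next_token(const char **src, char *dst, size_t dstlen)
{
    const char *p = *src;
    size_t      n = 0;

    assert(dstlen > 0);

    /* skip leading ASCII whitespace */
    while (*p && is_ascii_space((unsigned char) *p))
        p++;

    /* copy bytes up to the next ASCII whitespace or end of string */
    while (*p && !is_ascii_space((unsigned char) *p))
    {
        if (n + 1 < dstlen)
            dst[n++] = *p;
        p++;
    }

    dst[n] = '\0';
    *src = p;
    return n;
}
```

Calling next_token() repeatedly on a line walks through its words while leaving every multibyte sequence intact, because only bytes below 0x80 can end a token.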
Definitive FreeBSD Sources:
https://svnweb.freebsd.org/base/
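On the "strtol and friends" suggestion in the quoted message, here is a hedged sketch of parsing an integer without *scanf(). One caveat: strtol() itself skips leading whitespace as defined by isspace(), so this sketch insists that the input start with a sign or an ASCII digit, which keeps the locale out of the picture. parse_int() is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a base-10 integer at the start of s using strtol() rather than
 * sscanf("%d"). On success, store the value in *out, set *end to the
 * first unparsed byte, and return 1. Return 0 if s does not start with
 * a sign or ASCII digit, if no digits follow, or if the value does not
 * fit in an int.
 */
static int
parse_int(const char *s, int *out, const char **end)
{
    char *stop;
    long  v;

    assert(s != NULL && out != NULL && end != NULL);

    /* refuse leading whitespace so strtol() never calls isspace() on it */
    if (!(*s == '-' || *s == '+' || (*s >= '0' && *s <= '9')))
        return 0;

    errno = 0;
    v = strtol(s, &stop, 10);
    if (stop == s)
        return 0;               /* a bare sign with no digits */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return 0;               /* out of range for int */

    *out = (int) v;
    *end = stop;
    return 1;
}
```

The caller is expected to skip any separators itself (for example, ASCII whitespace only) before calling parse_int(), and can continue scanning from *end afterwards.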
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ler(at)lerctr(dot)org
US Mail: 7011 W Parmer Ln, Apt 1115, Austin, TX 78729-6961