Re: Mac OS: invalid byte sequence for encoding "UTF8"

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Cc: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>, "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Mac OS: invalid byte sequence for encoding "UTF8"
Date: 2016-02-10 16:58:00
Message-ID: 28139.1455123480@sss.pgh.pa.us
Lists: pgsql-hackers

Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru> writes:
> I agree that the previous patch is wrong. Instead of using the new
> parse_ooaffentry() function, it may be better to use sscanf() with the
> %ls format. The %ls format is used to read a wide character string.

No, that way is going to give you worse portability problems than what
we have now. Older implementations won't have %ls, and even if they
do, they might not have wcstombs(), which is the only way you'd get from
libc's idea of wide characters to an encoding we recognize.
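
For illustration only, here is a minimal sketch of what the %ls route
would involve. The helper name read_field_via_wide() and the buffer size
are hypothetical and not taken from any patch in this thread. Note that
both conversion steps depend on the current libc locale, so the bytes
that come back are in the locale's encoding, which is not necessarily an
encoding the server recognizes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper, not PostgreSQL code: read the first
 * whitespace-delimited field of "line" as wide characters via %ls, then
 * convert it back to a multibyte string with wcstombs().
 * Both steps use the current LC_CTYPE locale.
 * Returns 0 on success, -1 on failure.
 */
int
read_field_via_wide(const char *line, char *out, size_t outlen)
{
    wchar_t wbuf[64];
    size_t  n;

    if (outlen == 0)
        return -1;

    /* %63ls leaves room for the terminating L'\0'. */
    if (sscanf(line, "%63ls", wbuf) != 1)
        return -1;

    /* wcstombs() returns (size_t) -1 on an unconvertible character. */
    n = wcstombs(out, wbuf, outlen);
    if (n == (size_t) -1 || n == outlen)
        return -1;      /* conversion failed, or no room for '\0' */

    return 0;
}
```

A caller would have to call setlocale(LC_CTYPE, ...) first for non-ASCII
input to be handled at all, and a platform lacking either %ls or
wcstombs() could not build or run this.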

> I think this is not a bug. It is normal behavior. On Mac OS, sscanf()
> with the %s format reads the string one character at a time. The size of
> letter '' is 2, and sscanf() separates it into two wrong characters.

That argument might be convincing if OSX behaved that way for all
multibyte characters, but it doesn't seem to be doing that. Why is
only '' affected?
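
One way to check this character by character is a small probe along the
following lines. This is only a sketch: scanf_token_len() is a
hypothetical name, and it simply reports how many bytes sscanf() with %s
accepts as the first token.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical probe: return the byte length of the first token that
 * sscanf("%s") extracts from "s", or -1 if nothing is extracted.
 */
int
scanf_token_len(const char *s)
{
    char buf[256];

    /* %255s leaves room for the terminating '\0'. */
    if (sscanf(s, "%255s", buf) != 1)
        return -1;

    return (int) strlen(buf);
}
```

After calling setlocale(LC_ALL, "") on the affected machine, compare
scanf_token_len(ch) with strlen(ch) for each multibyte character of
interest. A shorter result means %s stopped in the middle of that
character, which would show whether the behavior is general or specific
to certain byte values.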

regards, tom lane
