Re: Mac OS: invalid byte sequence for encoding "UTF8"

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Cc:	Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>, "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Mac OS: invalid byte sequence for encoding "UTF8"
Date:	2016-02-10 16:58:00
Message-ID:	28139.1455123480@sss.pgh.pa.us
Lists:	pgsql-hackers

Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru> writes:
> I agree that the previous patch is wrong. Instead of using the new
> parse_ooaffentry() function, it may be better to use sscanf() with the
> %ls format. The %ls format is used to read a wide character string.

No, that way is going to give you worse portability problems than what
we have now. Older implementations won't have %ls, and even if they
do, they might not have wcstombs() which is the only way you'd get from
libc's idea of wide characters to an encoding we recognize.
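
To make the dependency chain concrete, here is a minimal sketch of the %ls route. It is not code from any patch in this thread: read_wide_field() and its buffer sizes are hypothetical names chosen for illustration. Both the sscanf() step and the wcstombs() step depend on the current LC_CTYPE locale, and the function fails if either one is unavailable or cannot convert the input.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper (illustration only): read the first
 * whitespace-delimited field of `line` as a wide string via %ls, then
 * convert it back to a multibyte string with wcstombs().
 *
 * Both conversions use the current LC_CTYPE locale, so the result is
 * only meaningful if that locale's encoding matches the input.
 * Returns 0 on success, -1 on failure.
 */
static int
read_wide_field(const char *line, char *out, size_t outsz)
{
    wchar_t wbuf[64];
    size_t  n;

    assert(line != NULL && out != NULL);
    if (outsz == 0)
        return -1;

    /* %63ls stores at most 63 wide characters plus the terminator */
    if (sscanf(line, "%63ls", wbuf) != 1)
        return -1;

    /* wcstombs() returns (size_t) -1 on an unconvertible character */
    n = wcstombs(out, wbuf, outsz);
    if (n == (size_t) -1 || n >= outsz)
        return -1;              /* conversion failed or no room for '\0' */

    return 0;
}
```

Even where both calls exist, the output is in whatever encoding the C library's locale uses, which is not necessarily the encoding the caller expects.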

> I think this is not a bug. It is normal behavior. On Mac OS, sscanf()
> with the %s format reads the string one character at a time. The letter
> '' is 2 bytes long, and sscanf() splits it into two invalid characters.

That argument might be convincing if OSX behaved that way for all
multibyte characters, but it doesn't seem to be doing that. Why is
only '' affected?
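
For comparison, a field splitter that never consults the locale cannot break a multibyte character, because it treats only ASCII whitespace bytes as separators. The following is a sketch under that assumption; next_field() and is_ascii_space() are hypothetical names, not the functions from the patch under discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Only ASCII whitespace separates fields; bytes >= 0x80 never do. */
static int
is_ascii_space(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\r' || c == '\v' || c == '\f';
}

/*
 * Hypothetical helper (illustration only): copy the next field of *sp
 * into out, advance *sp past it, and return the number of bytes stored.
 * Returns 0 when no field remains. A field longer than outsz - 1 bytes
 * is silently truncated.
 */
static size_t
next_field(const char **sp, char *out, size_t outsz)
{
    const char *s = *sp;
    size_t      len = 0;

    assert(outsz > 0);

    /* skip leading separators */
    while (*s && is_ascii_space((unsigned char) *s))
        s++;

    /* copy bytes up to the next separator or end of string */
    while (*s && !is_ascii_space((unsigned char) *s))
    {
        if (len + 1 < outsz)
            out[len++] = *s;
        s++;
    }

    out[len] = '\0';
    *sp = s;
    return len;
}
```

Because every byte of a UTF-8 multibyte sequence has its high bit set, a splitter of this kind keeps such sequences intact regardless of how the platform's isspace() classifies individual bytes.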

regards, tom lane

In response to

Re: Mac OS: invalid byte sequence for encoding "UTF8" at 2016-02-10 13:39:33 from Artur Zakirov

Responses

Re: Mac OS: invalid byte sequence for encoding "UTF8" at 2016-02-10 22:19:45 from Tom Lane
