From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | Peter Eisentraut <peter_e(at)gmx(dot)net> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, MauMau <maumau307(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [bug fix] strerror() returns ??? in a UTF-8/C database with LC_MESSAGES=non-ASCII |
Date: | 2013-09-09 18:57:28 |
Message-ID: | 20130909185728.GA217886@tornado.leadboat.com |
Lists: | pgsql-hackers |
On Mon, Sep 09, 2013 at 08:29:58AM -0400, Peter Eisentraut wrote:
> On 9/6/13 10:37 AM, Tom Lane wrote:
> > BTW: personally, I would say that what you're looking at is a glibc bug.
> > I always thought the contract of gettext was to return the ASCII version
> > if it fails to produce a translated version. That might not be what the
> > end user really wants to see, but surely returning something like "???"
> > is completely useless to anybody.
>
> The question marks come from iconv. Take a look at what this prints:
>
> iconv po/ja.po -f utf-8 -t us-ascii//translit
>
> If you use GNU libiconv, this will print a bunch of question marks.
Actually, GNU libiconv's iconv() decides that //translit is unimplementable
for some of the characters in that file, and it fails the conversion. GNU
libc's iconv(), on the other hand, emits the question marks.
> I think the use of //translit by gettext is poor judgement, because my
> experiments show that the quality of the results is poor and not useful
> for a user interface.
It depends on the quality of the //translit implementation. GNU libiconv's
seems pretty good. It gives up for Japanese or Russian characters, so you get
the English messages. For Polish, GNU libiconv transliterates like this:
msgstr "nie można usunąć pliku lub katalogu \"%s\": %s\n"
msgstr "nie mozna usuna'c pliku lub katalogu \"%s\": %s\n"
That's fair, considering what it has to work with. Ideally, (a) GNU libc
should import the smarter transliteration code from GNU libiconv, and (b) GNU
gettext should check for weak //translit implementations and not use
//translit under such circumstances.
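To try the behavior discussed above on a given system, here is a minimal sketch in C that asks iconv() to convert UTF-8 to US-ASCII with //TRANSLIT. The helper name translit_to_ascii is hypothetical, not something in PostgreSQL or gettext. What happens to non-ASCII input depends on the iconv implementation: per the discussion above, the conversion may fail outright (GNU libiconv, for characters it cannot transliterate) or succeed with substitutes such as question marks (GNU libc).

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: convert NUL-terminated UTF-8 text to US-ASCII using
 * the //TRANSLIT suffix.  Returns 0 and a NUL-terminated result in "out" on
 * success, or -1 if iconv_open() or iconv() reports failure (including an
 * output buffer that is too small).
 */
int
translit_to_ascii(const char *utf8, char *out, size_t outsize)
{
    iconv_t cd;
    char   *inptr;
    char   *outptr;
    size_t  inleft;
    size_t  outleft;
    size_t  rc;

    assert(utf8 != NULL && out != NULL && outsize > 0);

    inptr = (char *) utf8;      /* iconv() does not modify the input bytes */
    inleft = strlen(utf8);
    outptr = out;
    outleft = outsize - 1;      /* reserve room for the terminating NUL */

    cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
    if (cd == (iconv_t) -1)
        return -1;

    rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    if (rc != (size_t) -1)
        rc = iconv(cd, NULL, NULL, &outptr, &outleft);  /* flush any state */
    iconv_close(cd);

    if (rc == (size_t) -1)
        return -1;

    *outptr = '\0';
    return 0;
}
```

Calling it on the Polish msgstr above and printing either the result or the failure shows which of the two behaviors the local iconv() has. On systems where iconv lives in a separate library, linking may need -liconv.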
> My suggestion in this matter is to disable gettext processing when
> LC_CTYPE is set to C. We could log a warning when LC_MESSAGES is set to
> something and LC_CTYPE is set to C. Or just do the warning and keep
> logging. Something like that.
In an ENCODING=UTF8, LC_CTYPE=C database, no transliteration should need to
happen, and no transliteration does happen for the PG messages. I think
MauMau's original bind_textdomain_codeset() proposal was on the right track.
We would need to do that for every relevant 3rd-party message domain, though.
Ick. This suggests to me that gettext really needs an API for overriding the
default codeset pertaining to message domains not subjected to
bind_textdomain_codeset(). In the meantime, adding bind_textdomain_codeset()
calls for known localized dependencies seems like a fine coping mechanism.
If we can reasonably detect when gettext is supplying useless ????? messages,
that's good, too.
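A rough sketch of both coping mechanisms, under stated assumptions: the function names are hypothetical, the list of domains is whatever the caller knows about, and the question-mark test is only a heuristic I made up for illustration (at least three '?' characters, outnumbering letters, digits and non-ASCII bytes). Only bind_textdomain_codeset() itself is the real gettext API.

```c
#include <assert.h>
#include <ctype.h>
#include <libintl.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: request "codeset" for each known third-party message
 * domain.  Returns the number of domains for which
 * bind_textdomain_codeset() returned non-NULL.
 */
int
bind_domains_to_codeset(const char *const *domains, size_t ndomains,
                        const char *codeset)
{
    int     nbound = 0;

    assert(domains != NULL && codeset != NULL);

    for (size_t i = 0; i < ndomains; i++)
    {
        if (bind_textdomain_codeset(domains[i], codeset) != NULL)
            nbound++;
    }
    return nbound;
}

/*
 * Hypothetical heuristic: does "msg" look like a translation that was
 * replaced by question marks?  Bytes >= 0x80 count as real text, so a
 * properly encoded UTF-8 message is never flagged.
 */
bool
looks_like_question_marks(const char *msg)
{
    size_t  qmarks = 0;
    size_t  textual = 0;

    assert(msg != NULL);

    for (const unsigned char *p = (const unsigned char *) msg; *p; p++)
    {
        if (*p == '?')
            qmarks++;
        else if (*p >= 0x80 || isalnum(*p))
            textual++;
    }
    return qmarks >= 3 && qmarks > textual;
}
```

A caller in a UTF8 database might pass an array such as {"libc", ...} with codeset "UTF-8" once at startup, and fall back to the untranslated message whenever looks_like_question_marks() flags the translated one. The threshold is arbitrary and would need tuning against real messages.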
Thanks,
nm
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com