From: | J Smith <dark(dot)panda+lists(at)gmail(dot)com> |
---|---|
To: | Florian Pflug <fgp(at)phlo(dot)org> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: unaccent extension missing some accents |
Date: | 2011-11-06 23:43:22 |
Message-ID: | CADFUPgeEw31kAoY3_9nH==uP9QesYKKTwLV_OgwVKM=P1VvnFg@mail.gmail.com |
Lists: | pgsql-hackers |
On Sun, Nov 6, 2011 at 1:18 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>
> What's the locale of the database you're seeing this in, and which charset
> does it use?
>
> I think scanf() uses isspace() and friends, and last time I looked the
> locale definitions were all pretty bogus on OSX. So maybe scanf() somehow
> decides that 0xA0 is whitespace.
>
Ah, that does it: the locale I was using in the test code was just
plain ol' C locale, while in the database it was en_CA.UTF-8. Changing
the locale in my test code produced the wonky results. I should have
figured it was a locale problem. Sure enough, in a UTF-8 locale, it
believes that both 0xa0 and 0x85 are spaces. Pretty wonky behaviour
indeed.
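
For anyone who wants to reproduce this, here is a minimal sketch of the kind of check described above. The helper name `byte_is_space_in_locale` is made up for illustration and is not taken from my test code or from unaccent.c; it just switches LC_CTYPE, asks isspace() about a single byte, and restores the previous locale.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper (not part of unaccent.c): reports whether isspace()
 * treats `byte` as whitespace while LC_CTYPE is set to `locale_name`.
 * Returns 1 or 0, or -1 if the locale could not be queried or set.
 */
int
byte_is_space_in_locale(const char *locale_name, unsigned char byte)
{
    char        saved[256];
    const char *current;
    int         result;

    assert(locale_name != NULL);

    /* Remember the current LC_CTYPE so the caller's locale is left untouched. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL || strlen(current) >= sizeof(saved))
        return -1;
    strcpy(saved, current);

    /* Switch to the locale under test; NULL means it is not available. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    result = isspace(byte) != 0;

    /* Put the previous locale back before returning. */
    setlocale(LC_CTYPE, saved);
    return result;
}
```

Calling it with "C" and then with "en_CA.UTF-8" for the bytes 0xa0 and 0x85 is the comparison I mean; the result for the UTF-8 locale depends on the platform's libc, and the helper returns -1 where that locale is not installed.
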
Apparently this is a known OSX issue that has its roots in an older
version of FreeBSD's libc, I guess, eh? I've found various bug reports
that allude to the problem and they all seem to point that way.
I've attached a patch against master for unaccent.c that uses swscanf
along with char2wchar and wchar2char instead of sscanf directly to
initialize the unaccent extension and it appears to fix the problem in
both the master and 9.1 branches.
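
To illustrate the general idea (this is not the attached patch), here is a rough standalone sketch. It assumes a rule line of the form "src trg" separated by whitespace, and it uses the standard mbstowcs() as a stand-in for PostgreSQL's char2wchar, which is not available outside the backend. The function name `parse_rule_line` and the buffer sizes are invented for the example. The point is that the line is decoded into wide characters first, so the scanner looks at whole characters and not at individual bytes such as 0xa0.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

#define FIELD_LEN 64            /* capacity of each output field, in wchar_t */
#define LINE_LEN  256           /* capacity of the decoded line, in wchar_t */

/*
 * Hypothetical helper: decodes `line` from the current locale's multibyte
 * encoding and splits it into two whitespace-separated fields.
 * `src` and `trg` must each have room for FIELD_LEN wide characters.
 * Returns the number of fields parsed (0 to 2), or -1 if the line is not
 * valid in the current locale's encoding.
 */
int
parse_rule_line(const char *line, wchar_t *src, wchar_t *trg)
{
    wchar_t wline[LINE_LEN];
    size_t  n;
    int     got;

    assert(line != NULL && src != NULL && trg != NULL);

    /* Decode at most LINE_LEN - 1 characters, leaving room for the terminator. */
    n = mbstowcs(wline, line, LINE_LEN - 1);
    if (n == (size_t) -1)
        return -1;
    wline[n] = L'\0';

    /* Width 63 keeps each field within FIELD_LEN including its terminator. */
    got = swscanf(wline, L"%63ls %63ls", src, trg);

    /* swscanf returns EOF when the input ends before any field is read. */
    return got < 0 ? 0 : got;
}
```

The caller would need to have set LC_CTYPE to the database's locale beforehand for non-ASCII input to decode; inside the backend the conversion would go through char2wchar and wchar2char as described above.
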
I haven't added any tests in the expected output file 'cause I'm not
exactly sure what I should be testing against, but I could take a
crack at that, too, if the patch looks reasonable and is usable.
Cheers.
Attachment | Content-Type | Size |
---|---|---|
0001-Fix-weirdness-when-dealing-with-UTF-8-in-buggy-libc-.patch | application/octet-stream | 1.3 KB |