From: | J Smith <dark(dot)panda+lists(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: unaccent extension missing some accents |
Date: | 2011-11-07 16:46:46 |
Message-ID: | CADFUPgeUqK3qqUkV=8H85UXcLMmKq7oHtm4tAkpf2n16Xsk0MQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Nov 7, 2011 at 11:12 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I looked at this a bit and realized that sscanf is actually doing a
> couple of critical things for us, which are lost in translation when
> doing it like this:
>
> 1. It ignores whitespace other than the dividing tab. If we don't
> continue to do that, we'll likely break existing config files.
>
> 2. It ensures that src and trg each consist of at least one (nonblank)
> character. placeChar() is critically dependent on the assumption that
> src is not empty.
>
> However, after looking around a bit at the other tsearch config-file-
> reading functions, I noted that they all use t_isspace() to identify
> whitespace ... and that function in fact should be okay on OS X,
> because it uses iswspace in multibyte encodings.
>
> So it's fairly simple to improve this code to reject whitespace that
> way. I don't like the existing code anyway because of its potential
> vulnerability to buffer overrun. I'll fix it up and commit.
>
>> As for the other problems with isspace and such on OSX, it might be
>> worth looking at the python portability fixes.
>
> If OS X's UTF8 locales weren't so thoroughly broken (eg sorting does not
> work), I might be tempted to try to do it that way, but I still fail
> to see the point. After reviewing the code I feel that unaccent needs
> to be fixed because it's not consistent with the other tsearch config
> file parsers, and not so much because it works or doesn't work on any
> specific platform.
>
Yeah, I never knew there was such a problem with OSX and UTF8 before
running into it here but it's good to know. When I noticed the
unnaccent extension in more recent PostgreSQL versions, I figured it
would perform better than our current plperl-based accent stripping
function (which it surely does) and just noticed the results on my
machine were a little off, but our linux-based servers were fine and
dandy and yadda yadda yadda.
Anyways, lemme know if there's anything else I could help with or
could test and whatnot. Cheers.
From | Date | Subject | |
---|---|---|---|
Next Message | Florian Pflug | 2011-11-07 16:53:04 | Re: unaccent extension missing some accents |
Previous Message | Jeff Davis | 2011-11-07 16:28:15 | Re: btree gist known problems |