Re: unaccent extension missing some accents

From:	J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: unaccent extension missing some accents
Date:	2011-11-07 16:46:46
Message-ID:	CADFUPgeUqK3qqUkV=8H85UXcLMmKq7oHtm4tAkpf2n16Xsk0MQ@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Nov 7, 2011 at 11:12 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I looked at this a bit and realized that sscanf is actually doing a
> couple of critical things for us, which are lost in translation when
> doing it like this:
>
> 1. It ignores whitespace other than the dividing tab. If we don't
> continue to do that, we'll likely break existing config files.
>
> 2. It ensures that src and trg each consist of at least one (nonblank)
> character. placeChar() is critically dependent on the assumption that
> src is not empty.
>
> However, after looking around a bit at the other tsearch config-file-
> reading functions, I noted that they all use t_isspace() to identify
> whitespace ... and that function in fact should be okay on OS X,
> because it uses iswspace in multibyte encodings.
>
> So it's fairly simple to improve this code to reject whitespace that
> way. I don't like the existing code anyway because of its potential
> vulnerability to buffer overrun. I'll fix it up and commit.
>
>> As for the other problems with isspace and such on OSX, it might be
>> worth looking at the python portability fixes.
>
> If OS X's UTF8 locales weren't so thoroughly broken (eg sorting does not
> work), I might be tempted to try to do it that way, but I still fail
> to see the point. After reviewing the code I feel that unaccent needs
> to be fixed because it's not consistent with the other tsearch config
> file parsers, and not so much because it works or doesn't work on any
> specific platform.
>
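
For anyone following along, the two properties Tom lists above (skip whitespace around the separator, and insist that src and trg are each non-empty) can be sketched without sscanf roughly as follows. This is only an illustration, not the fix that gets committed: `parse_rule_line`, `copy_token` and `is_space` are hypothetical names, and `is_space` is a plain `isspace` stand-in for PostgreSQL's `t_isspace()`, so it does not reproduce the multibyte handling discussed in the thread. Passing explicit buffer sizes is one way to avoid the overrun risk mentioned above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: single-byte stand-in for PostgreSQL's t_isspace(). */
static bool is_space(const char *p)
{
    return isspace((unsigned char) *p) != 0;
}

/* Copy [start, end) into out as a NUL-terminated string.
 * Fails if the token is empty or does not fit in outsize bytes. */
static bool copy_token(const char *start, const char *end,
                       char *out, size_t outsize)
{
    size_t len = (size_t) (end - start);

    if (len == 0 || len >= outsize)
        return false;
    memcpy(out, start, len);
    out[len] = '\0';
    return true;
}

/* Parse "src <whitespace> trg" from one rules-file line.
 * Returns true only if both tokens are non-empty and fit their buffers. */
static bool parse_rule_line(const char *line,
                            char *src, size_t srcsize,
                            char *trg, size_t trgsize)
{
    const char *p = line;
    const char *start;

    assert(line != NULL);

    while (*p && is_space(p))           /* leading whitespace */
        p++;
    start = p;
    while (*p && !is_space(p))          /* src token */
        p++;
    if (!copy_token(start, p, src, srcsize))
        return false;

    while (*p && is_space(p))           /* separator (tab and any blanks) */
        p++;
    start = p;
    while (*p && !is_space(p))          /* trg token */
        p++;
    if (!copy_token(start, p, trg, trgsize))
        return false;

    while (*p && is_space(p))           /* trailing whitespace or newline */
        p++;
    return *p == '\0';                  /* this sketch rejects extra tokens */
}
```

A caller would skip any line for which `parse_rule_line` returns false, so a helper such as placeChar() is never handed an empty src. Rejecting a third token is a choice made for this sketch only.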

Yeah, I never knew there was such a problem with OS X and UTF-8 before
running into it here, but it's good to know. When I noticed the
unaccent extension in more recent PostgreSQL versions, I figured it
would perform better than our current plperl-based accent-stripping
function (which it surely does). I just noticed the results on my
machine were a little off, while our Linux-based servers were fine and
dandy and yadda yadda yadda.

Anyways, lemme know if there's anything else I could help with or
could test and whatnot. Cheers.

In response to

Re: unaccent extension missing some accents at 2011-11-07 16:12:47 from Tom Lane

Responses

Re: unaccent extension missing some accents at 2011-11-07 16:53:04 from Florian Pflug
Re: unaccent extension missing some accents at 2011-11-07 16:59:47 from Tom Lane