
Re: [HACKERS] tsearch2 in postgresql 8.3.1 - invalid byte sequence for encoding "UTF8": 0xc3

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	Richard Huxton <dev(at)archonet(dot)com>, patrick <patrick(at)11h11(dot)com>, pgsql-hackers(at)postgresql(dot)org, PG-General Mailing List <pgsql-general(at)postgresql(dot)org>
Subject:	Re: [HACKERS] tsearch2 in postgresql 8.3.1 - invalid byte sequence for encoding "UTF8": 0xc3
Date:	2008-03-20 14:40:13
Message-ID:	8739.1206024013@sss.pgh.pa.us
Lists:	pgsql-general pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> On Wed, Mar 19, 2008 at 07:55:40PM -0400, Tom Lane wrote:
>> (that's \303\240 or 0xc3 0xa0). I am thinking that something decided
>> the \240 was junk and removed it.

> Hmm, it is coincidently the space character +0x80, which is defined as
> a non-breaking space in many Latin encodings.

Yeah, that's what I'm thinking about. I poked around in Microsoft's
documentation and couldn't find any suggestion that fgets() would
remove such a character, however.
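To make the byte-level picture concrete, here is a minimal Python sketch of the suspected damage. It only mimics the hypothesis (something dropping the 0xa0 byte); it does not reproduce anything fgets() is known to do, and the helper names strip_nbsp_bytes and is_valid_utf8 are made up for illustration.

```python
def strip_nbsp_bytes(data: bytes) -> bytes:
    """Hypothetical stand-in for whatever discarded the 0xa0 byte."""
    return data.replace(b"\xa0", b"")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "à" is the two-byte UTF-8 sequence 0xc3 0xa0 (octal \303\240).
encoded = "à".encode("utf-8")

# Taken on its own, 0xa0 is the non-breaking space in Latin-1.
nbsp = b"\xa0".decode("latin-1")

# Dropping the second byte leaves a lone 0xc3 lead byte,
# which is not a complete UTF-8 sequence.
damaged = strip_nbsp_bytes(encoded)

print(encoded, damaged, is_valid_utf8(damaged))
```

A lone 0xc3 is the byte named in the error message from the subject line, which is consistent with the continuation byte having gone missing.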

Another possible theory is that the french.stop file got edited using
something that had the wrong idea about the file's encoding, and
proceeded to throw away the nbsp.
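One way to test that theory is to scan the installed stop file for lines that are no longer valid UTF-8. The following is a rough sketch, not an existing tool: the function names are hypothetical, and the path to french.stop depends on the installation, so it is left for the caller to supply.

```python
from typing import List, Tuple


def find_invalid_utf8_lines(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (line number, raw bytes) for each line that is not valid UTF-8."""
    bad = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((lineno, line))
    return bad


def check_stop_file(path: str) -> List[Tuple[int, bytes]]:
    """Read a stop-word file as raw bytes and report the damaged lines."""
    with open(path, "rb") as f:
        return find_invalid_utf8_lines(f.read())
```

Calling check_stop_file() on the suspect file and on a pristine copy from the source tarball would show whether the file on disk was altered, as opposed to the bytes being lost at read time.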

regards, tom lane

In response to

Re: tsearch2 in postgresql 8.3.1 - invalid byte sequence for encoding "UTF8": 0xc3 at 2008-03-20 13:16:04 from Martijn van Oosterhout
