Character encoding problems

From:	Bruce Clay <bclay1297(at)att(dot)net>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Character encoding problems
Date:	2011-12-09 03:54:31
Message-ID:	35b888aa-eac8-4b23-9f17-a04feb58854b@Mariah
Lists:	pgsql-general

Sorry for the duplicate postings. I have only received one reply so far, and that was a suggestion to post to this forum.

I am trying to build a database to support natural language processing from a variety of data files posted on the internet. Many of them are identified as using UTF-8 encoding. Some of these are dictionary files for WinEdt. Some are from an open source multilingual health care package.

When I try to build a table from several of the different languages, I get the following error:

ERROR: invalid byte sequence for encoding "UTF8": 0x82
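
Before loading, it may help to check whether a file that claims to be UTF-8 really is. The following is a minimal Python sketch, not part of the original load procedure; the function name find_invalid_utf8 is hypothetical. It reports the offset and value of each byte that cannot start a valid UTF-8 sequence, which would show where a byte such as 0x82 sits in the file.

```python
from typing import List, Tuple


def find_invalid_utf8(data: bytes, limit: int = 10) -> List[Tuple[int, int]]:
    """Return up to `limit` (offset, byte) pairs where UTF-8 decoding fails.

    Hypothetical helper for illustration. It rescans from just past each
    failure, so it is meant for spot checks rather than very large files
    with many errors.
    """
    problems: List[Tuple[int, int]] = []
    pos = 0
    while pos < len(data) and len(problems) < limit:
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as exc:
            offset = pos + exc.start  # absolute offset of the bad byte
            problems.append((offset, data[offset]))
            pos = offset + 1  # resume just after the bad byte
    return problems


if __name__ == "__main__":
    # A lone 0x82 is a continuation byte with no lead byte before it,
    # so it is not valid UTF-8 on its own.
    sample = b"caf\x82"
    for offset, byte in find_invalid_utf8(sample):
        print(f"offset {offset}: 0x{byte:02x}")
```

For a real file, the bytes could be read with open(path, "rb").read(), where path is whichever word list fails to load.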

I checked the database encoding and it is indeed set to UTF8. I tried to create databases using a variety of other encodings, such as WIN1252, and I got the same error message from all of them except SQL_ASCII.

When I created the database using SQL_ASCII, I received a warning that the database could only store 7-bit data. When I loaded the data into this database I did not get any errors, and when I look at the data it seems to be the same as in the original text file.
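
SQL_ASCII stores bytes above 127 without validating or converting them, which would explain why the data loads cleanly and looks identical to the source file. That says nothing about which encoding the files actually use. As a rough sketch (the candidate list and the function name decode_candidates are assumptions for illustration, not a diagnosis of these particular files), Python can show how the offending byte reads under a few common single-byte encodings:

```python
from typing import Dict, Optional

# Assumed candidates only: the real source encoding of the files is unknown.
CANDIDATES = ["cp1252", "cp437", "cp850", "latin-1"]


def decode_candidates(raw: bytes) -> Dict[str, Optional[str]]:
    """Decode `raw` under each candidate encoding.

    A value of None means the bytes are not valid in that encoding.
    """
    results: Dict[str, Optional[str]] = {}
    for name in CANDIDATES:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    for name, text in decode_candidates(b"\x82").items():
        # ascii() keeps the output readable on any terminal
        print(f"{name}: {ascii(text)}")
```

If one of the candidates yields sensible words for a given file, the file could be re-encoded with raw.decode(name).encode("utf-8") before loading it into a UTF8 database.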

Is there a "proper" encoding type that I should use to load the word lists so they can be interoperable with the WordNet dataset that happily uses the UTF8 encoding?

Bruce

Responses

Re: Character encoding problems at 2011-12-09 08:20:21 from John R Pierce
