Re: Unicode database question

From: Tino Wildenhain <tino(at)wildenhain(dot)de>
To: Lynna Landstreet <lynna(at)gallery44(dot)org>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Unicode database question
Date: 2003-07-17 06:00:59
Message-ID: 3F163B9B.8010903@wildenhain.de
Lists: pgsql-general

Hi Lynna,

We run PostgreSQL with UNICODE encoding in production for our
shop. This basically means it stores and retrieves strings encoded
as UTF-8. If you don't need special collating rules, that's the way
to go. However, we use Python/Zope in front of the DB for
presentation, and PHP may behave differently.

Another problem with web clients is that they sometimes submit
forms in a default charset rather than the one your form's HTML
was served with. That is, you send a page as UTF-8 and the POST
request comes back as ISO-8859-1 or something like that. This is
irritating and should be investigated. Try a simple recording proxy
or a packet sniffer that can reassemble TCP streams to log what's
going over the wire.

One workaround for this browser bug is to mark the page with a
well-known string that gets sent back with the submission (say, in a
hidden form field) and undergoes the same charset rules as the rest
of the form. When you get the string back, you can check which
encoding/charset was used.
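
As a rough sketch of the hidden-field check in Python: the field
name "charset_probe", the probe text, the candidate charset list and
all function names below are made up for illustration, and the code
assumes you can get at the raw, still percent-encoded form body
before your framework decodes it.

```python
from urllib.parse import unquote_to_bytes

PROBE_FIELD = "charset_probe"         # hypothetical hidden field name
PROBE_TEXT = "\u00e7\u00e3"           # c-cedilla + a-tilde, sent in the page
CANDIDATES = ("utf-8", "iso-8859-1")  # charsets we are prepared to accept


def parse_form_bytes(body):
    """Split a urlencoded body into {name: raw value bytes}, undecoded."""
    fields = {}
    for pair in body.split("&"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        key = unquote_to_bytes(name.replace("+", " ")).decode("ascii", "replace")
        fields[key] = unquote_to_bytes(value.replace("+", " "))
    return fields


def detect_charset(probe_bytes):
    """Return the candidate charset that produces exactly these bytes."""
    for charset in CANDIDATES:
        if PROBE_TEXT.encode(charset) == probe_bytes:
            return charset
    return None


def decode_form(body):
    """Decode every field using the charset revealed by the probe field."""
    fields = parse_form_bytes(body)
    charset = detect_charset(fields.get(PROBE_FIELD, b""))
    if charset is None:
        raise ValueError("probe field missing or in an unexpected charset")
    return {name: raw.decode(charset) for name, raw in fields.items()}


# A browser that posted the form as ISO-8859-1 instead of UTF-8:
latin1_body = "charset_probe=%E7%E3&lastname=Ascen%E7%E3o"
print(decode_form(latin1_body)["lastname"])
```

Once the values are decoded to proper strings they can be handed to
the database as UTF-8. For what it's worth, the bytes 0xe7e36f in the
error quoted below are exactly how the last three letters of that
name are encoded in ISO-8859-1, which would be consistent with the
form data arriving in that charset rather than UTF-8.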

HTH
Tino Wildenhain

Lynna Landstreet wrote:
> Hello,
>
> I'm running into a bit of trouble with a Unicode-enabled PostgreSQL database
> (some of the data consists of artist and/or image names in other languages,
> like French, Spanish, German and Portuguese, which frequently have accents,
> and I don't want people entering data to have to use ASCII codes). Having (I
> thought) managed to get past the issues of exporting text as Unicode in
> order to import it into the database and uploading the text files as binary
> instead of data to keep them Unicode/UTF-8 as I upload them, and then using
> psql's \copy command to insert the data into the database, I can't get the
> special characters to display properly on the web. :-(
>
> I'm not even sure how to tell if the problem is on the input side or the
> output side - as in, whether it's that the data in the database got muddled
> on the way in and is not valid Unicode, or whether it's OK but every means I
> try to use to view it doesn't want to accept Unicode. I'm pretty sure the
> text files got to the server OK as Unicode, because I was able to view them
> directly with a web browser and the special characters were OK then. But
> when I imported them into the database, I was not then able to view the
> special characters correctly, either in my browser through the PHP frontend
> I'm developing for the database or phpPgAdmin, or via Telnet/SSH. So I don't
> know if the problem came about somehow while using \copy to import them, or
> with the means I'm using to view them.
>
> I've set the charset encoding of my PHP pages to UTF-8, and the default
> encoding in my browser as well, but that doesn't help. And I've tried
> editing the data through phpPgAdmin to restore the special characters, but
> got the following error message:
>
> Error - /[path to my web directory]/phpPgAdmin/tbl_replace.php -- Line: 77
>
> PostgreSQL said: ERROR: Invalid UNICODE character sequence found (0xe7e36f)
> Your query:
> UPDATE "artists" SET "artist_id" = 485, "firstname" = 'Teresa', "lastname" =
> 'Ascenção'... [rest of query deleted]
>
> Ironically, the accented characters in her last name (a c with a cedilla and
> an a with a tilde, in case they don't show up here) displayed fine in the
> error message! But it wouldn't enter them into the database.
>
> Questions that come to mind:
>
> 1. Does anyone have any idea what's going wrong here?
> 2. Can \copy reduce UTF-8 text to plain ASCII while importing data from a
> text file?
> 3. If so, can it be made not to, maybe through adding some kind of parameter
> to the command? Or is there a better way to import the data?
> 4. Is it correct for the database encoding to be "UNICODE" or should it be
> UTF-8 specifically? My impression thus far was that Unicode and UTF-8 were
> more or less the same thing, but maybe more or less isn't good enough.
> 5. Does a web form have to be specially coded to accept text with accented
> characters into a database, or does the encoding of the database itself
> and/or the web page the form is on determine that?
>
> Any assistance would be much appreciated...
>
>
> Lynna
