Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"

From: Andreas Kalsch <andreaskalsch(at)gmx(dot)de>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"
Date: 2009-08-04 15:19:54
Message-ID: 4A78519A.9050809@gmx.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I know what you are talking about, but I am not sure how many websites
really check for incoming encoding. Usually you trust that the client
will use the same encoding for sending data as the server has sent.
(This is what I mean with my simplified chain)

It's some extra work to do converting from exotic sets and it's the way
to go for a site with high traffic. I would prefer to do it on the
PHP/Python side to send in one encoding to the database server, because
it could be that you additionally send own strings.

Alban Hertroys schrieb:
> On 4 Aug 2009, at 15:02, Andreas Kalsch wrote:
>
>> Alban,
>>
>> what I do to simplify the data chain:
>>
>> HTTP encoding > PHP string encoding > client connection > server -
>> all is UTF8. Plus invalid byte check in PHP (or server).
>
> You're missing my point. You start dealing with the encoding of the
> data you receive too late. Here's what happens:
>
> 1). User enters data in your form using his web browser. This user has
> set his OS to some locale and encoding, say it's WIN1252-JP.
>
> 2). User presses "submit" and the web-browser generates a POST request
> that CONTAINS the encoding the user used to enter his data.
>
> 3). Your scripts process this data as if it were UTF-8, but it ISN'T
> UTF-8, it's WIN1252-JP.
>
> 4). You try to store this (now) mixedly encoded data into an UTF-8
> encoded database and it fails for obvious reasons.
>
>
> To solve this you can try to filter out "wrong" characters, but they
> are not wrong. They're just differently encoded than you think they are.
>
> What you should be doing is look at the encoding specified in the POST
> request and either:
>
> a.) convert the data to UTF-8 before you start processing it (both PHP
> and Python have functions to do that) or,
>
> b.) pass the encoding on to your database queries so that the database
> knows how to convert them to UTF-8.
>
> Of course it can't hurt to verify that the encoding specified in the
> POST request does indeed match the posted data, but that's a different
> story.
>
> P.S. Please don't top post and keep some context of what you're
> replying to. Your messages are a bit confusing the way you write them.
>
>> What I have tested inside Postgres is entering a 3 byte UTF8
>> character to this function. And I have got an error. This is a
>> character I will not filter out, if some Unicode artists will enter
>> it. It is an international website and the simplification is just for
>> indexing.
>>
>> But I think that this will not solve the problem and I have to use
>> Python or Perl to get it done.
>>
>>
>> Alban Hertroys schrieb:
>>> On 4 Aug 2009, at 24:57, Andreas Kalsch wrote:
>>>
>>>>> I think the real problem is: Where do you lose the original
>>>>> encoding the users input their data with? If you specify that
>>>>> encoding on the connection and send it to a database that can
>>>>> handle UTF-8 then you shouldn't be getting any conversion problems
>>>>> in the first place.
>>>> Nowhere - I will validate input data on the client side (PHP or
>>>> Python) and send it to the server. Of course the only encoding I
>>>> will use on any side is UTF8. I just wnated to use this Latin thing
>>>> for simplification of characters.
>>>
>>> Yes you are. How could your users input invalid characters in the
>>> first place if that were not the case? You're not suggesting they
>>> managed to enter characters in an encoding for which they weren't
>>> valid on their own systems, do you?[1]
>>>
>>> You say your client is using PHP or Python, which suggests it's a
>>> website. That means the input goes like this: web browser -> website
>>> -> database. All three of those steps use some encoding and you can
>>> take them into account. That should prevent this problem altogether.
>>>
>>> You have control over which encoding your client and the database
>>> use, and the web browser tells what encoding it used in the POST
>>> request so you can pass that along to the database when storing data
>>> or convert it in your client.
>>>
>>> [1] There exists of course a small group of people who enjoy posting
>>> raw byte data to a web-form, but would it matter whether they'd get
>>> an error about their encoding or not? They do not intend to enter
>>> valid data after all ;)
>>>
>>> Alban Hertroys
>>>
>>> --
>>> If you can't see the forest for the trees,
>>> cut the trees and you'll see there is no forest.
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>>
>>
>>
>
> Alban Hertroys
>
> --
> If you can't see the forest for the trees,
> cut the trees and you'll see there is no forest.
>
>
> !DSPAM:933,4a784d8610131719966171!
>
>
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Greg Stark 2009-08-04 15:23:28 Re: character 0xe29986 of encoding "UTF8" has no equivalent in "LATIN2"
Previous Message Paul Taylor 2009-08-04 15:04:42 Re: Does derby have an embedded Mode like Derby ?