From: | "David E(dot) Wheeler" <david(at)kineticode(dot)com> |
---|---|
To: | Alex Hunsaker <badalex(at)gmail(dot)com> |
Cc: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: plperlu problem with utf8 |
Date: | 2010-12-17 03:24:46 |
Message-ID: | C9982425-2453-479A-88FB-D12B6F20839B@kineticode.com |
Lists: | pgsql-hackers |
On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:
> You might argue this is a bug with URI::Escape as I *think* all URIs
> will be utf8 encoded. Anyway, I think postgres is doing the right
> thing here.
No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, Perl assumes that it's Latin-1.
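To make the Latin-1 point concrete, here is a minimal sketch using only the core Encode module. The sample string (the UTF-8 octets for "é") and all variable names are assumptions chosen for illustration; nothing here comes from URI::Escape itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two octets: the UTF-8 encoding of U+00E9 (e with acute accent).
my $octets = "\xC3\xA9";

# Undecoded, Perl sees two characters, each with a code point below 256,
# which is the same as treating the octets as Latin-1.
my $undecoded_length = length($octets);

# Decoded, Perl sees the single character U+00E9.
my $chars          = decode('UTF-8', $octets);
my $decoded_length = length($chars);

# Encoding the undecoded octets encodes them a second time (4 octets),
# while encoding the decoded string round-trips to the original 2 octets.
my $double_encoded = encode('UTF-8', $octets);
my $round_trip     = encode('UTF-8', $chars);

printf "undecoded: %d chars, decoded: %d char\n",
    $undecoded_length, $decoded_length;
printf "double-encoded: %d octets, round trip: %d octets\n",
    length($double_encoded), length($round_trip);
```

Any module that works on characters, URI::Escape included, will see the two-character form unless the caller decodes first.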
> In playing around I did find what I think is a postgres bug. Perl has
> 2 ways it can store things internally. per perldoc perlunicode:
>
> Using Unicode in XS
> ... What the "UTF8" flag means is that the sequence of octets in the
> representation of the scalar is the sequence of UTF-8 encoded code
> points of the characters of a string. The "UTF8" flag being off means
> that each octet in this representation encodes a single character with
> code point 0..255 within the string.
>
> Postgres always prints whatever the internal representation happens to
> be, ignoring both the UTF8 flag and the server encoding.
>
> # create or replace function chr(i int, i2 int) returns text as $$
> return chr($_[0]).chr($_[1]); $$ language plperlu;
> CREATE FUNCTION
>
> # show server_encoding;
> server_encoding
> -----------------
> SQL_ASCII
>
> # SELECT length(chr(128, 33));
> length
> --------
> 2
>
> # SELECT length(chr(128, 333));
> length
> --------
> 4
>
> Grr, that should error out with "Invalid server encoding", or worst
> case should return a length of 3 (it utf8 encoded 128 into 2 bytes
> instead of leaving it as 1). In this case the 333 causes perl to store
> it internally as utf8.
Well, with SQL_ASCII anything goes, no?
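For anyone wanting to see the two internal representations outside the database, here is a rough sketch in plain Perl. The helper name internal_length() is hypothetical; it relies on the core bytes pragma to count the octets in Perl's internal buffer, which is roughly what a plain SvPV would hand back.

```perl
use strict;
use warnings;

# Hypothetical helper: size of Perl's internal buffer for a string.
sub internal_length {
    my ($string) = @_;
    use bytes;    # lexically scoped: length() now counts internal octets
    return length($string);
}

my $low  = chr(128) . chr(33);     # all code points below 256
my $high = chr(128) . chr(333);    # 333 does not fit in a single octet

for my $case ([ 'chr(128).chr(33)', $low ], [ 'chr(128).chr(333)', $high ]) {
    my ($label, $string) = @$case;
    printf "%-18s UTF8 flag: %-3s internal octets: %d\n",
        $label,
        (utf8::is_utf8($string) ? 'on' : 'off'),
        internal_length($string);
}
```

The first string stays unflagged at 2 octets and the second is flagged at 4 octets, which lines up with the lengths 2 and 4 returned under SQL_ASCII above.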
> Now on a utf8 database:
>
> # show server_encoding;
> server_encoding
> -----------------
> UTF8
>
> # SELECT length(chr(128, 33));
> ERROR: invalid byte sequence for encoding "UTF8": 0x80
> CONTEXT: PL/Perl function "chr"
>
> # SELECT length(chr(128, 333));
> CONTEXT: PL/Perl function "chr"
> length
> --------
> 2
>
> Same thing here, we just end up using the internal format. In one
> case it works, in the other it does not. The main point being, most of
> the time it *happens* to work. But it's really just by chance.
>
> I think what we should do is use SvPVutf8() when we are UTF8 instead
> of SvPV in sv2text_mbverified(). SvPV gives us a pointer to a string
> in perl's current internal format (maybe unicode, maybe a utf8 byte
> sequence), whereas SvPVutf8 will always give us a utf8-encoded string
> (which may or may not be valid!).
>
> Something like the attached. Thoughts? I'm not very happy with the
> non-utf8 case: the elog(ERROR, "invalid byte sequence") is a total
> cop-out, yes. But I did not see a good solution short of hand-rolling
> our own version of sv_utf8_downgrade(). Is it worth it?
> <plperl_encoding.patch>
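The difference between the two accessors can be approximated at the Perl level. This is only an analogy, not the attached patch: the core functions utf8::encode() and utf8::downgrade() stand in for the C-level SvPVutf8() and sv_utf8_downgrade(), and utf8_octets() is a hypothetical helper name.

```perl
use strict;
use warnings;

# Rough analogue of SvPVutf8(): always return UTF-8 octets, whatever the
# internal representation of the string is.
sub utf8_octets {
    my ($string) = @_;        # a copy, because utf8::encode() works in place
    utf8::encode($string);
    return $string;
}

# Format a byte string as space-separated hex octets.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

my $low  = chr(128) . chr(33);
my $high = chr(128) . chr(333);

printf "low:  %s\n", hex_dump(utf8_octets($low));
printf "high: %s\n", hex_dump(utf8_octets($high));

# Rough analogue of sv_utf8_downgrade(): with a true second argument,
# utf8::downgrade() returns false instead of dying when a character
# does not fit in a single octet.
my ($low_copy, $high_copy) = ($low, $high);
my $low_ok  = utf8::downgrade($low_copy,  1) ? 1 : 0;
my $high_ok = utf8::downgrade($high_copy, 1) ? 1 : 0;
print "downgrade low: $low_ok, high: $high_ok\n";
```

With this approach chr(128).chr(33) comes out as the 3 octets C2 80 21, the "length of 3" case mentioned above, and the string containing chr(333) cannot be downgraded at all, which is the awkward non-utf8 case.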
Maybe I'm misunderstanding, but it seems to me that:
* String arguments passed to PL/Perl functions should be decoded from the server encoding to Perl's internal representation before the function actually gets them.
* Values returned from PL/Perl functions that are in Perl's internal representation should be encoded into the server encoding before they're returned.
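The two points above amount to a decode-on-entry, encode-on-exit boundary. A minimal sketch in plain Perl, assuming the server encoding is known by name; call_with_encoding() and $server_encoding are hypothetical names for illustration, not anything PL/Perl provides.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption for this sketch: the server encoding is known by name.
my $server_encoding = 'UTF-8';

# Hypothetical wrapper: decode the arguments, run the function body on
# characters, then encode the result back into the server encoding.
sub call_with_encoding {
    my ($body, @raw_args) = @_;
    my @args   = map { decode($server_encoding, $_) } @raw_args;
    my $result = $body->(@args);
    return encode($server_encoding, $result);
}

# A function body that works purely on characters.
my $reverse_chars = sub { return scalar reverse $_[0] };

my $input  = "\xC5\x8D!";    # UTF-8 octets for U+014D (chr(333)), then "!"
my $output = call_with_encoding($reverse_chars, $input);

printf "%s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $output;
```

Reversing the decoded characters keeps the two octets of U+014D together in the output; reversing the raw octets instead would swap them and produce an invalid UTF-8 sequence.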
I didn't really follow all of the above; are you aiming for the same thing?
Best,
David