| From: | "David E(dot) Wheeler" <david(at)kineticode(dot)com> | 
|---|---|
| To: | Alex Hunsaker <badalex(at)gmail(dot)com> | 
| Cc: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: plperlu problem with utf8 | 
| Date: | 2010-12-17 03:24:46 | 
| Message-ID: | C9982425-2453-479A-88FB-D12B6F20839B@kineticode.com | 
| Lists: | pgsql-hackers | 
On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:
> You might argue this is a bug with URI::Escape as I *think* all URIs
> will be utf8 encoded.  Anyway, I think postgres is doing the right
> thing here.
No, URI::Escape is fine. The issue is that if you don't decode the text into Perl's internal form, Perl assumes the bytes are Latin-1.
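A minimal sketch of what I mean, in plain Perl with the core Encode module. The octets here are just an illustrative stand-in for what a percent-decoded URI might hand back; nothing in it is specific to PL/Perl:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two octets: the UTF-8 encoding of U+00E9 (e with acute accent).
# Illustrative input only, standing in for undecoded URI data.
my $octets = "\xC3\xA9";

# Without decoding, Perl treats each octet as one character
# (code points 0xC3 and 0xA9, i.e. a Latin-1 reading).
my $undecoded_length = length $octets;

# After decoding, Perl holds a single character, U+00E9.
my $chars          = decode('UTF-8', $octets);
my $decoded_length = length $chars;
my $code_point     = ord $chars;

printf "undecoded: %d chars, decoded: %d char (U+%04X)\n",
    $undecoded_length, $decoded_length, $code_point;
```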
> In playing around I did find what I think is a postgres bug.  Perl has
> 2 ways it can store things internally.  per perldoc perlunicode:
> 
> Using Unicode in XS
> ... What the "UTF8" flag means is that the sequence of octets in the
> representation of the scalar is the sequence of UTF-8 encoded code
> points of the characters of a string.  The "UTF8" flag being off means
> that each octet in this representation encodes a single character with
> code point 0..255 within the string.
> 
> Postgres always prints whatever the internal representation happens to
> be ignoring the UTF8 flag and the server encoding.
> 
> # create or replace function chr(i int, i2 int) returns text as $$
> return chr($_[0]).chr($_[1]); $$ language plperlu;
> CREATE FUNCTION
> 
> # show server_encoding;
> server_encoding
> -----------------
> SQL_ASCII
> 
> # SELECT length(chr(128, 33));
> length
> --------
>      2
> 
> # SELECT length(chr(128, 333));
> length
> --------
>      4
> 
> Grr that should error out with "Invalid server encoding", or worst
> case should return a length of 3 (it utf8 encoded 128 into 2 bytes
> instead of leaving it as 1).  In this case the 333 causes perl to store
> it internally as utf8.
Well with SQL_ASCII anything goes, no?
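For reference, the two lengths quoted above can be reproduced in plain Perl by looking at the internal buffers directly. This is only a sketch of the Perl side; it assumes nothing about what plperl.c does with the result:

```perl
use strict;
use warnings;

# Both code points fit in one octet each, so Perl keeps the UTF8 flag off.
my $narrow = chr(128) . chr(33);

# Code point 333 does not fit in one octet, so the whole string is
# stored internally as UTF-8, and 128 now takes two octets as well.
my $wide = chr(128) . chr(333);

my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;

# "use bytes" makes length() count octets of the internal buffer
# rather than characters.
my $narrow_bytes = do { use bytes; length $narrow };
my $wide_bytes   = do { use bytes; length $wide };

print "narrow: flag=$narrow_flag bytes=$narrow_bytes\n";
print "wide:   flag=$wide_flag bytes=$wide_bytes\n";
```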
> Now on a utf8 database:
> 
> # show server_encoding;
> server_encoding
> -----------------
> UTF8
> 
> # SELECT length(chr(128, 33));
> ERROR:  invalid byte sequence for encoding "UTF8": 0x80
> CONTEXT:  PL/Perl function "chr"
> 
> # SELECT length(chr(128, 333));
> CONTEXT:  PL/Perl function "chr"
> length
> --------
>      2
> 
> Same thing here, we just end up using the internal format.  In one
> case it works, in the other it does not.  The main point being, most of
> the time it *happens* to work.  But it's really just by chance.
> 
> I think what we should do is use SvPVutf8() when we are UTF8 instead
> of SvPV in sv2text_mbverified().  SvPV gives us a pointer to a string
> in perl's current internal format (maybe unicode, maybe a utf8 byte
> sequence).  While SvPVutf8 will always give us a utf8 (may or may not be
> valid!) encoded string.
> 
> Something like the attached.  Thoughts? I'm not very happy with the non
> utf8 case--  The elog(ERROR, "invalid byte sequence") is a total
> cop-out, yes.  But I did not see a good solution short of hand rolling
> our own version of sv_utf8_downgrade().  Is it worth it?
> <plperl_encoding.patch>
Maybe I'm misunderstanding, but it seems to me that:
* String arguments passed to PL/Perl functions should be decoded from the server encoding to Perl's internal representation before the function actually gets them.
* Values returned from PL/Perl functions that are in Perl's internal representation should be encoded into the server encoding before they're returned.
I didn't really follow all of the above; are you aiming for the same thing?
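To make those two points concrete, here is a rough sketch in plain Perl of the behavior I have in mind. `$server_encoding` and `call_with_boundaries` are hypothetical names made up for illustration; this is not how PL/Perl is implemented, just the decode-on-entry, encode-on-exit shape:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the database's server_encoding setting.
my $server_encoding = 'UTF-8';

# Hypothetical wrapper: decode arguments on the way in, encode the
# return value on the way out, so the function body only ever sees
# Perl character strings.
sub call_with_boundaries {
    my ($code, @raw_args) = @_;
    my @args   = map { decode($server_encoding, $_) } @raw_args;
    my $result = $code->(@args);
    return encode($server_encoding, $result);
}

# The body sees one character (U+00E9) and appends chr(333).
my $seen_length;
my $out = call_with_boundaries(
    sub { $seen_length = length $_[0]; return $_[0] . chr(333) },
    "\xC3\xA9",    # UTF-8 octets for U+00E9, as the server would pass them
);

printf "body saw %d char; returned %d octets\n", $seen_length, length $out;
```

With this shape, a result such as `chr(128) . chr(333)` would come back in the server encoding regardless of how Perl happened to store it internally.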
Best,
David