[Pljava-dev] PL/java kills unicode chars?

From: vatsan(dot)cs at utexas(dot)edu (Srivatsan Ramanujam)
Subject: [Pljava-dev] PL/java kills unicode chars?
Date: 2013-08-14 01:13:29
Message-ID: CAHEGxbNQvw4VD00Otfbo0u+d9JVaMWqBWL3ggx0mHJyRcDuFLg@mail.gmail.com
Lists: pljava-dev

Hi All,

I believe PL/Java is mangling Unicode characters (it is probably converting
text to a byte stream and reading it back as single-byte characters, perhaps
as Latin-1 rather than UTF-8). I don't observe this with PL/Python or
PL/R.
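
To make the hypothesis concrete, here is a small standalone sketch of what "reading UTF-8 bytes as Latin-1" would do to a string. It is not PL/Java code, and the class and method names are made up for illustration. Under this assumption every multi-byte character would come back as several single-byte characters, so the string would get longer rather than shorter.

```java
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Hypothetical helper: encode as UTF-8, then decode the same bytes
    // as Latin-1, which maps every byte to exactly one char.
    static String misdecodeAsLatin1(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "cafe" with an accented e, a space, and the emoticon U+1F600
        String original = "caf\u00e9 \uD83D\uDE00";
        String garbled = misdecodeAsLatin1(original);
        System.out.println("chars before: " + original.length());
        System.out.println("chars after:  " + garbled.length());
    }
}
```

Comparing the two lengths for a known input is a quick way to tell this kind of mis-decoding apart from characters being dropped outright.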

I basically have a record which looks like the attached image
*(input_record.tiff)*.

When I invoke a PL/Java function to simply read this input text field and
return it as is, I notice that the Unicode characters are lost. The
attached image *(output_record.tiff)* shows the result.

Here is my PL/Java function and the corresponding Java snippet.

*PL/Java Function*

drop function if exists demo.returnString(text) cascade;
create function demo.returnString(text)
returns text
as
'demopkg.Example.returnString'
immutable language pljavau;

*Java Snippet (in class demopkg.Example)*
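import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public static String returnString(String tweet) {
    if (tweet == null) {
        return null;
    }
    Writer writer = null;

    try {
        // Dump the argument as UTF-8 to see what PL/Java actually passed in.
        writer = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("/tmp/pljava_out.txt"), "utf-8"));
        writer.write("Tweet\n");
        writer.write(tweet);
    } catch (IOException ex) {
        // report
    } finally {
        // writer is still null if the file could not be opened
        if (writer != null) {
            try { writer.close(); } catch (IOException ex) { /* ignore */ }
        }
    }
    return tweet;
}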

*Here is how I am invoking the function from SQL*

select tweet_body,
demo.returnString(tweet_body) pljava_result,
demo.dummy(tweet_body) plpython_result
from demo.training_data
where id = 'tag:search.twitter.com,2005:356830788370706433'

The file that I am writing to in the Java code
(/tmp/pljava_out.txt) shows that the Unicode characters have already been lost
(I don't see the emoticons in the file). So the error occurs even
before the function returns, perhaps during the conversion from the Postgres
type to the Java type.
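
One way to narrow this down further would be to look at the exact code points that arrive on the Java side, instead of relying on how a file viewer renders them. The following is only a sketch: the class and method names are hypothetical, and it would have to be registered as a PL/Java function in the same way as returnString above.

```java
public class CodePointDump {
    // Render each Unicode code point of s as U+XXXX, separated by spaces.
    // Walks by code point so a surrogate pair is reported as one entry.
    public static String dumpCodePoints(String s) {
        if (s == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

If the emoticons reach Java intact, they should appear in the output as code points above U+FFFF. If they show up as something else, or not at all, the damage happened before the function body ran.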

*Things I have tried to debug*
1) My database encoding is UTF-8.
2) The "file.encoding" property also returns UTF-8 (when I invoke a
PL/Java function to return the property).
3) The "LOCALE" setting on my machine is also UTF-8.
4) The problem only occurs with PL/Java (PL/Python and PL/R return the
string correctly).
5) I searched Google extensively, but found only one other user
who has reported this, and there is no resolution. Here is the related thread:
http://lists.pgfoundry.org/pipermail/pljava-dev/2008/001385.html
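
For reference, the check in item 2 can be done with something along these lines. This is a sketch rather than the exact code I ran: the class and method names are placeholders, and it also reports Charset.defaultCharset(), which is what the JVM uses when no encoding is given explicitly.

```java
import java.nio.charset.Charset;

public class EncodingInfo {
    // Report the JVM's view of the default encoding, as seen from
    // inside the backend process when called through PL/Java.
    public static String encodingInfo() {
        return "file.encoding=" + System.getProperty("file.encoding")
             + ", defaultCharset=" + Charset.defaultCharset().name();
    }
}
```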

Any pointers? I'm thinking that as a last resort one work-around would be
to pass the string as a bytea and decode it as UTF-8 within the Java
code. I'm not sure if that will work, and it looks like a terrible
work-around even before I attempt it.
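
A minimal sketch of that work-around, untested, is below. It assumes that PL/Java hands a bytea argument to Java as a byte[] and that the caller converts the column on the SQL side, for example with convert_to(tweet_body, 'UTF8'). The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class ByteaWorkaround {
    // Decode the raw bytes explicitly as UTF-8 instead of relying on
    // whatever conversion is applied to text arguments.
    public static String returnStringFromBytes(byte[] raw) {
        if (raw == null) {
            return null;
        }
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

Note that this only covers the input direction. If the function still returns a String as text, the return value goes through the same type conversion on the way back, so it may be affected too.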

Thank you,
Vatsan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: input_record.tiff
Type: image/tiff
Size: 21024 bytes
Desc: not available
URL: <http://lists.pgfoundry.org/pipermail/pljava-dev/attachments/20130813/e02810ff/attachment-0002.tiff>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: output_record.tiff
Type: image/tiff
Size: 26904 bytes
Desc: not available
URL: <http://lists.pgfoundry.org/pipermail/pljava-dev/attachments/20130813/e02810ff/attachment-0003.tiff>
