Re: [Pljava-dev] Issue 21 Re: PL/java kills unicode chars?

From: Chapman Flack <chap(at)anastigmatix(dot)net>
To:
Subject: Re: [Pljava-dev] Issue 21 Re: PL/java kills unicode chars?
Date: 2015-09-24 23:34:06
Message-ID: 5604886E.40104@anastigmatix.net
Lists: pljava-dev

Thomas Hallgren wrote:
> I'm not sure what you're asking here. It is always OK to use methods in
> the Java library without a thread lock provided there's absolutely no
> chance that any new threads that they might start call back into the
> backend. Don't see how any of the methods you mention could do that.

Great, so far so good, that was a big part of my question. (Not purely
whether it was _technically_ ok, which is a question I could answer,
but also whether it would be satisfactory to you under the coding
conventions you have for PL/Java.)

Right now, all calls into the JNI seem to go through the wrapper methods
in JNICalls.c, which at the very least all play the BEGIN_JAVA/END_JAVA
shell-game with the JNIEnv pointer ... and if they are method invocations,
they further do the exit and reentry of the monitor. So if I wanted
to do more lightweight invocations only for charset encode/decode methods,
it looks as if somewhere (maybe in JNICalls.c) I would need to add
alternate method-invocation wrappers that do most of the same work but
leave out the monitor operations. I didn't know whether you would
consider that undesirable.

> In order to advise on a good solution I'd like to first know exactly
> what the problem and its possible solutions are.

Well, so at bottom the problem is PL/Java treating the JNI functions
NewStringUTF and GetStringUTFChars as if they really meant UTF-8.
In fact they use the JNI "modified UTF-8", which is much closer to
CESU-8 (except for also having the NUL hack, so it's not quite that
either).
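
To make the difference concrete, here is a minimal sketch (mine, not
anything in PL/Java) that encodes the same strings both ways from the
Java side. It assumes DataOutputStream.writeUTF is an acceptable stand-in
for what the JNI functions produce, since both are documented as modified
UTF-8; the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of PL/Java.
class ModifiedUtf8Demo {
    // Standard UTF-8, as produced by the String API.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF, with the
    // two-byte length prefix that writeUTF prepends stripped off.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600)); // outside plane 0
        String nul = String.valueOf((char) 0);
        System.out.println(hex(utf8(emoji)));         // F0 9F 98 80
        System.out.println(hex(modifiedUtf8(emoji))); // ED A0 BD ED B8 80
        System.out.println(hex(utf8(nul)));           // 00
        System.out.println(hex(modifiedUtf8(nul)));   // C0 80
    }
}
```

The six-byte form is the surrogate pair encoded as two three-byte
sequences, and the C0 80 for NUL is the hack that keeps the result from
being plain CESU-8.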

By contrast, when you ask PostgreSQL for UTF-8, as with
pg_do_encoding_conversion( ... PG_UTF8 ... ), you get real, true,
no-funny-business UTF-8. The trouble is, that's not quite what should
be exchanged with JNI.

Nobody ever used to notice, back in the day when there weren't any
characters anyone cared about outside the first 64k Unicode plane.
But these days there are, as the person reporting issue 21 found out,
and that's where the problem shows up. As the test code I added to
issue 21 shows, all 16 other 64k character code blocks above block 0
get corrupted in a round trip from PL/Java through Java ... because UTF-8
is not CESU-8.
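
The mismatch can be reproduced without PostgreSQL at all. The sketch
below is an illustration only: it takes true UTF-8 bytes and hands them
to DataInputStream.readUTF, used here as a strict modified-UTF-8 decoder.
That is an assumption standing in for the JNI side; what a given JVM's
NewStringUTF actually does with a four-byte sequence may differ (garbage
rather than a clean error). The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PL/Java.
class MismatchDemo {
    // Decode raw bytes as modified UTF-8; returns null if they are not
    // valid modified UTF-8. readUTF expects a two-byte length prefix,
    // so one is added here (raw must be shorter than 65536 bytes).
    static String decodeAsModifiedUtf8(byte[] raw) {
        byte[] framed = new byte[raw.length + 2];
        framed[0] = (byte) (raw.length >>> 8);
        framed[1] = (byte) raw.length;
        System.arraycopy(raw, 0, framed, 2, raw.length);
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(framed))) {
            return in.readUTF();
        } catch (UTFDataFormatException e) {
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if the code point comes back intact when its real UTF-8
    // encoding is decoded as modified UTF-8.
    static boolean survives(int codePoint) {
        String original = new String(Character.toChars(codePoint));
        byte[] trueUtf8 = original.getBytes(StandardCharsets.UTF_8);
        return original.equals(decodeAsModifiedUtf8(trueUtf8));
    }

    public static void main(String[] args) {
        // One sample code point from each of the 17 planes.
        for (int plane = 0; plane <= 16; plane++) {
            int cp = (plane << 16) | 0x41;
            System.out.printf("plane %2d U+%06X survives: %b%n",
                              plane, cp, survives(cp));
        }
    }
}
```

Only the plane 0 sample survives; every sample from planes 1 through 16
is rejected, because its real UTF-8 form is a four-byte sequence that
modified UTF-8 does not have.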

Possible solutions fall in some basic classes:

I. Use real UTF-8 consistently. Call the pg_ routine to get UTF-8, as
currently done in String.c, but don't use the JNI NewStringUTF/
GetStringUTFChars (which really speak modified UTF-8, close to CESU-8);
instead use the java.lang.String byte[] constructor and getBytes method
with an explicit UTF-8 charset, because that really *is* UTF-8.
Lower-level techniques like using the java.nio.charset UTF-8 converters
directly, to try to reduce allocations and copies, I would also consider
in Class I.
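
On the Java side, the simple form of class I is just the following
sketch. The helper names are hypothetical; from C, the same constructor
and method would be reached through JNI on a byte[] built with
NewByteArray and SetByteArrayRegion, which is my assumption about how the
call would be wired up rather than existing PL/Java code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of PL/Java.
class Utf8RoundTrip {
    // Real UTF-8 bytes (e.g. from pg_do_encoding_conversion) to a String.
    static String fromUtf8(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // A String back to real UTF-8 bytes.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+1F600 in real UTF-8: a single four-byte sequence.
        byte[] in = { (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80 };
        String s = fromUtf8(in);
        System.out.println(Integer.toHexString(s.codePointAt(0)));
        System.out.println(Arrays.equals(in, toUtf8(s)));
    }
}
```

The cost is an extra byte[] allocation and copy in each direction, which
is what the lower-level converters are meant to reduce.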

II. Use real CESU-8 consistently. Write a CESU-8 codec that can fit
into pg's converter scheme so you could just ask for that encoding from
pg and pass it straight to the JNI routines. Nice idea, but brick walls:
PG has a *hard-coded* set of encodings it knows. Funnily enough, it has
an API that lets you define new converters *between* encodings ... as long
as you are only talking about the encodings it already knows. Also, it
seems to have hardcoded assumptions that no encoding can make more than
4 bytes from a codepoint, where CESU-8 can sometimes make 6. So class II
seems to be impractical.

III. Use real UTF-16 consistently. That's the other encoding that has direct
JNI support in and out of strings, and it is likely the JVM's internal
representation, so if there were a pg_ conversion to get from the server
encoding to UTF-16, the JNI calls in and out of strings should Just Work,
and quite possibly would work without incurring an extra conversion. Nice
work if you can get it, but there isn't a UTF-16 codec in pg, so it ends
up the same place as class II.
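
For reference, this is what the UTF-16 view looks like from Java: a
String is a sequence of 16-bit code units, and a character outside
plane 0 is a surrogate pair. The sketch is only an illustration with
made-up names; the JNI functions that exchange these units directly are
NewString and GetStringChars.

```java
// Hypothetical demo class, not part of PL/Java.
class Utf16Units {
    // The UTF-16 code units for one code point: one unit for plane 0,
    // a surrogate pair for anything above it.
    static char[] utf16Units(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x41, 0x20AC, 0x1F600 }) {
            StringBuilder sb = new StringBuilder();
            for (char unit : utf16Units(cp)) {
                sb.append(String.format("%04X ", (int) unit));
            }
            System.out.printf("U+%06X -> %s%n", cp, sb.toString().trim());
        }
    }
}
```

For U+1F600 the units are D83D DE00, which is exactly the pair that
modified UTF-8 then encodes as two three-byte sequences.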

IV. Use real UCS-4 consistently. That's the pg_wchar encoding and there
are functions and macros provided by pg for converting between UTF-8 and
wchar, either character-at-a-time or whole-string-at-a-time. In recent
Javas, String does have a constructor from int[], and a method
codePointAt you can call in a loop, but no int[] getCodePoints().
In pg there is even a table giving foo <-> wchar converters for other
server encodings, so it could be possible to skip the intermediate UTF-8
step and go straight from server encoding to UCS-4 and build the string
from that. What's the problem with that? The only coders you find in the
table are the whole-string-at-a-time ones, and they *don't have any
parameter for output buffer length*! You have to guess, worst-case
expansion, and preallocate the whole thing. (Or allocate at the end of
memory, and catch SIGSEGV....) And the worst-case expansion is a lot
bigger than the average case.
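
The Java half of class IV would look roughly like this sketch: the
int[] constructor going in, and a codePointAt loop coming out. The class
and method names are hypothetical, and the int[] is assumed to hold
UCS-4 values as they would come from the pg_wchar side.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of PL/Java.
class CodePointArrays {
    // Build a String from an array of code points (UCS-4 values).
    static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Extract the code points by looping with codePointAt, stepping one
    // or two chars at a time depending on the code point just read.
    static int[] toCodePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] in = { 0x41, 0x1F600, 0xE9 };
        String s = fromCodePoints(in);
        System.out.println(s.length());  // 4 chars: the middle one is a pair
        System.out.println(Arrays.equals(in, toCodePoints(s)));
    }
}
```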

I've shifted my affections over the last week among all four of those
classes, but I am currently leaning toward class I. (From some googling,
that seems to be a common solution for other Java devs too.) I *am*
interested in using the lower-level methods in the hope of at least
slightly reducing the reallocating and copying involved.
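
As a rough idea of what the lower-level route looks like, here is a
sketch using java.nio.charset coders that are created once and reset per
call, working on caller-supplied buffers. Everything named here is
hypothetical; whether the ByteBuffer would be a direct buffer over native
memory (JNI NewDirectByteBuffer) is a design question this sketch leaves
open.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of PL/Java. Not thread-safe:
// the coders carry state between reset() and flush().
class ReusableUtf8Codec {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);

    private static void check(CoderResult r) {
        if (r.isError()) {
            throw new IllegalArgumentException("bad input: " + r);
        }
    }

    // Decode the remaining bytes of src as real UTF-8.
    String decode(ByteBuffer src) {
        // UTF-8 never yields more UTF-16 units than it has bytes.
        CharBuffer out = CharBuffer.allocate(src.remaining());
        decoder.reset();
        check(decoder.decode(src, out, true));
        check(decoder.flush(out));
        out.flip();
        return out.toString();
    }

    // Encode s into dst as real UTF-8; returns false if dst is too small.
    boolean encode(CharSequence s, ByteBuffer dst) {
        encoder.reset();
        CoderResult r = encoder.encode(CharBuffer.wrap(s), dst, true);
        check(r);
        if (r.isOverflow()) {
            return false;
        }
        return !encoder.flush(dst).isOverflow();
    }

    public static void main(String[] args) {
        ReusableUtf8Codec codec = new ReusableUtf8Codec();
        ByteBuffer buf = ByteBuffer.allocate(16);
        String s = new String(Character.toChars(0x1F600)) + "a";
        System.out.println(codec.encode(s, buf));
        buf.flip();
        System.out.println(codec.decode(buf).equals(s));
    }
}
```

With REPORT as the error action, malformed input surfaces as an error
instead of being silently replaced, which seems the safer default for
data headed into or out of the database.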

Class I could also become a class I' where you find out if the JVM knows
a charset that corresponds directly to the server encoding, and skip the
UTF-8 step. I am not planning to implement that, at least not at first.
In the case of server encoding == utf-8, that essentially happens anyway,
since the pg_do_encoding_conversion... makes itself a no-op in that case.
I assume utf-8 is the usual server encoding these days, so for a first
effort that sounds "good enough". It will still be correct for other
encodings, just slower.
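
For what class I' might involve, the JVM side of the question "do you
know this encoding?" can be asked as in the sketch below. It assumes the
server encoding name has already been mapped to something the JVM might
recognize; that mapping is not shown and the helper name is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of PL/Java.
class ServerCharsetLookup {
    // Return the JVM charset for the given name if the JVM supports it,
    // otherwise fall back to UTF-8 (the plain class I behavior).
    static Charset charsetFor(String encodingName) {
        try {
            if (Charset.isSupported(encodingName)) {
                return Charset.forName(encodingName);
            }
        } catch (IllegalArgumentException e) {
            // Null or syntactically illegal charset name: fall through.
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFor("ISO-8859-1"));
        System.out.println(charsetFor("NO_SUCH_ENCODING"));
    }
}
```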

-Chap
