Re: PL/Java new build plugin

From: Chapman Flack <chap(at)anastigmatix(dot)net>
To: Kartik Ohri <kartikohri13(at)gmail(dot)com>
Cc: pljava-dev(at)lists(dot)postgresql(dot)org
Subject: Re: PL/Java new build plugin
Date: 2020-07-09 20:08:06
Message-ID: 5F077926.2050806@anastigmatix.net
Lists: pljava-dev

On 07/09/20 14:45, Kartik Ohri wrote:
> Meanwhile, I tried to eliminate the properties file and encountered the
> include path not found issue in nar-maven plugin. It took some time to find
> the issue but I was able to find the issue. It was that the pg_config
> appended a newline character to its output which was being added to the

Ah! Good catch. Yes, in your earlier approach with readLine(), that
terminator would have been eaten already.

It turns out [1] pg_config always ends the line with a single \n (no
matter the platform) so that's always what to remove. (Java's readLine
could have caused a related sort of problem too; it would be unlikely
for the values we care about in this case to end with \r, but if one
did, readLine would see the \r followed by the line-ending \n and eat both.)
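
To make that concrete, here is a minimal sketch of removing exactly that
one terminator and nothing else. The class and method names are only
illustrative, not anything in the existing build, and it assumes the
pg_config output has already been read into a String. Unlike trim(), it
leaves a trailing \r (or any other whitespace) in the value alone.

```java
final class PgConfigOutput {
    /**
     * Remove exactly one trailing '\n' if present. Anything else at the
     * end of the value, including a '\r', is left untouched.
     * (Hypothetical helper, named here only for illustration.)
     */
    public static String stripLineEnding(String raw) {
        if (raw.endsWith("\n")) {
            return raw.substring(0, raw.length() - 1);
        }
        return raw;
    }

    public static void main(String[] args) {
        // Example value standing in for what pg_config --includedir might print.
        String value = stripLineEnding("/usr/include/postgresql\n");
        System.out.println("[" + value + "]");
    }
}
```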

> Moving on, I tried to eliminate the CDATA regex in the pom.xml. I have come
> up with a Java version. There are a few caveats like hex escaping is not
> supported in Java so these need to be converted to Unicode escaping and so
> on. But I do not know how to test the regex and what the purpose of
> libjvmdefault is here. Can you explain it and tell how to test this regex
> escaping code ?

I assume what you mean by a Java version is a Java version of the JavaScript
quoteStringForC() function, as the regex it's using is already built on
java.util.regex, so should not need much/any revision. (Strictly, the string
literals being passed to compile() might have to be checked for any
differences between JS and Java string-literal syntax, but nothing is
jumping out at me.)

There are some regrettable subtleties in the pattern because of differences
between what escapes a string-literal understands and what escapes a
java.util.regex.Pattern understands: you can see in capture group 2 that
\a\f\n\r\t\x0B are all spelled with double-backslashes, which means they
survive the string literal (the double backslashes becoming single ones)
and the regex compiler literally sees \a\f\n\r\t\x0B and it understands all
those forms so that's fine, but the \b does not have its backslash doubled.
So the string literal turns it immediately into a real backspace character,
which is what the regex compiler sees. \b in the regex language does have
a meaning but it's different: it means a word boundary, not a backspace.
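
The difference is easy to see in isolation. This is only a small
demonstration with made-up names, not a piece of the pom.xml pattern:
the same two letters compile to two different patterns depending on
whether the backslash is doubled in the string literal.

```java
import java.util.regex.Pattern;

final class BackspaceVsBoundary {
    // "\b" is turned into a real backspace (U+0008) by the Java compiler,
    // so the regex compiler sees that one raw character and matches it literally.
    static final Pattern BACKSPACE = Pattern.compile("\b");

    // "\\b" survives the string literal as backslash + b, which the
    // regex compiler reads as a word boundary.
    static final Pattern WORD_BOUNDARY = Pattern.compile("\\b");

    public static void main(String[] args) {
        System.out.println(BACKSPACE.matcher("\b").matches());    // literal backspace
        System.out.println(BACKSPACE.matcher("ab").find());       // no backspace in "ab"
        System.out.println(WORD_BOUNDARY.matcher("ab").find());   // boundary before 'a'
        System.out.println(WORD_BOUNDARY.matcher("\b").find());   // no word characters at all
    }
}
```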

Of course the purpose of quoteStringForC is "you give me an arbitrary
string, I'll give you the exact C language string literal that you need
for building that string into a C program." So the output it produces
has to be correct according to the rules of C.

I wanted to make it as clear as I could by breaking the regex into three
lines (capture group 1, group 2, and groups 3,4 which go together) and
organizing the function logic the same way.

Group 1 takes care of the things that only need a \ in front to be escaped
for C: \ itself, ", and a ? if it could look like part of a C trigraph [2].
Really, C will turn \? to ? anywhere, but to avoid cluttering the string
with backslashes, it is only necessary to find sequences like ??[=(/)'<!>-]
and backslash-escape just one of the ? characters, to prevent the sequence
being seen as a trigraph.

Group 2 is for characters known by special escape sequences; most of those
are the same in C and Java, but C has \a and \v and Java doesn't (though
java.util.regex does know \a), so those have to be spelled differently
for Java (Unicode or octal would work).

Group 3 just catches any other control character and turns it into a C
\xnn escape. Group 4 is only to detect if the control character is
followed by a hexdigit. That's necessary because C's \x escape will keep
eating as many hexdigits as it sees, so you can't just turn a character
into a \xnn if what follows it is also a hexdigit. You can turn it into
\xnn"" which essentially ends one string literal and starts another one.
C always joins together adjacent string literals, so that solves the
problem.
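
For testing purposes, the same rules can be written as a plain loop to
compare results against. This is a rough sketch, not the regex-based
function from the pom.xml: the class name is made up, the treatment of
DEL (0x7F) as a control character is my assumption, and non-ASCII
characters are simply passed through. For trigraphs it escapes the
second ?, since the two ? characters have to be separated in the C
source for the sequence not to be recognized.

```java
final class CQuote {
    // Third characters that complete a C trigraph after "??".
    private static final String TRIGRAPH_THIRD = "=(/)'<!>-";

    private static boolean isHexDigit(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    /** Return s as a C string literal, including the surrounding quotes. */
    public static String quoteStringForC(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            char next = i + 1 < s.length() ? s.charAt(i + 1) : '\0';
            switch (c) {
                // "Group 1": characters that only need a backslash in front.
                case '\\': case '"':
                    out.append('\\').append(c);
                    break;
                case '?':
                    // Escape only a ? that would otherwise sit in the middle of ??x.
                    boolean midTrigraph = i > 0 && s.charAt(i - 1) == '?'
                        && i + 1 < s.length() && TRIGRAPH_THIRD.indexOf(next) >= 0;
                    out.append(midTrigraph ? "\\?" : "?");
                    break;
                // "Group 2": characters with their own C escape sequences.
                case 0x07: out.append("\\a"); break;   // Java has no '\a' literal
                case '\b': out.append("\\b"); break;
                case '\f': out.append("\\f"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                case 0x0B: out.append("\\v"); break;   // Java has no '\v' literal
                default:
                    if (c < 0x20 || c == 0x7F) {
                        // "Group 3": any other control character becomes \xnn.
                        out.append(String.format("\\x%02x", (int) c));
                        // "Group 4": a following hex digit would be swallowed by
                        // \x, so end this literal and start an adjacent one.
                        if (isHexDigit(next)) {
                            out.append("\"\"");
                        }
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteStringForC("C:\\Program Files\\Java\\jvm.dll"));
        System.out.println(quoteStringForC("what??!"));
    }
}
```

Feeding it strings with a control character followed by a hexdigit, a
backslash-heavy Windows path, and a ??x sequence exercises each of the
groups at least once.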

While general-purpose, the first reason to have a quoteStringForC function
was so a person building PL/Java could add -Dpljava.libjvmdefault=some_path
on the Maven command line, and have "some_path" properly escaped be built
into the C code so it becomes the default for pljava.libjvm_location and
there's no need to fuss with that GUC in PostgreSQL to make PL/Java work,

... which ends up, on Windows, being a challenging test of how reliably
Java passes complicated argument values to the C compiler process ... a
test the existing ProcessBuilder implementation does not pass.

Regards,
-Chap

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/pg_config/pg_config.c;h=5279e7f#l124

[2] https://en.wikibooks.org/wiki/C_Programming/C_trigraph
