Re: [Pljava-dev] Arrays support for object views

From: Chapman Flack <chap(at)anastigmatix(dot)net>
To: Kris Jurka <books(at)ejurka(dot)com>, Krzysztof Nienartowicz <yazuna(at)gmail(dot)com>, "pljava-dev(at)lists(dot)postgresql(dot)org" <pljava-dev(at)lists(dot)postgresql(dot)org>
Subject: Re: [Pljava-dev] Arrays support for object views
Date: 2020-10-06 03:25:07
Message-ID: 5F7BE393.4010005@anastigmatix.net
Lists: pljava-dev

On 09/17/20 20:20, Chapman Flack wrote:
> I have tried to touch it as little as possible. A good deal of refactoring
> in 1.6.x is happening to make it more thinkable to touch the type workings,
> but actually revamping the types is planned to happen later. When that
> happens, I would like to have a PostgreSQL-specific class that exposes all
> the features of PostgreSQL arrays: multidimensionality, arbitrary starting
> index, the works.
>
>
> For 1.5.x and 1.6.x, I will consider the null-not-accepted problem and
> the "char"[] value corruption to be bugs, and look at whether they will
> be easy to fix. I think multidimensionality is more than I want to bite
> off at the moment;

1.5.6 was released last night and resolves the null issue, the "char"[]
value issue, and the must-use-Java's-boxed-array issue.

For a PostgreSQL int4[], for example, the plain ResultSet.getObject(int)
will return a boxed java.lang.Integer[] array, which may contain nulls.

If you want a primitive array, use the getObject overload that takes
an extra Class<?> argument and pass int[].class; you'll get back an int[].
Any nulls in the input become zero in the int[], with no way to
distinguish them from true zero values. (In the case of a boolean
array, nulls map to false.)
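
The null-to-zero mapping can be illustrated in plain Java; this
toPrimitive helper is hypothetical, just mirroring what the
getObject(col, int[].class) path is described as doing:

```java
import java.util.Arrays;

public class NullMapping {
    // Hypothetical helper mirroring the documented behavior:
    // null elements in the boxed array become 0 in the primitive one.
    static int[] toPrimitive(Integer[] boxed) {
        int[] result = new int[boxed.length];
        for (int i = 0; i < boxed.length; i++)
            result[i] = (boxed[i] == null) ? 0 : boxed[i];
        return result;
    }

    public static void main(String[] args) {
        Integer[] boxed = { 1, null, 3, 0 };
        System.out.println(Arrays.toString(toPrimitive(boxed))); // [1, 0, 3, 0]
        // A null and a true zero are now indistinguishable.
    }
}
```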

Either a boxed or a primitive array can be passed to updateObject. If a
primitive array, none of its elements will be treated as null.

A multidimensional PostgreSQL array will become a one-dimensional Java
array with the same number of elements. In the other direction, only a
one-dimensional Java array is accepted, and will produce a one-dimensional
PostgreSQL array.

So, that's the support we've got now. Krzysztof expressed some interest
in improving that. I would eventually like to have an interface that
exposes all the properties of PostgreSQL arrays:

* They can have any number of dimensions
* They do not have to be indexed from 0 or from 1; the start index and
size can be specified for each dimension independently
* Elements can be null

Internally, they are stored with a null bitmap (if nulls are allowed)
followed by a linear sequence of the non-null values. When there is no
null bitmap, the offset of the desired element is a straightforward
calculation from the indices; if there is a null bitmap, it has to
be consulted to determine not just whether the wanted element is null,
but also, if it is not, how many intervening elements are null, so that
count can be subtracted from the offset.
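
As a rough sketch of that lookup (not PL/Java's actual code; it assumes
a bitmap where bit i set means element i is non-null, which matches
PostgreSQL's convention):

```java
public class NullBitmapOffset {
    /*
     * Sketch of the computation described above (not PL/Java's code).
     * bitmap: bit i set means element i is non-null.
     * Returns the position of element linearIndex within the packed
     * sequence of non-null values, or -1 if that element is null.
     */
    static int valuePosition(byte[] bitmap, int linearIndex) {
        if ((bitmap[linearIndex / 8] & (1 << (linearIndex % 8))) == 0)
            return -1; // the wanted element is null
        int nullsBefore = 0;
        for (int i = 0; i < linearIndex; i++)
            if ((bitmap[i / 8] & (1 << (i % 8))) == 0)
                nullsBefore++;
        // only non-null values are stored, so skip the null ones
        return linearIndex - nullsBefore;
    }

    public static void main(String[] args) {
        // Elements {10, null, 20, null, 30} -> bitmap bits 10101 = 0x15
        byte[] bitmap = { 0x15 };
        System.out.println(valuePosition(bitmap, 4)); // prints 2
        System.out.println(valuePosition(bitmap, 1)); // prints -1 (null)
    }
}
```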

The most interesting challenge is to design a Java API that will
expose that type in the most useful way.

JDBC offers the java.sql.Array interface, but its design assumes
just one-dimensional arrays, indexed from 1. It has methods to
retrieve the values, or a slice of the values, either as a Java
array or as a ResultSet (two columns, the index and the value).

The pgjdbc-ng driver offers a PGArray class that does not really
go beyond implementing java.sql.Array. It has some clever behavior
to present you a PostgreSQL array with multiple dimensions: if
you call getArray(), it gives you a multidimensional Java array,
and if you call getResultSet(), it gives you a ResultSet whose
values are themselves PGArray instances.

Still, the only way offered to create and populate an Array instance
to pass /to/ PostgreSQL is with the Connection.createArrayOf() method,
which assumes one dimension and requires all of the values in the form
of a Java array at once. And for primitives, only the boxed array can
be given. (PGArray has an accessible constructor, but that works
about the same way.)

Experimenting with pgjdbc-ng, I see that if the array you pass to
createArrayOf is a multidimensional Java array, it will make a
multidimensional PostgreSQL array. I get the impression they kind of
shoehorned that into an interface that never really anticipated it.
Still, I think it would be better to be able to specify the dimensions
and index bases, and whether nulls are allowed, and have the option of
supplying values all at once or over subsequent calls, maybe a slice
at a time.

The older pgjdbc driver has a similar class, PgArray (small g), again
little changed from java.sql.Array. I see that it, too, will accept
a multidimensional Java array and do the right thing, but again, it
needs the whole thing at once. pgjdbc's PGConnection has an overload
of createArrayOf that takes Object rather than Object[], so that
primitive arrays can be passed.

There seems to be a wide-open field for designing an API that presents
PostgreSQL's array capabilities nicely.
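
Just to make the idea concrete, a purely hypothetical interface (none
of these names exist in PL/Java or either driver today) exposing the
three properties listed above might look something like:

```java
// Hypothetical sketch only; all names here are made up.
public interface ArraySketch {
    int dimensions();               // any number of dimensions
    int lowerBound(int dimension);  // need not be 0 or 1
    int length(int dimension);      // size per dimension, independently
    Object get(int... index);       // may return null for null elements

    // A trivial one-dimensional in-memory example implementation.
    static ArraySketch oneDim(Object[] values, int lowerBound) {
        return new ArraySketch() {
            public int dimensions() { return 1; }
            public int lowerBound(int d) { return lowerBound; }
            public int length(int d) { return values.length; }
            public Object get(int... index) {
                return values[index[0] - lowerBound];
            }
        };
    }
}
```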

So much for designing the API to expose. As for doing the marshalling/
unmarshalling, PL/Java can provide you a VarlenaWrapper.Input over an
existing PostgreSQL array, and a VarlenaWrapper.Output for supplying
an array to PostgreSQL.

(I will have to add that ability, and control it with a permission, as
it will be a footgun, but those classes exist now.)

VarlenaWrapper.Input is little more than a java.nio.ByteBuffer over
the array in PostgreSQL native memory. Using the various primitive-typed
get... methods of ByteBuffer, you can do the index->offset computations
and read elements freely. That's easy, and you never even have to copy
the thing into Java's heap if you don't want to.
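
For a no-nulls int4[], the element read is then just arithmetic on the
buffer. A heap ByteBuffer stands in here for the one over native memory,
and the array header is glossed over (dataOffset is a made-up stand-in
for wherever the element data actually starts):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BufferRead {
    public static void main(String[] args) {
        // Heap buffer standing in for the native-memory one; pretend a
        // 4-byte header precedes the elements (the real header differs).
        ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.nativeOrder());
        int dataOffset = 4;
        buf.putInt(dataOffset + 0, 7);
        buf.putInt(dataOffset + 4, 8);
        buf.putInt(dataOffset + 8, 9);

        // index -> offset, then an absolute get: no copy into Java's heap
        int i = 2;
        System.out.println(buf.getInt(dataOffset + 4 * i)); // prints 9
    }
}
```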

One complication: the java.nio.ByteBuffer API does not make any provision
for invalidating a ByteBuffer. If you have one, you might try to read from
it, even after PostgreSQL has reused that memory. VarlenaWrapper.Input
has pin and unpin methods, and you are on your honor never to read from
the buffer unless you have pinned it at the time, nor to hold it pinned
forever. If you try to pin it after the native object has gone away,
pin() will throw an appropriate exception for you.
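
The honor system reads naturally as a try/finally. The ToyInput class
below is a stand-in, only to show the usage pattern; the real
VarlenaWrapper.Input's signatures may differ:

```java
import java.nio.ByteBuffer;
import java.sql.SQLException;

// Toy stand-in for VarlenaWrapper.Input, just to show the pattern.
class ToyInput {
    private boolean valid = true;
    private final ByteBuffer buf = ByteBuffer.allocate(8).putInt(0, 42);

    void pin() throws SQLException {
        if (!valid) throw new SQLException("native object has gone away");
    }
    void unpin() { }
    void invalidate() { valid = false; } // as if PostgreSQL freed the memory
    ByteBuffer buffer() { return buf; }
}

public class PinPattern {
    static int readFirst(ToyInput in) throws SQLException {
        in.pin();           // never touch the buffer unless pinned
        try {
            return in.buffer().getInt(0);
        } finally {
            in.unpin();     // and never hold the pin forever
        }
    }

    public static void main(String[] args) throws SQLException {
        ToyInput in = new ToyInput();
        System.out.println(readFirst(in)); // prints 42
        in.invalidate();
        try {
            readFirst(in);
        } catch (SQLException e) {
            System.out.println("pin() threw after invalidation");
        }
    }
}
```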

VarlenaWrapper.Output looks, at present, like an OutputStream that can
only be written sequentially, again accumulating in PostgreSQL's memory.

That was all that was needed to implement SQLXML, but it might not be
perfect for arrays. It is fine for an array with no nulls; if you can
be fed the elements in sequence you can write them straight out to the
stream and probably have a marshaller written before lunch.

For an array with nulls, because the bitmap is toward the front,
being limited to sequential writing means you would need to know
where all the nulls were before beginning to write values. That
could lead to an awkward API.

The restriction to sequential writing was partly to support the notion
of a VarlenaWrapper.Verifier, a thread that reads the content as you are
writing it to verify the format. That is needed to implement an API like
SQLXML (which allows the caller to supply a String or character stream,
and therefore has to verify it is XML). Clearly it would be no good to
write something acceptable to the verifier, then back up and change it.
Sequential writing also allows the implementation to keep only a small
ByteBuffer windowing the region of the native object currently being
written.

An array marshaller would presumably only write valid arrays and not
offer the caller a "write raw bytes" option needing a verifier. So
it could use the NoOp verifier, and I might be able to relax the
behavior of VarlenaWrapper.Output when the NoOp verifier is used.
Short of full random access, maybe the ability to 'bookmark' a certain
region while writing, such as the null bitmap, and return to it later
would be enough. I could be talked into adding that.

Krzysztof, I think I have a better understanding now of what you
might be doing, having found your 2014 PGCon slides. That was only
shortly after I got involved with PL/Java and I missed your presentation
at the time.

When you say you have tens of terabytes of these arrays, how big
are the individual arrays? Is it a huge number of small arrays?
Are the individual arrays large enough that PG's toaster compresses
them? Or are they below that threshold but packed into larger tuples
that will get toasted? Or are the tuples also smaller than that
threshold, but there are staggering numbers of them?

I'm just, at this point, kind of brainstorming what would be useful
capabilities to have.

Are you in much of a hurry? 1.5.6 is released now and will run on
Java 8, but if you are willing to use Java >= 9 and wait a little
for 1.6.x, that will make many things easier. I may want to get
1.6.0 out the door (shortly), without provisions for this, and
then shoot for a 1.6.1 that will offer a way to expose the
VarlenaWrapper functionality for you.

This does all align with refactoring I have already been wanting to do,
so it could be very useful to have someone with an application ready
to pound on it some.

Regards,
-Chap
