Re: Using Expanded Objects other than Arrays from plpgsql

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Using Expanded Objects other than Arrays from plpgsql
Date: 2024-10-22 19:33:40
Message-ID: 2062830.1729625620@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com> writes:
> Doing some benchmarking I noticed that when writing some simple graph
> algorithms in plpgsql, my objects were being constantly flattened and
> expanded. Posting my question above to pgsql-general Tom gave me some tips
> and suggested I move the thread here.

> My non-expert summary: plpgsql only optimizes for expanded objects if they
> are arrays. Non array expanded objects get flattened/expanded on every
> use. For large matrices and vectors this is very expensive. Ideally I'd
> like to expand my object, use it throughout the function, return it to
> another function that may use it, and only flatten it when it goes to disk
> or it's completely unavoidable.

Right. My recollection is that when we first invented expanded
objects, we only implemented expanded arrays, and so it seemed
premature to try to define generalized APIs. So that went on the back
burner until we got some more cases to look at. But now we also have
expanded records, and with your use-case as an example of an extension
trying to do similar things, I feel like we have enough examples to
move forward.

As far as the hack we were discussing upthread goes, I realized that
it should test for typlen == -1 not just !typbyval, since the
VARATT_IS_EXTERNAL_EXPANDED_RW test requires that there be a length
word. With that fix and some comment updates, it looks like the
attached. I'm inclined to just go ahead and push that. It won't move
the needle hugely far, but it will help, and it seems simple and safe.

To make more progress, there are basically two areas that we need
to look at:

1. exec_assign_value() has hard-wired knowledge that it's a good idea
to expand an array that's being assigned to a plpgsql variable, and
likewise hard-wired knowledge that calling expand_array() is how to
do that. The bit in plpgsql_exec_function() that we're patching
in the attached is the same thing, but for the initial assignment of
a function input parameter to a plpgsql variable. At the time this
was written I was quite unsure that forcible expansion would be a net
win, but the code is nine years old now and there have been few or no
performance complaints. So maybe it's okay to decide that "always
expand expandable types during assignment" is a suitable policy across
the board, and we don't need to figure out a smarter rule. It sounds
like that'd probably be a win for your application, which gives me
more faith in the idea than I would've had before. That leaves the
problem of how to find the right code to call to do the expansion.
Clearly, we need to attach something to data types to make that
possible. My first thought was to invent a pg_type column (and
CREATE TYPE/ALTER TYPE options, and pg_dump support, yadda yadda)
containing the OID of a function corresponding to expand_array().
However, that work would have to be done again the next time we
think of something we'd like to know about a data type. So now
I'm thinking that we should steal ideas from the "prosupport" API
(see src/include/nodes/supportnodes.h) and invent a concept of a
"type support" function that can handle an extensible set of
different requests. The first one would be to pass back the
address of an expansion function comparable to expand_array(),
if the type supports being converted to an expanded object.
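To make the shape of that idea concrete, here is a self-contained toy in C — emphatically not PostgreSQL code; the request code, type names, and functions are all invented for illustration. It just shows how a single per-type support function can service an extensible set of requests, the way prosupport functions do:

```c
#include <stddef.h>

/* Hypothetical extensible request codes; new requests could be added
 * later without touching the catalogs again. */
typedef enum ToyTypeSupportRequest
{
    TSR_GET_EXPAND_FUNC = 1     /* ask for the type's expansion function */
    /* future requests go here */
} ToyTypeSupportRequest;

/* Signature of an expansion function comparable to expand_array(). */
typedef int (*ToyExpandFunc)(int flat_value);

/* Stand-in "expansion": a real one would build an expanded object
 * from a flat datum. */
static int toy_expand(int flat_value)
{
    return flat_value * 2;      /* pretend this is the expanded form */
}

/* The per-type support function: handles requests it knows about and
 * returns NULL for any it doesn't, so callers can fall back safely. */
void *toy_type_support(int request)
{
    switch (request)
    {
        case TSR_GET_EXPAND_FUNC:
            return (void *) toy_expand;
        default:
            return NULL;        /* unknown request */
    }
}
```

The "return NULL for unrecognized requests" convention is what makes the set of requests extensible without breaking existing support functions.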

2. Most of the performance gold is hidden in deciding when we
can optimize operations on expanded objects that look like

plpgsql_var := f(plpgsql_var, other-parameters);

by passing a R/W rather than R/O expanded pointer to f() and letting
it munge the expanded object in-place. If we fail to do that then
f() has to construct a new expanded object for its result. It's
probably still better off getting a R/O pointer than a flat object,
but nonetheless we fail to avoid a lot of data copying.
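The cost difference between the two calling conventions can be sketched with a toy "expanded object" in C (again not PostgreSQL code; the vector type and append functions are invented): with a R/W pointer the callee mutates in place, while with a R/O pointer it must copy everything into a fresh result object.

```c
#include <stdlib.h>
#include <string.h>

/* Toy "expanded object": a growable int vector. */
typedef struct ToyVec
{
    int    *data;
    size_t  len;
} ToyVec;

/* R/W path: the callee may scribble on the object in place, so no
 * wholesale copy of the existing elements is needed. */
ToyVec *toy_append_rw(ToyVec *v, int x)
{
    v->data = realloc(v->data, (v->len + 1) * sizeof(int));
    v->data[v->len++] = x;
    return v;                   /* same object, mutated in place */
}

/* R/O path: the callee must leave its input untouched, so it builds
 * a new object and copies every existing element into it. */
ToyVec *toy_append_ro(const ToyVec *v, int x)
{
    ToyVec *r = malloc(sizeof(ToyVec));
    r->len = v->len + 1;
    r->data = malloc(r->len * sizeof(int));
    memcpy(r->data, v->data, v->len * sizeof(int));
    r->data[v->len] = x;
    return r;                   /* fresh object; input unchanged */
}
```

For a large object appended to in a loop, the R/O path turns each call into an O(n) copy, which is the copying cost the text is describing.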

The difficulty here is that we do not want to change the normal naive
semantics of such an assignment, in particular "if f() throws an error
then the value of plpgsql_var has not been modified". This means that
we can only optimize when the R/W parameter is to be passed to the
top-level function of the expression (else, some function called later
could throw an error and ruin things). Furthermore, we need a
guarantee from f() that it will not throw an error after modifying
the value.

As things stand, plpgsql has hard-wired knowledge that
array_subscript_handler(), array_append(), and array_prepend()
are safe in that way, but it knows nothing about anything else.
So one route to making things better seems fairly clear: invent a new
prosupport request that asks whether the function is prepared to make
such a guarantee. I wonder though if you can guarantee any such thing
for your functions, when you are relying on a library that's probably
not designed with such a restriction in mind. If this won't work then
we need a new concept.
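The discipline such a guarantee imposes on f() can be illustrated with a toy in C (invented code, not from PostgreSQL or any library): every check that can fail must happen before the first write to the object, so an error leaves the caller's value exactly as it was.

```c
#include <stddef.h>

typedef struct ToyVec
{
    int    data[8];
    size_t len;
} ToyVec;

/* Unsafe shape: mutates before the last failure point, so on error
 * the caller's variable is left half-updated. */
int toy_push_unsafe(ToyVec *v, int x)
{
    v->len++;                   /* mutation first... */
    if (v->len > 8)
        return -1;              /* ...then a possible error: too late */
    v->data[v->len - 1] = x;
    return 0;
}

/* Safe shape: all failure points precede the first mutation, so an
 * error cannot leave the object in a modified state. */
int toy_push_safe(ToyVec *v, int x)
{
    if (v->len >= 8)
        return -1;              /* fail before touching anything */
    v->data[v->len] = x;
    v->len++;
    return 0;
}
```

A library not written with this rule in mind will typically look like the first shape somewhere, which is exactly the doubt raised above about third-party code making the guarantee.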

One idea I was toying with is that it doesn't matter if f()
throws an error so long as the plpgsql function is not executing
within an exception block: if the error propagates out of the plpgsql
function then we no longer care about the value of the variable.
That would very substantially weaken the requirements on how f()
is implemented. I'm not sure that we could make this work across
multiple levels of plpgsql functions (that is, if the value of the
variable ultimately resides in some outer function) but right now
that's not an issue since no plpgsql function as such would ever
promise to be safe, thus it would never receive a R/W pointer to
some outer function's variable.
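One way to condense that reasoning into a decision rule — purely my distillation of the paragraph above, not anything in PostgreSQL — is a predicate like this toy C function:

```c
#include <stdbool.h>

/* Sketch of when a R/W expanded pointer could be handed to f() in
 * "var := f(var, ...)" without changing error semantics. */
bool toy_can_pass_rw(bool f_is_expression_top_level,
                     bool inside_exception_block)
{
    /* f() must be the top-level function of the expression: otherwise
     * some function evaluated later could throw after f() has already
     * mutated the variable. */
    if (!f_is_expression_top_level)
        return false;

    /* Inside an exception block the old value must survive an error,
     * so without a no-error-after-modify promise from f(), be
     * conservative.  Outside one, an error propagates out of the
     * plpgsql function and the variable's value no longer matters. */
    if (inside_exception_block)
        return false;

    return true;
}
```

The second test is the new, weaker requirement: it replaces "f() never errors after modifying" with "nobody can observe the variable after an error", which is what makes the idea attractive for library code.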

> same with:
> bfs_vector = vxm(bfs_vector, graph, 'any_secondi_int32',
> w=>bfs_vector, accum=>'min_int32');
> This matmul mutates bfs_vector, I shouldn't need to reassign it back but at
> the moment it seems necessary otherwise the mutations are lost but this
> costs a full flatten/expand cycle.

I'm not hugely excited about that. We could imagine extending
this concept to INOUT parameters of procedures, but it doesn't
seem like that buys anything except a small notational savings.
Maybe it would work to allow multiple parameters of a procedure
to be passed as R/W, whereas we're restricted to one for the
function-notation method. So possibly there's a gain there but
I'm not sure how big.

BTW, if I understand your example correctly then bfs_vector is
being passed to vxm() twice. This brings up an interesting
point: if we pass the first instance as R/W, and vxm() manipulates it,
then the changes would also be visible in its other parameter "w".
This is certainly not per normal semantics. A "safe" function would
have to either not have any possibility that two parameters could
refer to the same object, or be coded in a way that made it impervious
to this issue --- in your example, it couldn't look at "w" anymore
once it'd started modifying the first parameter. Is that an okay
requirement, and if not what shall we do about it?
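The aliasing hazard is easy to demonstrate with a toy analogue of vxm(v, ..., w => v) in C (invented functions, nothing to do with the actual GraphBLAS API): if the callee reads w after it has started writing the aliased R/W parameter, the answer silently changes; snapshotting w first makes it impervious.

```c
#include <string.h>

/* Both parameters may point at the same array, as when the same
 * plpgsql variable is passed twice. */

/* Unsafe: reads w after earlier writes to dst, so aliasing between
 * dst and w changes the result. */
void toy_accum_unsafe(int *dst, const int *w, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += w[n - 1 - i];    /* may see dst's own fresh writes */
}

/* Safe: snapshot w before touching dst (n assumed <= 16 for the toy),
 * so the function never observes its own modifications. */
void toy_accum_safe(int *dst, const int *w, size_t n)
{
    int snap[16];
    memcpy(snap, w, n * sizeof(int));
    for (size_t i = 0; i < n; i++)
        dst[i] += snap[n - 1 - i];
}
```

The snapshot costs a copy of w, which is why "just defensively copy every parameter" would give back much of the savings the R/W optimization is after.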

Anyway that's my brain dump for today. Thoughts?

regards, tom lane

Attachment Content-Type Size
v1-absorb-any-rw-input-parameter.patch text/x-diff 2.0 KB
