Re: Using Expanded Objects other than Arrays from plpgsql

From: Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Using Expanded Objects other than Arrays from plpgsql
Date: 2024-10-21 17:23:33
Message-ID: CACxu=vKEF8Qa-OaADFxf0uMg-xw6gH_CNCWd2s+xaqh-gY4=xg@mail.gmail.com
Lists: pgsql-general pgsql-hackers

On Sun, Oct 20, 2024 at 8:46 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com> writes:
> > On Sun, Oct 20, 2024 at 10:13 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>

(from thread
https://www.postgresql.org/message-id/CACxu%3DvJaKFNsYxooSnW1wEgsAO5u_v1XYBacfVJ14wgJV_PYeg%40mail.gmail.com
)

> >> But it seems like we could get an easy win by adjusting
> >> plpgsql_exec_function along the lines of
> >> ...
>
> > I tried this change and couldn't get it to work, on the next line:
> > if (VARATT_IS_EXTERNAL_EXPANDED_RW(DatumGetPointer(var->value)))
> > var->value might not be a pointer, as it seems at least from my gdb
> > scratching, but say an integer. This segfaults on non-array but
> > non-expandable datum.
>
> We need the same test that exec_assign_value makes,
> !var->datatype->typbyval, before it's safe to apply DatumGetPointer.
> So line 549 needs to be more like
>
> - if (!var->isnull && var->datatype->typisarray)
> + if (!var->isnull && !var->datatype->typbyval)
>
> > Another comment that caught my eye was this one:
> >
> https://github.com/postgres/postgres/blob/master/src/pl/plpgsql/src/pl_exec.c#L8304
> > Not sure what the implication is there.
>
> Yeah, that's some more unfinished business. I'm not sure if it
> matters to your use-case or not.
>
> BTW, we probably should move this thread to pgsql-hackers.

And here we are; thanks for your help on this, Tom. For some context for
others following the thread switch: I'm writing a Postgres extension that
wraps the SuiteSparse:GraphBLAS API and provides new types for sparse and
dense matrices and vectors. It's like a combination of numpy and
scipy.sparse, but for Postgres, with an emphasis on graph analytics over
sparse adjacency matrices using linear algebra.

I use the expandeddatum API to flatten and expand on-disk, compressed
representations of these objects into "live" in-memory objects managed by
SuiteSparse. All GraphBLAS objects are opaque handles, and my expanded
objects are essentially a box around such a handle. I use memory context
callbacks to free the handles when the memory context of the box is freed.
This works very well, and I've made a lot of progress on creating a very
clean algebraic API. Here are the doctests for matrices, all generated from
live code:

https://onesparse.github.io/OneSparse/test_matrix_header/
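
To make that concrete, here is a minimal sketch of the box pattern, using
hypothetical names (os_Vector, vector_methods, vector_free_callback,
expand_vector); the real OneSparse code differs, and the flatten/expand
methods are elided:

#include "postgres.h"
#include "utils/expandeddatum.h"
#include "utils/memutils.h"
#include "GraphBLAS.h"

/* The expanded "box": an ExpandedObjectHeader plus the opaque handle. */
typedef struct os_Vector
{
    ExpandedObjectHeader hdr;
    GrB_Vector  vec;                  /* opaque SuiteSparse handle */
    MemoryContextCallback ctx_cb;     /* frees the handle with the box */
} os_Vector;

/* get_flat_size / flatten_into methods elided for brevity. */
extern const ExpandedObjectMethods vector_methods;

/* Reset callback: release the SuiteSparse handle when the box's
 * memory context is reset or deleted. */
static void
vector_free_callback(void *arg)
{
    os_Vector  *box = (os_Vector *) arg;

    if (box->vec != NULL)
        GrB_Vector_free(&box->vec);
}

/* Wrap an existing GraphBLAS vector in an expanded object. */
Datum
expand_vector(GrB_Vector vec, MemoryContext parentcontext)
{
    MemoryContext objcxt;
    os_Vector  *box;

    /* Every expanded object lives in its own memory context. */
    objcxt = AllocSetContextCreate(parentcontext,
                                   "expanded vector",
                                   ALLOCSET_DEFAULT_SIZES);

    box = (os_Vector *) MemoryContextAlloc(objcxt, sizeof(os_Vector));
    EOH_init_header(&box->hdr, &vector_methods, objcxt);
    box->vec = vec;

    /* Tie the handle's lifetime to the box's context. */
    box->ctx_cb.func = vector_free_callback;
    box->ctx_cb.arg = box;
    MemoryContextRegisterResetCallback(objcxt, &box->ctx_cb);

    return EOHPGetRWDatum(&box->hdr);
}

The flatten/expand methods are where GraphBLAS serialize/deserialize comes
in; everything else works directly on the live handle.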

Doing some benchmarking, I noticed that when writing some simple graph
algorithms in plpgsql, my objects were being constantly flattened and
expanded. After I posted my question to pgsql-general (above), Tom gave me
some tips and suggested I move the thread here.

My non-expert summary: plpgsql only optimizes for expanded objects if they
are arrays. Non-array expanded objects get flattened/expanded on every use.
For large matrices and vectors this is very expensive. Ideally I'd like to
expand my object, use it throughout the function, return it to another
function that may use it, and only flatten it when it goes to disk or when
flattening is completely unavoidable. The comment in expandeddatum.h really
sells this as one of the main features:

* An expanded object is meant to survive across multiple operations, but
* not to be enormously long-lived; for example it might be a local variable
* in a PL/pgSQL procedure.  So its extra bulk compared to the on-disk format
* is a worthwhile trade-off.

In my case it's not a question of saving bulk: the on-disk representation of
a matrix is not useful at compute time; it must be expanded (using
GraphBLAS's serialize/deserialize API) before it can be used in matrix
operations like matmul. In most cases, algorithms using these objects
iterate in a loop, doing various algebraic operations (almost always
involving a matmul) until they converge on a stable solution or exhaust the
input elements. Here, for example, is a "minimum parent BFS" that takes a
graph and a starting node, traverses the graph breadth first, and computes a
vector of every node and its minimum parent id.

CREATE OR REPLACE FUNCTION bfs(graph matrix, start_node bigint)
RETURNS vector LANGUAGE plpgsql AS
$$
DECLARE
    bfs_vector vector = vector('int32');
    next_vector vector = vector('int32');
BEGIN
    bfs_vector = set_element(bfs_vector, start_node, 1);
    WHILE bfs_vector != next_vector LOOP
        next_vector = dup(bfs_vector);
        bfs_vector = vxm(bfs_vector, graph, 'any_secondi_int32',
                         w=>bfs_vector, accum=>'min_int32');
    END LOOP;
    RETURN bfs_vector;
END;
$$;

(If you're wondering "why would anyone do it this way?": SuiteSparse is
optimized for parallel sparse matrix multiplication and has a JIT compiler
that can target multiple architectures, at the moment CPUs and CUDA GPUs.
Reusing the same linear algebra already prevalent in graph theory means not
having to think about any low-level implementation issues, and having code
that is completely portable from CPU to GPU or other accelerators.)

So, I made the two small changes Tom suggested above and I have them in a
side fork here:

https://github.com/postgres/postgres/compare/master...michelp:postgres-upstream:michelp-flatless#diff-0c35024d1576c347689c7abad68abd8562a0aa5f0d2c63d6d65df4b360b0e807
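
For reference, the adjusted guard ends up looking roughly like this (a
paraphrase of the code around pl_exec.c line 549 after the change; the diff
linked above is authoritative):

/* Only non-byval datums can safely be treated as pointers. */
if (!var->isnull && !var->datatype->typbyval)   /* was: ...->typisarray */
{
    if (VARATT_IS_EXTERNAL_EXPANDED_RW(DatumGetPointer(var->value)))
    {
        /* already an expanded read-write object: leave it expanded */
    }
    /* remaining expand-or-copy handling elided */
}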

Good news: my code still works. Bad news: there is still a lot of
flattening/expanding/freeing going on at multiple points in each iteration
of the algorithm. I'll note too that in:

BEGIN
    bfs_vector = set_element(bfs_vector, start_node, 1);

I'd prefer that to not be an assignment, since set_element mutates the
object in place (I eventually plan to support subscripting syntax like
bfs_vector[start_node] = 1).

Same with:

bfs_vector = vxm(bfs_vector, graph, 'any_secondi_int32',
w=>bfs_vector, accum=>'min_int32');

This matmul mutates bfs_vector, so I shouldn't need to assign it back; at
the moment the reassignment seems necessary, otherwise the mutations are
lost, but it costs a full flatten/expand cycle.

Short term, my goal is to optimize plpgsql so that my objects stay expanded
for the life of the function. Long term, there's some "unfinished business",
to use Tom's words, around the expandeddatum API. I'm not really qualified
to speak on these issues, but this is my understanding of some of them:

- plpgsql knows how to expand arrays and is hardwired for it, but how
would it know how to expand other expandable types?

- exec_check_rw_parameter is also hardwired to only optimize expanded
objects for array append and prepend; I suspect this has something to do
with my issue above about mutating objects in place (see the sketch below).
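
To make that second point concrete, my understanding is that the current
decision has roughly this shape (a conceptual paraphrase only, with a
hypothetical helper name and signature; the real exec_check_rw_parameter
works on the expression tree):

#include "postgres.h"
#include "utils/fmgroids.h"

/*
 * Can the target variable be passed to the called function as a
 * read-write expanded pointer?  Today only the array append/prepend
 * support functions are trusted to mutate their argument safely.
 */
static bool
can_pass_rw_expanded_sketch(Oid funcid)
{
    if (funcid == F_ARRAY_APPEND || funcid == F_ARRAY_PREPEND)
        return true;

    /*
     * Anything else, including extension functions like vxm() or
     * set_element(), gets a read-only pointer, so mutating the object
     * means copying it first and assigning the result back.
     */
    return false;
}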

I may have missed something, but hopefully that brings anyone interested in
this topic up to speed.

-Michel
