Re: table-valued arguments for functions

From: "Wappler, Robert" <rwappler(at)ophardt(dot)com>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Re: table-valued arguments for functions
Date: 2009-12-22 13:46:25
Message-ID: C8E2DAF0E663A948840B04023E0DE32A0241E01D@w2k3server02.de.ophardt.com
Lists: pgsql-general

> -----Original Message-----
> From: Filip Rembiałkowski [mailto:plk(dot)zuber(at)gmail(dot)com]
> Sent: Monday, December 21, 2009 6:51 PM
> To: Wappler, Robert
> Cc: pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL]
>
>
> 2009/12/21 Wappler, Robert <rwappler(at)ophardt(dot)com>
>
>
> Hello,
>
> when defining functions, why can't they take table-valued arguments?
> I do not see the sense of this restriction. When a function takes a table
> as an argument, whole tables could be transformed in an obvious way.
> Currently functions must operate row-wise (at least I haven't found
> another way). This also limits the capabilities of joining the row-
> arguments, obviously this must be done before passing the arguments to the
> functions, instead of having one or more table-arguments and simply doing
> the joins within the function. Further, table valued arguments could lower
> the developer's work to prepare the arguments before the function call and
> enabling each functions to care for them itself by transforming them for
> the implementation instead of introducing several more or less anonymous
> record types just for passing values from one function to another which
> often also carry implementation specific constraints. The structure of the
> code as a set of functions would also greatly improve.
>
>
>
>
> SQL itself is a language which operates on data sets. Interesting, why you
> need a functional interface here?
>

Mainly for processing batches in a handy way.

> Could you give some example - what you want to achieve, and what's the
> missing element?
> ( please add some DDL and code if possible ).
>
>
> As you know, every table in postgres has its rowtype definition.
>
> CREATE TABLE aaa ( ... );
> CREATE FUNCTION aaa_do (aaa) ...;
> SELECT aaa_do(aaa.*) from aaa;
>
> but that's the row-wise way.
>
> You can also operate on tables via their *names*:
> SELECT somefunc( 'aaa' ); /* some dynamic SQL inside */
>
> You will find some examples in contrib/tablefunc
>
>
>
>
>
>
> What does the optimizer do with this per-row-semantic, even if it is
> invoked on each row? Does it transform the queries within the function to
> directly work on the table and realize possible advantages from indexes
> and other techniques?
>
> Yes that's what it does, in general.
> But, if processing is all "canned" inside user defined functions, the
> optimizer can't do it's job.
>
>

Assume we have a set of sensors which regularly send some data. Depending on the frequency at which the data is sent, insertion performance can become very important, so a very simple schema is usually used for staging, e.g.

CREATE TABLE sensor_data (
    id SERIAL PRIMARY KEY,
    sensor_id int,
    time timestamp NOT NULL DEFAULT NOW(),
    temperature real,
    humidity real);

The sensor id alone does not carry enough information for creating useful reports, other tables may provide location information.

CREATE TABLE sensor_location (
    sensor_id int PRIMARY KEY,
    location text NOT NULL);
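
To make the example concrete, here are a few sample rows (the sensor ids and location names are invented for illustration):

```sql
-- Hypothetical sample data for the two tables above
INSERT INTO sensor_location (sensor_id, location) VALUES
    (1, 'warehouse'),
    (2, 'office');

INSERT INTO sensor_data (sensor_id, temperature, humidity) VALUES
    (1, 18.5, 0.40),
    (1, 18.7, 0.41),
    (2, 21.3, 0.35);
```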

A periodic job may now take a batch of rows from the sensor_data table and fold them incrementally into tables used to create reports.

CREATE TABLE temperature_per_location (
    location text PRIMARY KEY,
    avg_temperature real,
    no_of_measurements int);

The row-wise way to update temperature_per_location could be

CREATE OR REPLACE FUNCTION update_tpl(measurement sensor_data)
RETURNS void AS $$
    UPDATE temperature_per_location AS tpl
    SET avg_temperature = (tpl.no_of_measurements * tpl.avg_temperature
                           + $1.temperature)
                          / (tpl.no_of_measurements + 1),
        no_of_measurements = tpl.no_of_measurements + 1
    FROM sensor_location l
    WHERE $1.sensor_id = l.sensor_id
      AND l.location = tpl.location;
$$ LANGUAGE SQL VOLATILE;

Obviously, the above function must be called from inside a loop or within a SELECT list to process a whole batch of rows. At least inside a loop, I cannot see how an optimizer could make use of indexes that may exist on sensor_location; the actual order of the updates will depend on the order in which rows are passed in from sensor_data. The call has to be prepared, e.g. by

DECLARE
    b sensor_data;
BEGIN
    FOR b IN SELECT * FROM select_batch() LOOP
        PERFORM update_tpl(b);
    END LOOP;
END;

Or, a bit shorter but still row-wise:

SELECT update_tpl(b) FROM select_batch() AS b;
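
For comparison, the same bookkeeping over the whole staging table can be written as a single set-based statement outside any function; this is the kind of statement the optimizer can plan freely. A sketch, assuming temperature_per_location already contains one row per location:

```sql
-- Set-based batch update over the full staging table (sketch;
-- assumes a row already exists per location)
UPDATE temperature_per_location AS tpl
SET avg_temperature = (tpl.no_of_measurements * tpl.avg_temperature
                       + agg.sum_temperature)
                      / (tpl.no_of_measurements + agg.cnt),
    no_of_measurements = tpl.no_of_measurements + agg.cnt
FROM (SELECT l.location,
             SUM(d.temperature) AS sum_temperature,
             COUNT(*)::int AS cnt
      FROM sensor_data d
      JOIN sensor_location l USING (sensor_id)
      GROUP BY l.location) AS agg
WHERE tpl.location = agg.location;
```

The point of a table-valued argument would be to keep this set-based shape while restricting it to an arbitrary batch chosen by the caller.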

Assuming we could pass a whole batch of rows as a table, the update could be performed as follows:

CREATE OR REPLACE FUNCTION update_tpl(measurements SETOF sensor_data)
RETURNS void AS $$
    UPDATE temperature_per_location AS tpl
    SET avg_temperature = (tpl.no_of_measurements * tpl.avg_temperature
                           + agg.sum_temperature)
                          / (tpl.no_of_measurements + agg.cnt),
        no_of_measurements = tpl.no_of_measurements + agg.cnt
    FROM (SELECT l.location,
                 SUM(m.temperature) AS sum_temperature,
                 COUNT(*) AS cnt
          FROM measurements m
          JOIN sensor_location l USING (sensor_id)
          GROUP BY l.location) AS agg
    WHERE tpl.location = agg.location;
$$ LANGUAGE SQL VOLATILE;

The second version wouldn't need an enclosing loop, and it could directly use a composite index on sensor_location (sensor_id, location) to optimize the join performed in the update. The invocation could simply be

SELECT * FROM update_tpl(select_batch());

This example is synthetic and quite simple; it obviously lacks the removal of processed rows from the staging table, as well as an initial insertion for locations not yet present in temperature_per_location. But I think it illustrates the usefulness of table-valued arguments.
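
As a workaround available today, a batch can be passed as an array of the table's row type (arrays of composite types exist since 8.3, unnest() since 8.4). A hedged sketch; update_tpl_batch is a name I am inventing here:

```sql
-- Sketch: array of the row type standing in for a SETOF argument
CREATE OR REPLACE FUNCTION update_tpl_batch(measurements sensor_data[])
RETURNS void AS $$
    UPDATE temperature_per_location AS tpl
    SET avg_temperature = (tpl.no_of_measurements * tpl.avg_temperature
                           + agg.sum_temperature)
                          / (tpl.no_of_measurements + agg.cnt),
        no_of_measurements = tpl.no_of_measurements + agg.cnt
    FROM (SELECT l.location,
                 SUM(m.temperature) AS sum_temperature,
                 COUNT(*)::int AS cnt
          FROM unnest($1) AS m
          JOIN sensor_location l USING (sensor_id)
          GROUP BY l.location) AS agg
    WHERE tpl.location = agg.location;
$$ LANGUAGE SQL VOLATILE;

-- Called by aggregating the batch into an array:
SELECT update_tpl_batch(array_agg(b)) FROM select_batch() AS b;
```

The array is materialized in memory, so this does not scale to arbitrarily large batches, but it keeps the function's body set-based.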

Another case where I have wanted table-valued arguments is the insertion of missing values, which often has to examine dependencies between different rows. For now, this means determining those rows before calling the function and passing them as arguments, making the set of incomplete data explicit by inserting it into an auxiliary table before continuing the processing, or doing everything within one function using variables and/or sub-queries. I'm not a fan of doing everything in one lengthy function. Functions are also a tool to structure the code, e.g. into select_batch() returning SETOF sometype, add_location_information(SETOF sometype) returning SETOF other_type, add_missing_values(SETOF other_type) returning SETOF other_type, and finally update_tables(SETOF other_type), leading to easily changeable implementations of the different tasks.
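
Until such a feature exists, that pipeline can at least be approximated with arrays of composites. These are hypothetical signatures only, mirroring the names proposed above (sometype and other_type are placeholders):

```sql
-- Hypothetical pipeline, with ARRAY arguments standing in
-- for the proposed SETOF arguments
CREATE FUNCTION select_batch()
    RETURNS SETOF sometype AS ...;
CREATE FUNCTION add_location_information(sometype[])
    RETURNS SETOF other_type AS ...;
CREATE FUNCTION add_missing_values(other_type[])
    RETURNS SETOF other_type AS ...;
CREATE FUNCTION update_tables(other_type[])
    RETURNS void AS ...;

-- Composed in one statement, each stage aggregated into an array:
SELECT update_tables(
    (SELECT array_agg(m)
     FROM add_missing_values(
         (SELECT array_agg(x)
          FROM add_location_information(
              (SELECT array_agg(s) FROM select_batch() s)) x)) m));
```

The nesting is clumsy compared to real table-valued arguments, but each stage stays a separately testable, replaceable function.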

In response to

  • Re: at 2009-12-21 17:51:07 from Filip Rembiałkowski
