cursors, temp tables, dynamic sql technique (was and is: looping over a small record set over and over in a function)

From: Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: cursors, temp tables, dynamic sql technique (was and is: looping over a small record set over and over in a function)
Date: 2009-06-21 12:57:16
Message-ID: 20090621145716.51aec2b7@dawn.webthatworks.it
Lists: pgsql-general

On Sun, 21 Jun 2009 10:57:51 +0800
Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> wrote:

> On Fri, 2009-06-19 at 20:23 +0200, Ivan Sergio Borgonovo wrote:
>
> > If I could easily load all the dataset into an array, loop
> > through it and then just update the computed field it would be
> > nice... but how?
>
> Are you sure you _really_ need a PL/PgSQL function for this?

Not really sure. I'm investigating.

But I noticed that even working with a fully fledged procedural
language I'd have to loop twice to replace/add the computed field
to the "original" one, so it's surely not a big loss if I "loop"
twice with SQL.

I think everything could be summed up as:

select into t myaggregate1(field) from dataset where condition1;
if (t > 10) then
    update dataset set field = myfunc1(a, b, c) where condition1;
end if;

select into t myaggregate2(field) from dataset where condition2;
if (t > 44) then
    update dataset set field = myfunc2(a, b, c) where condition2;
end if;
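
Spelled out as a real PL/pgSQL function that would be roughly the
following. Just a minimal sketch: dataset, field, the
myaggregate*/myfunc* names are the placeholders from above, and
the WHERE clauses (a > 0, b > 0) stand in for condition1 and
condition2.

CREATE OR REPLACE FUNCTION recompute_fields() RETURNS void AS $$
DECLARE
    t numeric;
BEGIN
    -- first pass: aggregate, then conditionally update
    SELECT myaggregate1(field) INTO t
        FROM dataset WHERE a > 0;              -- condition1
    IF t > 10 THEN
        UPDATE dataset SET field = myfunc1(a, b, c)
            WHERE a > 0;
    END IF;

    -- second pass, same shape
    SELECT myaggregate2(field) INTO t
        FROM dataset WHERE b > 0;              -- condition2
    IF t > 44 THEN
        UPDATE dataset SET field = myfunc2(a, b, c)
            WHERE b > 0;
    END IF;
END;
$$ LANGUAGE plpgsql;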

Actually there is a trick I could use to "skip" the update loop
and pass the result on to the next loop, but it only looks
convenient in the more general case where the aggregate has to be
computed on the "updated" value. I'm still not sure whether the
functions I'm working with always fall into the above class,
where myaggregate works on the "original" field.

Of course myaggregate could compute myfunc... but then I'd have
to compute myfunc twice.

In a loop I could

create temp table t on commit drop as
    select *, field as field1, field as field2, field as result
    from t1
    join t2 on ...
    join t3 on ...
    where ...;

flipflop = 'field1';
flopflip = 'field2';

agg = 0;
foreach (row) {
    if (condition on row) {
        row[flopflip] = myfunc(row[flipflop]);
    } else {
        row[flopflip] = row[flipflop];
    }
    agg += row[flopflip];
}
if (condition on agg) {
    swap flipflop and flopflip;
}

agg = 0;
foreach (row) {
    if (condition on row) {
        row[flopflip] = myfunc(row[flipflop]);
    } else {
        row[flopflip] = row[flipflop];
    }
    agg += row[flopflip];
}
if (condition on agg) {
    swap flipflop and flopflip;
}

execute 'update t set result = ' || flipflop;

I think this could be implemented with cursors.
It is going to be much more verbose, since using the "flipflop"
technique requires dynamic statements and EXECUTE.

EXECUTE
    'UPDATE t SET ' || flipflop || ' = ' || newfield ||
    ' WHERE CURRENT OF cursor';

Even the "condition on row" has to be expressed as a dynamic
statement if the condition involves the computed field.
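
Putting the pieces together, a minimal PL/pgSQL sketch of the
temp table + cursor + flipflop technique could look like this.
t1, myfunc and the two conditions (r.a > 0, agg > 100) are
placeholders; giving the refcursor a fixed name up front is what
makes WHERE CURRENT OF usable inside EXECUTE.

CREATE OR REPLACE FUNCTION flipflop_pass() RETURNS void AS $$
DECLARE
    flipflop text := 'field1';   -- column read in this pass
    flopflip text := 'field2';   -- column written in this pass
    tmp      text;
    agg      numeric;
    newval   numeric;
    r        record;
    cur      refcursor := 'flipflop_cur';  -- fixed portal name
BEGIN
    -- private scratch copy, gone at commit
    CREATE TEMP TABLE t ON COMMIT DROP AS
        SELECT *, field AS field1, field AS field2,
               field AS result
        FROM t1;

    FOR pass IN 1 .. 2 LOOP
        agg := 0;
        -- alias the "read" column so the record field
        -- always has the same name
        OPEN cur FOR EXECUTE
            'SELECT a, ' || quote_ident(flipflop) ||
            ' AS src FROM t';
        LOOP
            FETCH cur INTO r;
            EXIT WHEN NOT FOUND;
            IF r.a > 0 THEN            -- "condition on row"
                newval := myfunc(r.src);
            ELSE
                newval := r.src;       -- copy unchanged value,
                                       -- keeps the columns in step
            END IF;
            EXECUTE 'UPDATE t SET ' || quote_ident(flopflip)
                 || ' = ' || newval
                 || ' WHERE CURRENT OF flipflop_cur';
            agg := agg + newval;
        END LOOP;
        CLOSE cur;
        IF agg > 100 THEN              -- "condition on agg"
            tmp := flipflop;
            flipflop := flopflip;
            flopflip := tmp;
        END IF;
    END LOOP;

    EXECUTE 'UPDATE t SET result = ' || quote_ident(flipflop);
END;
$$ LANGUAGE plpgsql;

Whether this row-at-a-time WHERE CURRENT OF update actually beats
one set-based UPDATE per pass is exactly the open question here.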

Compared to the select + update solution it avoids looping and
checking the condition twice, but it works on a larger dataset
and runs the update even for unchanged rows (maybe the else
branch could be omitted?).
I'm not sure the actual set of functions has the update condition
equal to the select condition anyway.
But I think wasted updates have a larger cost if they require
disk IO.

I think things might look more "convenient" if there were a
simple and cheap way to load a record set into an array, do stuff
with the procedural language and substitute the old record set
with the computed one.
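
For a single column something like that is already possible with
ARRAY(subquery); a minimal sketch, where dataset, field, myfunc,
a and the id ordering are all placeholders. The catch is the
write-back: getting the computed values from the array into the
table still costs an UPDATE per row.

CREATE OR REPLACE FUNCTION array_pass() RETURNS numeric[] AS $$
DECLARE
    vals numeric[];
    i    integer;
BEGIN
    -- ARRAY(subquery) loads one column into an array in one shot
    vals := ARRAY(SELECT field FROM dataset
                  WHERE a > 0 ORDER BY id);
    -- ARRAY() results are 1-based; coalesce handles the empty set
    FOR i IN 1 .. coalesce(array_upper(vals, 1), 0) LOOP
        vals[i] := myfunc(vals[i]);
    END LOOP;
    RETURN vals;   -- writing these back still needs per-row UPDATEs
END;
$$ LANGUAGE plpgsql;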

Currently I'll get most of the speed-up from looping through a
smaller data set, avoiding a join over a very large table.
Considering the very small data set I'm expecting (no more than
20 records) I don't think dynamic statements, lack of indexes
etc. are going to have an impact.

As soon as the number of clients increases, optimizing the loops
may become necessary.
I'll have to see if all the cases I'm dealing with can be solved
by the SELECT + conditional UPDATE technique, since it looks much
easier to maintain.
Furthermore I suspect that in my real case the conditional check
will be made on the original value, so that myfunc will be
computed at most once and I can delay the aggregate computation
on the computed field until after the last loop has run.
The remaining costs will be:
- looping twice over the same table (getting the aggregate +
updating the table)
- the updates, if they cause disk IO

I even suspect the test will mostly fail, so updates will be
rare, but I'd still like to understand how this could work in the
most general case and how the temp table + cursor + flipflop
technique works internally.

I think I really don't have a clear picture of how temp tables
work.
A temp table is private to the session that created it, so a
concurrent transaction in another session can never see it,
committed or not.
And if the transaction in which the temp table is created is
aborted... no other transaction will ever know the temp table
existed.

If the temp table is defined with ON COMMIT DROP... somehow the
changes happening on the temp table never need to land on disk.
So it may actually behave more like temporary, private storage
that doesn't have to be aware of concurrency.
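
That is, something like (scratch is just a made-up name for the
example):

BEGIN;
CREATE TEMP TABLE scratch (x int) ON COMMIT DROP;
INSERT INTO scratch VALUES (1);
-- scratch is visible only here, in this session
COMMIT;
-- scratch is gone now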

So temp tables should require less disk IO. Is that right?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
