From: Phil Frost <indigo(at)bitglue(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: stable function optimizations, revisited
Date: 2006-05-12 17:59:08
Message-ID: 20060512175908.GA30515@unununium.org
Lists: pgsql-performance
I have recently been encountering a number of significant performance
problems related to stable functions being called multiple times when I
believe they could be called just once. Searching the ML archives, I see
I'm not the first:
<http://archives.postgresql.org/pgsql-hackers/2003-04/msg00890.php>
<http://archives.postgresql.org/pgsql-performance/2006-01/msg00140.php>
and so on. None of them seemed to resolve into a plan of action or an
elegant workaround. It is mentioned that "stable" was added to allow
such functions to be used for index scans, but I could not tell whether
the other optimizations I would like are impossible, or possible but
unimplemented (and perhaps never to be implemented).
I have several examples of queries in which eliminating extra calls to a
stable function would result in very significant performance gains. All
of these cases were found while developing a real application, and
although I've simplified them to be more readable, they are not
contrived.
Problem 1: creating a view with multiple columns calculated from a
function.
create table sale(saleid serial, total numeric);
create function cost_of_sale(saleid integer) returns numeric stable
language plpgsql as $$
-- calculates the cost of purchasing the things sold in a sale
-- takes considerable time to calculate
$$;
create view convenient_view_on_sale as
select *,
    cost_of_sale(saleid) as cost,
    total - cost_of_sale(saleid) as profit,
    case when total != 0 then (total - cost_of_sale(saleid)) / total * 100 end as margin
from sale;
Executing "select * from convenient_view_on_sale limit 1" will execute
cost_of_sale thrice. However, from the definition of stable, we know it
could have been called just once. As cost_of_sale takes hundreds of ms
to execute while the rest of the query is extremely simple, additional
calls in effect multiply the total execution time.
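The repeated evaluation is easy to observe with a trivial stable function that logs each invocation (a hypothetical demonstration; the function name and constant are mine, not from the application above):

```sql
-- hypothetical demo: a stable function that reports each call
create function noisy_cost(id integer) returns numeric stable
language plpgsql as $$
begin
    raise notice 'noisy_cost(%) called', id;
    return 42;
end;
$$;

-- referencing the function three times in one target list,
-- as the view above does, raises the NOTICE three times
select noisy_cost(1), 10 - noisy_cost(1), noisy_cost(1) / 10;
```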
Nonsolution 1a: moving function to a subselect:
create view convenient_view_on_sale as
select *,
total - cost as profit,
case when total != 0 then (total-cost) / total * 100 end as margin
from (select *, cost_of_sale(saleid) as cost from sale) as subq;
The query planner will eliminate the subselect, and cost_of_sale will
still be executed thrice. I can observe no change in behaviour
whatsoever with this view definition.
PS: I wonder what the behaviour would be if I explicitly inlined
cost_of_sale here?
Nonsolution 1b: preventing optimization of the subselect with "offset 0"
create view convenient_view_on_sale as
select *,
total - cost as profit,
case when total != 0 then (total-cost) / total * 100 end as margin
from (select *, cost_of_sale(saleid) as cost from sale offset 0) as subq;
This helps in the case of a "select *"; the subquery will not be
eliminated due to the "offset 0", and cost_of_sale will be executed only
once. However, it will always be executed, even if none of the cost
related columns are selected. For example,
"select saleid from convenient_view_on_sale limit 1" will execute
cost_of_sale once, even though it could have been skipped entirely.
Problem 1 has a workaround: perform the dependent calculations (profit
and margin in this case) on the client, or in a stored procedure. This
is often inconvenient, but it works.
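As a sketch of that workaround: fetch total and cost once per row, then derive the other columns in the client (plain Python; the function and column names here are my own illustration):

```python
def derive_sale_columns(total, cost):
    """Compute profit and margin client-side from a single cost value,
    mirroring the view's expressions; cost_of_sale runs only once."""
    profit = total - cost
    margin = (total - cost) / total * 100 if total != 0 else None
    return {"cost": cost, "profit": profit, "margin": margin}

row = derive_sale_columns(total=200.0, cost=150.0)
# row["profit"] == 50.0, row["margin"] == 25.0
```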
Problem 2: expensive functions returning composite types.
Consider that the purchases for a sale might have not yet been made, so
the exact cost can not be known, but a guess can be made based on the
current prices. cost_of_sale might be updated to reflect this:
create function cost_of_sale(saleid integer, out cost numeric, out estimated bool)
stable language plpgsql as $$ ... $$;
create view convenient_view_on_sale as
select *, cost_of_sale(saleid) from sale;
Note that in many cases, calculating "cost" and "estimated" together
takes just as long as calculating either one. This is why both are
returned from the same function.
Now, I use python as a client, in particular the psycopg2 module. When I
do something such as "select cost from convenient_view_on_sale", the
values returned for the cost column (a composite type (numeric, bool))
are strings. Perhaps this is an issue with psycopg2, but as a user, this
is very annoying since I can not really get at the components of the
composite type without reimplementing pg's parser. Granted, I could
probably do it simply in a way that would work most of the time, but I
feel it would be error prone, and I'd rather not.
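To illustrate why it is error prone: a naive split of the composite's text form works for simple values but breaks as soon as a field contains a quoted comma (hypothetical sketch, not a real parser):

```python
def naive_parse_composite(text):
    """Naively split PostgreSQL's composite text form '(a,b,...)'.
    Works only when no field contains commas, quotes, or escapes."""
    assert text.startswith("(") and text.endswith(")")
    return text[1:-1].split(",")

cost, estimated = naive_parse_composite("(149.90,t)")
# cost == "149.90", estimated == "t"  -- fine for simple values,
# but naive_parse_composite('("a,b",t)') wrongly yields three fields
```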
Thus, I seek a way to get the components of the cost column as top-level
columns. For example, I try "select (cost).cost, (cost).estimated", but
this executes cost_of_sale twice, doubling the time of my query.
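One partial escape, in the spirit of nonsolution 1b, is to expand the composite inside an "offset 0" subselect so the function runs once per row (though, as noted above, it then runs unconditionally even when the columns are not selected):

```sql
select (c).cost, (c).estimated
from (select cost_of_sale(saleid) as c
      from sale
      offset 0) as subq;
```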
Since stable functions are the most common in my experience, and I have
quite a number of them that perform complex, slow queries, I'd really
like to see optimizations in this area. Until such a time, I would very
much appreciate any workaround suggestions.