Re: Functions as conditions

From: "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Bao Ton <oabnot(at)gmail(dot)com>
Cc: "pgsql-novice(at)postgresql(dot)org" <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Functions as conditions
Date: 2016-04-21 20:14:11
Message-ID: CAKFQuwbpes2eEm9m=jeQCA79NfjuPGMjY4rd2zf4yH48Vw6Bag@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

On Thu, Apr 21, 2016 at 12:24 PM, Bao Ton <oabnot(at)gmail(dot)com> wrote:

> Hello all,
>
> I'd like some verification on how functions as conditions work and
> perform on a table, specifically a function that would take a specific
> column as input, and output a boolean value.
>
> psql --version
> psql (PostgreSQL) 9.5.0
>
> Suppose I have a table with two columns, id and an md5 with 700000 rows:
> CREATE TABLE t_random AS SELECT s, md5(random()::text) FROM
> generate_Series(1,700000) s;
>
> Let's create a function that checks if column s is over 500:
>
> CREATE OR REPLACE FUNCTION test_func(integer)
> RETURNS boolean
> LANGUAGE plpgsql
>

​You should always use "sql" if your function can be written to use it
acceptably. In this case it can.
[...]
STRICT -- minimal help but its nice to specify your intended behavior when
faced with null
IMMUTABLE -- your example has no external dependencies
LANGUAGE sql
AS $function$
SELECT $1 > 500;
$function$​;

> AS $function$
> BEGIN
> IF $1 > 500 THEN
> return true;
> ELSE
> return false;
> END IF;
> END$function$;
>
> Running:
> EXPLAIN ANALYZE SELECT * FROM t_random WHERE *test_func(s)*
> QUERY
> PLAN
>
> --------------------------------------------------------------------------------------------------------------------
> Seq Scan on t_random (cost=0.00..187834.00 rows=*233333* width=37)
> (actual time=0.505..572.463 rows=*699500* loops=1)
> Filter: test_func(s)
> Rows Removed by Filter: 500
> Planning time: 0.025 ms
> Execution time: 600.917 ms
>
> As opposed to:
> EXPLAIN ANALYZE SELECT * FROM t_random WHERE *s > 500;*
> QUERY
> PLAN
>
> ------------------------------------------------------------------------------------------------------------------
> Seq Scan on t_random (cost=0.00..14584.00 rows=*699546* width=37)
> (actual time=0.080..75.001 rows=*699500* loops=1)
> Filter: (s > 500)
> Rows Removed by Filter: 500
> Planning time: 0.029 ms
> Execution time: 98.257 ms
>
> The first query plan has a horrendous query time and estimations using
> test_func(). I'm assuming because the function has 0 statistics on the
> table itself. It can't base its' input off any index, meaning it has no
> sense of how big the table is correct? I'm not exactly sure the reason here.
>

​Both are sequential scans...the most likely reason is the overhead of
invoking a volatile plpgsql function 700,000 times

>
> Let's assume that this function is absolutely necessary and that it
> changes based on some environment variables, while maintaining only a
> single table. We can not have a static column with fixed boolean values
> since this column will always be changing depending on the environment.
> Would a good solution then be to use views and create a column based on the
> function?
>
> CREATE VIEW t_view AS SELECT *, test_func(s) AS passed FROM t_random;
>


The addition of the view is immaterial from a performance standpoint - only
usability is affected.

>
> Running:
> EXPLAIN ANALYZE SELECT * FROM t_view WHERE passed = true:
> QUERY
> PLAN
>
> --------------------------------------------------------------------------------------------------------------------------
> Subquery Scan on t_view (cost=0.00..194834.00 rows=*350000* width=38)
> (actual time=2.198..661.008 rows=*699500* loops=1)
> Filter: t_view.test_func
> Rows Removed by Filter: 500
> -> Seq Scan on t_random (cost=0.00..187834.00 rows=*700000*
> width=37) (actual time=0.026..579.125 rows=*700000* loops=1)
> Planning time: 0.040 ms
> Execution time: 690.057 mss
>
> The query plan does a bit better but is this the best plan I can hope to
> achieve with the specific function? Materialized views are not an option
> since the environment is changing constantly and I would like an
> always-up-to-date database.
>
>
​I'm lost but maybe someone else will want to try and figure out you mean.
Otherwise consider a concrete example.

​If the output of the function is truly non-immutable you are going to have
significant difficulty getting any kind of optimization during execution -
the best you can hope for is to write the function in question as
efficiently as possible and only call it when all other possible conditions
already pass.​ IOW, you will always be faced with a sequential scan when
it comes to evaluating the function's result.

With an immutable function you could create a functional index...

David J.

In response to

Responses

Browse pgsql-novice by date

  From Date Subject
Next Message Steve Crawford 2016-04-21 20:17:27 Re: postgres connection issues
Previous Message Markus Metz 2016-04-21 19:25:02 postgres connection issues