Re: More efficient OR

From: PFC <lists(at)boutiquenumerique(dot)com>
To: KeithW(at)narrowpathinc(dot)com, "PostgreSQL SQL" <pgsql-sql(at)postgresql(dot)org>
Cc: "Sean Davis" <sdavis2(at)mail(dot)nih(dot)gov>, "Scott Marlowe" <smarlowe(at)g2switchworks(dot)com>
Subject: Re: More efficient OR
Date: 2005-02-16 18:52:13
Message-ID: opsmaz1bxath1vuj@musicbox
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql


You sound like you don't like the performance you get with OR or IN, from
this I deduce that you have a very large list of values to OR from. These
make huge queries which are not necessarily very fast ; also they are
un-preparable by their very nature (ie. the planner has to look at each
value, ponder its stats, think about it...) Basically a query with, say,
'column IN (100 values)' will make postgres work a lot more than a query
with 'column in (SELECT something which yields 100 values)'.

I have tested the following with good results :

- Write a very simple set returning function which takes an array as a
parameter and RETURN NEXT each array element in turn. It's just a FOR...
RETURN NEXT. Say you call it array_flatten( INTEGER[] ) or something.

- Then, instead of doing SELECT * FROM table WHERE id IN (1,4,77,586,1025)
do:
SELECT * FROM table WHERE id IN (SELECT * FROM
array_flatten( '{1,4,77,586,1025}' ) );
or :
SELECT t.* FROM table t, (SELECT * FROM array_flatten( '{1,4,77,586,1025}'
) ) foo WHERE t.id=foo.id;

The first one will do a uniqu'ing on the array, the second one will not.

You can also LEFT JOIN against your SRF to get the id's of the rows that
were not in the table (you cannot do this with IN)

And you can PREPARE the statement to something that will take an array as
a parameter and won't have to be parsed everytime.

Sometimes it can be a big performance boost. Try it !

However, if some value in your array matches a lot of rows in the table,
it will be slower than the seq scan which would have been triggered by the
planner actually seeing that value in the IN list and acting on it. But if
you KNOW your column is unique, there is no point in forcing the planner
to ponder each value in your list !

For an additional performance boost (likely negligible), you could sort
your array in the function (or even in your application code) to ease the
work of the index scanner, which will get a better cache hit rate.

If you have, say, 20.000 values to get, this is the only way.

Note that you could ask yourself why you need to get a lot of values. Are
you fetching stuff from the database, computing a list of rows to get,
then SELECTing them ? Then maybe you put something in the application that
should really be in the database ?

As a side note, it would be nice :

- if that set returning function was a fast C built-in (I think there's
one in contrib/intagg but obviously it works only for integers) because
it's a useful tool and building brick ; same for array_accum, and some
other commonly used five-lines aggregates and functions that everybody
recodes once.

- if postgresql used this kind of optimization for the SELECT * FROM table
WHERE id =ANY( array ) which currently uses a seq scan.

However, what IS nice from pg is that you can actually do the array SRF
trick and pull a lot of rows by id's, at blazing speed, just by writing a
three line function and tweaking your wuery.

> It sounds like IN will save some typing and code space but not decrease
> the
> execution time.
>
> BETWEEN won't work for my real life query because the limiting values are
> quite disparate.
>
> Kind Regards,
> Keith
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message PFC 2005-02-16 18:56:25 Re: Relation in tables
Previous Message KÖPFERL Robert 2005-02-16 18:36:41 Re: More efficient OR