From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | andrew(at)supernews(dot)com |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: slow IN() clause for many cases |
Date: | 2005-10-14 22:55:10 |
Message-ID: | 7976.1129330510@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Andrew - Supernews <andrew+nonews(at)supernews(dot)com> writes:
> With everything in cache, selecting 1000 random rows from a 200k row table,
> I get:
>
> for IN (list): planning time 47ms, execution time 27ms
> array/nestloop: planning time 8ms, execution time 34ms
The reason for the slow planning time is that IN (list) is currently
converted (by the grammar) into an OR structure:
(x = val1) OR (x = val2) OR (x = val3) OR (x = val4) OR ...
which means that the planner has to consider each subclause separately:
discover that it can be matched to an index, build the indexscan plan,
etc. Even just looking up all the "=" operators during parsing takes a
fair amount of time :-( The large number of plan nodes also imply
relatively slow executor startup.
It strikes me that now that we have the bitmap indexscan mechanism,
it wouldn't be hard to do better. I'm thinking that IN should be
converted to a ScalarArrayOpExpr, ie
x = ANY (ARRAY[val1,val2,val3,val4,...])
and then we could treat both OpExpr and ScalarArrayOpExpr as matching
an index when the LHS is the index key and the operator is in the
index's opclass. This wouldn't fit comfortably into a plain indexscan,
but a bitmap indexscan has already got the mechanism for ORing together
the results of several primitive indexscans (one per array element).
You just do the scans and insert all the results into your output
bitmap.
This approach would reduce the planner and executor-startup costs of
even a long IN list to be pretty comparable to a single index key
comparison. The actual runtime of the plan wouldn't change much,
I think, but it's the overhead that's hurting us here.
This also solves the problem mentioned earlier in the thread that
you can't make a prepared plan for varying lengths of IN-lists:
instead of using IN, you do something like
x = ANY($1::int[])
and then ship your lookup keys as a single array parameter instead of
multiple scalars.
The main objection I can see is that this technique requires the
elements of the IN list to be converted to a common datatype (ie, the
element datatype of the ARRAY construct). Historically our code has
made no such assumption. However, I believe that the SQL standard
expects that conversion to happen: "x IN (list)" is defined as
"x = ANY (VALUES list)" and <table value constructor> expects all its
rows to be converted to a common rowtype. So we could point to the
standard if anyone complains that the behavior changed.
Obviously this is not happening for 8.1, but it seems very do-able for
8.2.
Comments?
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2005-10-14 23:09:17 | Re: slow IN() clause for many cases |
Previous Message | Neil Conway | 2005-10-14 21:42:44 | Re: Open items |