| From: | Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru> | 
|---|---|
| To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> | 
| Cc: | Andrei Lepikhov <lepihov(at)gmail(dot)com>, Ivan Kush <ivan(dot)kush(at)tantorlabs(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> | 
| Subject: | Re: Replace IN VALUES with ANY in WHERE clauses during optimization | 
| Date: | 2025-04-02 14:33:43 | 
| Message-ID: | d621919d-904d-40eb-ab72-3f118d6fde52@postgrespro.ru | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On 01.04.2025 17:23, Alena Rybakina wrote:
>
> Hi, Alexander!
>
> On 01.04.2025 15:07, Alexander Korotkov wrote:
>> Hi, Alena!
>>
>> On Tue, Apr 1, 2025 at 2:11 AM Alena Rybakina 
>> <a(dot)rybakina(at)postgrespro(dot)ru> wrote:
>>
>>     4.1) explain analyze SELECT ten
>>
>>     FROM onek t WHERE unique1 IN ( VALUES (0), ((2 IN ( SELECT
>>     unique2 FROM onek c WHERE c.unique2 in
>>     ((values(0),(2))))::integer)) );
>>
>>     QUERY PLAN
>>     -------------------------------------------------------------------------------------------------------------
>>     Seq Scan on onek t (cost=180.11..410.25 rows=2 width=6) (actual
>>     time=5.014..13.256 rows=3.00 loops=1) Filter: (unique1 = ANY
>>     (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer])) Rows
>>     Removed by Filter: 10005 *Buffers: shared hit=110* SubPlan 1 ->
>>     Seq Scan on onek c (cost=0.00..180.10 rows=3 width=4) (actual
>>     time=0.022..4.951 rows=2.00 loops=1) Filter: (unique2 = ANY
>>     ('{0,2}'::integer[])) Rows Removed by Filter: 10006 *Buffers:
>>     shared hit=55* Planning: Buffers: shared hit=6 dirtied=1 Planning
>>     Time: 0.502 ms Execution Time: 13.348 ms (13 rows)
>>
>>     The query plan without our patch:
>>
>>     --------------------------------------------------------------------------------------------------------------------------------------------
>>     Hash Semi Join (cost=0.05..181.42 rows=2 width=6) (actual
>>     time=5.072..9.076 rows=3.00 loops=1) Hash Cond: (t.unique1 =
>>     "*VALUES*".column1) *Buffers: shared hit=55 read=55* -> Seq Scan
>>     on onek t (cost=0.00..155.08 rows=10008 width=10) (actual
>>     time=0.145..1.802 rows=10008.00 loops=1) *Buffers: shared hit=52
>>     read=3* -> Hash (cost=0.03..0.03 rows=2 width=4) (actual
>>     time=4.908..4.912 rows=2.00 loops=1) Buckets: 1024 Batches: 1
>>     Memory Usage: 9kB *Buffers: shared hit=3 read=52* -> Values Scan
>>     on "*VALUES*" (cost=0.00..0.03 rows=2 width=4) (actual
>>     time=0.003..4.901 rows=2.00 loops=1) *Buffers: shared hit=3
>>     read=52* SubPlan 1 -> Hash Semi Join (cost=0.05..181.42 rows=2
>>     width=4) (actual time=0.036..4.861 rows=2.00 loops=1) Hash Cond:
>>     (c.unique2 = "*VALUES*_1".column1) *Buffers: shared hit=3
>>     read=52* -> Seq Scan on onek c (cost=0.00..155.08 rows=10008
>>     width=4) (actual time=0.009..2.120 rows=10008.00 loops=1)
>>     *Buffers: shared hit=3 read=52* -> Hash (cost=0.03..0.03 rows=2
>>     width=4) (actual time=0.006..0.008 rows=2.00 loops=1) Buckets:
>>     1024 Batches: 1 Memory Usage: 9kB -> Values Scan on "*VALUES*_1"
>>     (cost=0.00..0.03 rows=2 width=4) (actual time=0.001..0.002
>>     rows=2.00 loops=1) Planning: Buffers: shared hit=102 read=22
>>     Planning Time: 1.853 ms Execution Time: 9.281 ms (23 rows)
>>
>>
>> I think I managed to understand what is going on.
>>
>> When we run a query with SOAP over a constant array 
>> then convert_saop_to_hashed_saop_walker() provides acceleration with 
>> hashing.
>>
>> # explain analyze select * from test where val IN (5000, 4000, 9000, 
>> 2000, 1000, 140050);
>>                                               QUERY PLAN
>> -------------------------------------------------------------------------------------------------------
>>  Seq Scan on test  (cost=0.00..21925.00 rows=6 width=4) (actual 
>> time=2.015..223.984 rows=6.00 loops=1)
>>    Filter: (val = ANY ('{5000,4000,9000,2000,1000,140050}'::integer[]))
>>    Rows Removed by Filter: 999994
>>    Buffers: shared hit=2228 read=2197
>>  Planning Time: 0.246 ms
>>  Execution Time: 224.036 ms
>> (6 rows)
>>
>> But when there is expression or subselect, then hashing doesn't work 
>> and query becomes slower.
>>
>> # explain analyze select * from test where val IN (5000, 4000, 9000, 
>> 2000, 1000, (select 140050));
>>                                               QUERY PLAN
>> -------------------------------------------------------------------------------------------------------
>>  Seq Scan on test  (cost=0.01..21925.01 rows=6 width=4) (actual 
>> time=0.904..396.495 rows=6.00 loops=1)
>>    Filter: (val = ANY (ARRAY[5000, 4000, 9000, 2000, 1000, (InitPlan 
>> 1).col1]))
>>    Rows Removed by Filter: 999994
>>    Buffers: shared hit=2292 read=2133
>>    InitPlan 1
>>      ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual 
>> time=0.002..0.002 rows=1.00 loops=1)
>>  Planning Time: 0.160 ms
>>  Execution Time: 396.538 ms
>> (8 rows)
>>
>> In contrast, hashing is always available with VALUES.
>>
>> # explain analyze select * from test where val in (VALUES (5000), 
>> (4000), (9000), (2000), (1000), ((select 140050)));
>>  QUERY PLAN
>> ------------------------------------------------------------------------------------------------------------------------
>>  Hash Semi Join  (cost=0.16..17050.23 rows=6 width=4) (actual 
>> time=1.589..225.061 rows=6.00 loops=1)
>>    Hash Cond: (test.val = "*VALUES*".column1)
>>    Buffers: shared hit=2356 read=2069
>>    InitPlan 1
>>      ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual 
>> time=0.003..0.003 rows=1.00 loops=1)
>>    ->  Seq Scan on test  (cost=0.00..14425.00 rows=1000000 width=4) 
>> (actual time=0.460..91.912 rows=1000000.00 loops=1)
>>          Buffers: shared hit=2356 read=2069
>>    ->  Hash  (cost=0.08..0.08 rows=6 width=4) (actual 
>> time=0.049..0.050 rows=6.00 loops=1)
>>          Buckets: 1024  Batches: 1  Memory Usage: 9kB
>>          ->  Values Scan on "*VALUES*"  (cost=0.00..0.08 rows=6 
>> width=4) (actual time=0.009..0.032 rows=6.00 loops=1)
>>  Planning Time: 0.627 ms
>>  Execution Time: 225.155 ms
>> (12 rows)
>>
>> I think we should allow our transformation only when the array is 
>> constant (attached patchset).
>
> Yes, I agree with your conclusions; however, I noticed that you didn’t 
> take Param-type variables into account.
> These still get executed during the VALUES -> ANY transformation (see 
> regression tests).
>
> +PREPARE test2 (int,numeric, text) AS
> +  SELECT ten FROM onek
> +  WHERE sin(two)*four/($3::real) IN (VALUES (2), ($2), ($2), ($1));
> +-- VTA forbidden because of unresolved casting of numeric parameter 
> to common type
> +EXPLAIN (COSTS OFF) EXECUTE test2(2, 2, '2');
> +                                                         QUERY PLAN
> +-----------------------------------------------------------------------------------------------------------------------------
> + Seq Scan on onek
> +   Filter: (((sin((two)::double precision) * (four)::double 
> precision) / '2'::real) = ANY ('{2,2,2,2}'::double precision[]))
> +(2 rows)
> +
> +PREPARE test3 (int,int, text) AS
> +  SELECT ten FROM onek
> +  WHERE sin(two)*four/($3::real) IN (VALUES (2), ($2), ($2), ($1));
> +EXPLAIN (COSTS OFF) EXECUTE test3(2, 2, '2');
> +                                                         QUERY PLAN
> +-----------------------------------------------------------------------------------------------------------------------------
> + Seq Scan on onek
> +   Filter: (((sin((two)::double precision) * (four)::double 
> precision) / '2'::real) = ANY ('{2,2,2,2}'::double precision[]))
> +(2 rows)
>
> In my opinion, we can apply the VALUES ->ANY transformation to them as 
> well. What do you think? I ran some queries and didn’t notice any 
> significant performance degradation.
>
> create table test (x int);
> insert into test select id from generate_series(1,1000) id;
> PREPARE test4 (int,int, int) AS select * from test where x IN ($1, $2, 
> $3);
> PREPARE test3 (int,int, int) AS select * from test where x IN ($1, $2,
>  (select $3));
> EXPLAIN ANALYZE EXECUTE test4(2, 2, 2);
>                                             QUERY PLAN
> --------------------------------------------------------------------------------------------------
>  Seq Scan on test  (cost=0.00..18.75 rows=3 width=4) (actual 
> time=0.016..0.353 rows=1.00 loops=1)
>    Filter: (x = ANY (ARRAY[$1, $2, $3]))
>    Rows Removed by Filter: 999
>    Buffers: shared hit=5
>  Planning:
>    Buffers: shared hit=20
>  Planning Time: 0.266 ms
>  Execution Time: 0.367 ms
> (8 rows)
>
> alena(at)postgres=# EXPLAIN ANALYZE EXECUTE test3(2, 2, 2);
>                                             QUERY PLAN
> --------------------------------------------------------------------------------------------------
>  Seq Scan on test  (cost=0.01..18.76 rows=3 width=4) (actual 
> time=0.072..1.379 rows=1.00 loops=1)
>    Filter: (x = ANY (ARRAY[2, 2, (InitPlan 1).col1]))
>    Rows Removed by Filter: 999
>    Buffers: shared hit=5
>    InitPlan 1
>      ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual 
> time=0.003..0.003 rows=1.00 loops=1)
>  Planning Time: 0.350 ms
>  Execution Time: 1.431 ms
> (8 rows)
>
>
> alena(at)postgres=# PREPARE test6 (int,int, int) AS select * from test 
> where x IN (values($1), ($2), ($3));
> PREPARE
> alena(at)postgres=# EXPLAIN ANALYZE EXECUTE test6(2, 2, 2);
>                                             QUERY PLAN
> --------------------------------------------------------------------------------------------------
>  Seq Scan on test  (cost=0.00..18.75 rows=3 width=4) (actual 
> time=0.055..0.683 rows=1.00 loops=1)
>    Filter: (x = ANY ('{2,2,2}'::integer[]))
>    Rows Removed by Filter: 999
>    Buffers: shared hit=5
>  Planning Time: 0.230 ms
>  Execution Time: 0.724 ms
> (6 rows)
>
> We can’t use hashing for them, but without this transformation, we 
> still have to perform a join.
>
> ----------------------------------------------------------------------------------------------------------------------
>  Hash Semi Join  (cost=0.08..17.73 rows=3 width=4) (actual 
> time=0.124..0.943 rows=1.00 loops=1)
>    Hash Cond: (test.x = "*VALUES*".column1)
>    Buffers: shared hit=5
>    ->  Seq Scan on test  (cost=0.00..15.00 rows=1000 width=4) (actual 
> time=0.051..0.389 rows=1000.00 loops=1)
>          Buffers: shared hit=5
>    ->  Hash  (cost=0.04..0.04 rows=3 width=4) (actual 
> time=0.028..0.030 rows=3.00 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 9kB
>          ->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=3 
> width=4) (actual time=0.004..0.010 rows=3.00 loops=1)
>  Planning:
>    Buffers: shared hit=105 read=1
>  Planning Time: 2.176 ms
>  Execution Time: 1.077 ms
> (12 rows)
>
> So, I think we can bring it back and construct the Array node based on 
> the have_param flag.
>
> foreach (lc, rte->values_lists)
> +    {
> +        List *elem = lfirst(lc);
> +        Node *value = linitial(elem);
> +
> +        value = eval_const_expressions(NULL, value);
> +
> +        if (!IsA(value, Const))
> +            have_param = true;
> +
> +        consts = lappend(consts, value);
> +
> +    }
>
> Regarding the check for the presence of Var elements before the 
> transformation, I think we should, for now, restore the walker 
> function (values_simplicity_check_walker) that
> traverses the query to identify Var nodes. This function was included 
> in the initial version of the patch:
>
> +/*
> + * The function traverses the tree looking for elements of type var.
> + * If it finds it, it returns true.
> + */
> +static bool
> +values_simplicity_check_walker(Node *node, void *ctx)
> +{
> +    if (node == NULL)
> +    {
> +        return false;
> +    }
> +    else if(IsA(node, Var))
> +        return true;
> +    else if(IsA(node, Query))
> +        return query_tree_walker((Query *) node,
> +  values_simplicity_check_walker,
> +                                 (void*) ctx,
> +                                 QTW_EXAMINE_RTES_BEFORE);
> +
> +    return expression_tree_walker(node, values_simplicity_check_walker,
> +                                  (void *) ctx);
> +}
>
>> In future we may implement dynamic SAOP hashing, and then allow our 
>> transformation in more cases.
> I agree with your suggestion) Thank you for your interest to this 
> subject and contribution!
I prepared a patch according to my suggestions, it just checks that the 
transformation is not carried out if there is a var element, there are 
changes only in one test, but I think it is correct.
diff -U3 
/home/alena/postgrespro_or3/src/test/regress/expected/subselect.out 
/home/alena/postgrespro_or3/src/test/regress/results/subselect.out
--- /home/alena/postgrespro_or3/src/test/regress/expected/subselect.out 
2025-04-02 02:50:07.018329864 +0300
+++ /home/alena/postgrespro_or3/src/test/regress/results/subselect.out 
2025-04-02 17:27:09.845104001 +0300
@@ -3027,18 +3027,15 @@
  SELECT ten FROM onek t WHERE unique1 IN (VALUES (0), ((2 IN
    (SELECT (3)))::integer)
  );
-                     QUERY PLAN
-----------------------------------------------------
- Nested Loop
-   ->  Unique
-         ->  Sort
-               Sort Key: "*VALUES*".column1
-               ->  Values Scan on "*VALUES*"
-                     SubPlan 1
-                       ->  Result
-   ->  Index Scan using onek_unique1 on onek t
-         Index Cond: (unique1 = "*VALUES*".column1)
-(9 rows)
+                                           QUERY PLAN
+------------------------------------------------------------------------------------------------
+ Bitmap Heap Scan on onek t
+   Recheck Cond: (unique1 = ANY (ARRAY[0, ((ANY (2 = (hashed SubPlan 
1).col1)))::integer]))
+   ->  Bitmap Index Scan on onek_unique1
+         Index Cond: (unique1 = ANY (ARRAY[0, ((ANY (2 = (hashed 
SubPlan 1).col1)))::integer]))
+   SubPlan 1
+     ->  Result
+(6 rows)
  -- Alow to transformation and hold conversion between types of colemns and
  -- declared type of column pointed in RTE
-- 
Regards,
Alena Rybakina
Postgres Professional
| Attachment | Content-Type | Size | 
|---|---|---|
| values_to_any.diff.no-cfbot | text/plain | 3.1 KB | 
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Robert Haas | 2025-04-02 14:40:38 | Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints | 
| Previous Message | George MacKerron | 2025-04-02 14:15:24 | Re: Making sslrootcert=system work on Windows psql |