From: | Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru> |
---|---|
To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
Cc: | Andrei Lepikhov <lepihov(at)gmail(dot)com>, Ivan Kush <ivan(dot)kush(at)tantorlabs(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> |
Subject: | Re: Replace IN VALUES with ANY in WHERE clauses during optimization |
Date: | 2025-04-01 14:23:02 |
Message-ID: | 11eac8f5-18e1-42ba-a989-23e8e2346860@postgrespro.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi, Alexander!
On 01.04.2025 15:07, Alexander Korotkov wrote:
> Hi, Alena!
>
> On Tue, Apr 1, 2025 at 2:11 AM Alena Rybakina
> <a(dot)rybakina(at)postgrespro(dot)ru> wrote:
>
> 4.1) explain analyze SELECT ten
>
> FROM onek t WHERE unique1 IN ( VALUES (0), ((2 IN ( SELECT unique2
> FROM onek c WHERE c.unique2 in ((values(0),(2))))::integer)) );
>
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------
> Seq Scan on onek t (cost=180.11..410.25 rows=2 width=6) (actual
> time=5.014..13.256 rows=3.00 loops=1) Filter: (unique1 = ANY
> (ARRAY[0, ((ANY (2 = (hashed SubPlan 1).col1)))::integer])) Rows
> Removed by Filter: 10005 *Buffers: shared hit=110* SubPlan 1 ->
> Seq Scan on onek c (cost=0.00..180.10 rows=3 width=4) (actual
> time=0.022..4.951 rows=2.00 loops=1) Filter: (unique2 = ANY
> ('{0,2}'::integer[])) Rows Removed by Filter: 10006 *Buffers:
> shared hit=55* Planning: Buffers: shared hit=6 dirtied=1 Planning
> Time: 0.502 ms Execution Time: 13.348 ms (13 rows)
>
> The query plan without our patch:
>
> --------------------------------------------------------------------------------------------------------------------------------------------
> Hash Semi Join (cost=0.05..181.42 rows=2 width=6) (actual
> time=5.072..9.076 rows=3.00 loops=1) Hash Cond: (t.unique1 =
> "*VALUES*".column1) *Buffers: shared hit=55 read=55* -> Seq Scan
> on onek t (cost=0.00..155.08 rows=10008 width=10) (actual
> time=0.145..1.802 rows=10008.00 loops=1) *Buffers: shared hit=52
> read=3* -> Hash (cost=0.03..0.03 rows=2 width=4) (actual
> time=4.908..4.912 rows=2.00 loops=1) Buckets: 1024 Batches: 1
> Memory Usage: 9kB *Buffers: shared hit=3 read=52* -> Values Scan
> on "*VALUES*" (cost=0.00..0.03 rows=2 width=4) (actual
> time=0.003..4.901 rows=2.00 loops=1) *Buffers: shared hit=3
> read=52* SubPlan 1 -> Hash Semi Join (cost=0.05..181.42 rows=2
> width=4) (actual time=0.036..4.861 rows=2.00 loops=1) Hash Cond:
> (c.unique2 = "*VALUES*_1".column1) *Buffers: shared hit=3 read=52*
> -> Seq Scan on onek c (cost=0.00..155.08 rows=10008 width=4)
> (actual time=0.009..2.120 rows=10008.00 loops=1) *Buffers: shared
> hit=3 read=52* -> Hash (cost=0.03..0.03 rows=2 width=4) (actual
> time=0.006..0.008 rows=2.00 loops=1) Buckets: 1024 Batches: 1
> Memory Usage: 9kB -> Values Scan on "*VALUES*_1" (cost=0.00..0.03
> rows=2 width=4) (actual time=0.001..0.002 rows=2.00 loops=1)
> Planning: Buffers: shared hit=102 read=22 Planning Time: 1.853 ms
> Execution Time: 9.281 ms (23 rows)
>
>
> I think I managed to understand what is going on.
>
> When we run a query with SOAP over a constant array
> then convert_saop_to_hashed_saop_walker() provides acceleration with
> hashing.
>
> # explain analyze select * from test where val IN (5000, 4000, 9000,
> 2000, 1000, 140050);
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------
> Seq Scan on test (cost=0.00..21925.00 rows=6 width=4) (actual
> time=2.015..223.984 rows=6.00 loops=1)
> Filter: (val = ANY ('{5000,4000,9000,2000,1000,140050}'::integer[]))
> Rows Removed by Filter: 999994
> Buffers: shared hit=2228 read=2197
> Planning Time: 0.246 ms
> Execution Time: 224.036 ms
> (6 rows)
>
> But when there is expression or subselect, then hashing doesn't work
> and query becomes slower.
>
> # explain analyze select * from test where val IN (5000, 4000, 9000,
> 2000, 1000, (select 140050));
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------
> Seq Scan on test (cost=0.01..21925.01 rows=6 width=4) (actual
> time=0.904..396.495 rows=6.00 loops=1)
> Filter: (val = ANY (ARRAY[5000, 4000, 9000, 2000, 1000, (InitPlan
> 1).col1]))
> Rows Removed by Filter: 999994
> Buffers: shared hit=2292 read=2133
> InitPlan 1
> -> Result (cost=0.00..0.01 rows=1 width=4) (actual
> time=0.002..0.002 rows=1.00 loops=1)
> Planning Time: 0.160 ms
> Execution Time: 396.538 ms
> (8 rows)
>
> In contrast, hashing is always available with VALUES.
>
> # explain analyze select * from test where val in (VALUES (5000),
> (4000), (9000), (2000), (1000), ((select 140050)));
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
> Hash Semi Join (cost=0.16..17050.23 rows=6 width=4) (actual
> time=1.589..225.061 rows=6.00 loops=1)
> Hash Cond: (test.val = "*VALUES*".column1)
> Buffers: shared hit=2356 read=2069
> InitPlan 1
> -> Result (cost=0.00..0.01 rows=1 width=4) (actual
> time=0.003..0.003 rows=1.00 loops=1)
> -> Seq Scan on test (cost=0.00..14425.00 rows=1000000 width=4)
> (actual time=0.460..91.912 rows=1000000.00 loops=1)
> Buffers: shared hit=2356 read=2069
> -> Hash (cost=0.08..0.08 rows=6 width=4) (actual
> time=0.049..0.050 rows=6.00 loops=1)
> Buckets: 1024 Batches: 1 Memory Usage: 9kB
> -> Values Scan on "*VALUES*" (cost=0.00..0.08 rows=6
> width=4) (actual time=0.009..0.032 rows=6.00 loops=1)
> Planning Time: 0.627 ms
> Execution Time: 225.155 ms
> (12 rows)
>
> I think we should allow our transformation only when the array is
> constant (attached patchset).
Yes, I agree with your conclusions; however, I noticed that you didn’t
take Param-type variables into account.
These still get executed during the VALUES -> ANY transformation (see
regression tests).
+PREPARE test2 (int,numeric, text) AS
+ SELECT ten FROM onek
+ WHERE sin(two)*four/($3::real) IN (VALUES (2), ($2), ($2), ($1));
+-- VTA forbidden because of unresolved casting of numeric parameter to
common type
+EXPLAIN (COSTS OFF) EXECUTE test2(2, 2, '2');
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------------------
+ Seq Scan on onek
+ Filter: (((sin((two)::double precision) * (four)::double precision)
/ '2'::real) = ANY ('{2,2,2,2}'::double precision[]))
+(2 rows)
+
+PREPARE test3 (int,int, text) AS
+ SELECT ten FROM onek
+ WHERE sin(two)*four/($3::real) IN (VALUES (2), ($2), ($2), ($1));
+EXPLAIN (COSTS OFF) EXECUTE test3(2, 2, '2');
+ QUERY PLAN
+-----------------------------------------------------------------------------------------------------------------------------
+ Seq Scan on onek
+ Filter: (((sin((two)::double precision) * (four)::double precision)
/ '2'::real) = ANY ('{2,2,2,2}'::double precision[]))
+(2 rows)
In my opinion, we can apply the VALUES ->ANY transformation to them as
well. What do you think? I ran some queries and didn’t notice any
significant performance degradation.
create table test (x int);
insert into test select id from generate_series(1,1000) id;
PREPARE test4 (int,int, int) AS select * from test where x IN ($1, $2, $3);
PREPARE test3 (int,int, int) AS select * from test where x IN ($1, $2,
(select $3));
EXPLAIN ANALYZE EXECUTE test4(2, 2, 2);
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..18.75 rows=3 width=4) (actual
time=0.016..0.353 rows=1.00 loops=1)
Filter: (x = ANY (ARRAY[$1, $2, $3]))
Rows Removed by Filter: 999
Buffers: shared hit=5
Planning:
Buffers: shared hit=20
Planning Time: 0.266 ms
Execution Time: 0.367 ms
(8 rows)
alena(at)postgres=# EXPLAIN ANALYZE EXECUTE test3(2, 2, 2);
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.01..18.76 rows=3 width=4) (actual
time=0.072..1.379 rows=1.00 loops=1)
Filter: (x = ANY (ARRAY[2, 2, (InitPlan 1).col1]))
Rows Removed by Filter: 999
Buffers: shared hit=5
InitPlan 1
-> Result (cost=0.00..0.01 rows=1 width=4) (actual
time=0.003..0.003 rows=1.00 loops=1)
Planning Time: 0.350 ms
Execution Time: 1.431 ms
(8 rows)
alena(at)postgres=# PREPARE test6 (int,int, int) AS select * from test
where x IN (values($1), ($2), ($3));
PREPARE
alena(at)postgres=# EXPLAIN ANALYZE EXECUTE test6(2, 2, 2);
QUERY PLAN
--------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..18.75 rows=3 width=4) (actual
time=0.055..0.683 rows=1.00 loops=1)
Filter: (x = ANY ('{2,2,2}'::integer[]))
Rows Removed by Filter: 999
Buffers: shared hit=5
Planning Time: 0.230 ms
Execution Time: 0.724 ms
(6 rows)
We can’t use hashing for them, but without this transformation, we still
have to perform a join.
----------------------------------------------------------------------------------------------------------------------
Hash Semi Join (cost=0.08..17.73 rows=3 width=4) (actual
time=0.124..0.943 rows=1.00 loops=1)
Hash Cond: (test.x = "*VALUES*".column1)
Buffers: shared hit=5
-> Seq Scan on test (cost=0.00..15.00 rows=1000 width=4) (actual
time=0.051..0.389 rows=1000.00 loops=1)
Buffers: shared hit=5
-> Hash (cost=0.04..0.04 rows=3 width=4) (actual time=0.028..0.030
rows=3.00 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3
width=4) (actual time=0.004..0.010 rows=3.00 loops=1)
Planning:
Buffers: shared hit=105 read=1
Planning Time: 2.176 ms
Execution Time: 1.077 ms
(12 rows)
So, I think we can bring it back and construct the Array node based on
the have_param flag.
foreach (lc, rte->values_lists)
+ {
+ List *elem = lfirst(lc);
+ Node *value = linitial(elem);
+
+ value = eval_const_expressions(NULL, value);
+
+ if (!IsA(value, Const))
+ have_param = true;
+
+ consts = lappend(consts, value);
+
+ }
Regarding the check for the presence of Var elements before the
transformation, I think we should, for now, restore the walker function
(values_simplicity_check_walker) that
traverses the query to identify Var nodes. This function was included in
the initial version of the patch:
+/*
+ * The function traverses the tree looking for elements of type var.
+ * If it finds it, it returns true.
+ */
+static bool
+values_simplicity_check_walker(Node *node, void *ctx)
+{
+ if (node == NULL)
+ {
+ return false;
+ }
+ else if(IsA(node, Var))
+ return true;
+ else if(IsA(node, Query))
+ return query_tree_walker((Query *) node,
+ values_simplicity_check_walker,
+ (void*) ctx,
+ QTW_EXAMINE_RTES_BEFORE);
+
+ return expression_tree_walker(node, values_simplicity_check_walker,
+ (void *) ctx);
+}
> In future we may implement dynamic SAOP hashing, and then allow our
> transformation in more cases.
I agree with your suggestion) Thank you for your interest to this
subject and contribution!
--
Regards,
Alena Rybakina
Postgres Professional
From | Date | Subject | |
---|---|---|---|
Next Message | David G. Johnston | 2025-04-01 14:27:24 | Re: in BeginCopyTo make materialized view using COPY TO instead of COPY (query). |
Previous Message | Andrew Jackson | 2025-04-01 14:19:31 | Re: Update LDAP Protocol in fe-connect.c to v3 |