From: | Dimitrios Apostolou <jimis(at)gmx(dot)net> |
---|---|
To: | pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Inefficient query plan for SELECT ... EXCEPT ... |
Date: | 2023-10-31 22:41:33 |
Message-ID: | 86c352b6-c0ec-8c4d-9506-297506b1b5da@gmx.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hello list,
I'm getting an inefficient query plan for a SELECT ... EXCEPT ... query,
where the left side is a very short table (even zero-length sometimes, but
also also rarely can be as long as 200K rows), and the right side is a
table with 10M UNIQUE NOT NULL rows:
> \d test_datatags
Table "public.test_datatags"
Column | Type | Collation | Nullable | Default
----------------+---------+-----------+----------+----------------------------------
test_datatag_n | integer | | not null | generated by default as identity
test_datatag | text | | not null |
Indexes:
"test_datatags_pkey" PRIMARY KEY, btree (test_datatag_n)
"test_datatags_test_datatag_key" UNIQUE CONSTRAINT, btree (test_datatag)
Follows a simplified and pathological case, that takes 1.7s even though
the left side is empty:
> BEGIN;
> CREATE TEMPORARY TABLE tmp_table(d1 TEXT) ON COMMIT DROP;
> ANALYZE VERBOSE tmp_table ;
INFO: analyzing "pg_temp_9.tmp_table"
INFO: "tmp_table": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
> EXPLAIN (ANALYSE,VERBOSE,BUFFERS,SETTINGS)
SELECT DISTINCT d1
FROM tmp_table
WHERE d1 IS NOT NULL
EXCEPT
SELECT test_datatag FROM test_datatags;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
HashSetOp Except (cost=0.00..299854.08 rows=1 width=36) (actual time=1726.470..1726.472 rows=0 loops=1)
Output: "*SELECT* 1".d1, (0)
Buffers: shared hit=1259 read=58800
I/O Timings: shared/local read=77.713
-> Append (cost=0.00..278054.53 rows=8719821 width=36) (actual time=3.754..1287.901 rows=8702840 loops=1)
Buffers: shared hit=1259 read=58800
I/O Timings: shared/local read=77.713
-> Subquery Scan on "*SELECT* 1" (cost=0.00..0.02 rows=1 width=36) (actual time=0.003..0.003 rows=0 loops=1)
Output: "*SELECT* 1".d1, 0
-> HashAggregate (cost=0.00..0.01 rows=1 width=32) (actual time=0.002..0.003 rows=0 loops=1)
Output: tmp_table.d1
Group Key: tmp_table.d1
Batches: 1 Memory Usage: 24kB
-> Seq Scan on pg_temp.tmp_table (cost=0.00..0.00 rows=1 width=32) (actual time=0.001..0.001 rows=0 loops=1)
Output: tmp_table.d1
Filter: (tmp_table.d1 IS NOT NULL)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..234455.40 rows=8719820 width=26) (actual time=3.751..943.850 rows=8702840 loops=1)
Output: "*SELECT* 2".test_datatag, 1
Buffers: shared hit=1259 read=58800
I/O Timings: shared/local read=77.713
-> Seq Scan on public.test_datatags (cost=0.00..147257.20 rows=8719820 width=22) (actual time=3.747..515.420 rows=8702840 loops=1)
Output: test_datatags.test_datatag
Buffers: shared hit=1259 read=58800
I/O Timings: shared/local read=77.713
Settings: effective_io_concurrency = '0', enable_partitionwise_aggregate = 'on', enable_partitionwise_join = 'on', hash_mem_multiplier = '1',
maintenance_io_concurrency = '0', max_parallel_workers = '12', max_parallel_workers_per_gather = '0', temp_buffers = '64MB'
Planning:
Buffers: shared hit=2
Planning Time: 0.055 ms
JIT:
Functions: 15
Options: Inlining false, Optimization false, Expressions true,
Deforming true
Timing: Generation 0.317 ms, Inlining 0.000 ms, Optimization 0.179 ms,
Emission 3.542 ms, Total 4.037 ms
Execution Time: 1726.835 ms
(33 rows)
I'm wondering why the planner doesn't see that the left table is very small and follow a different path.
From an abstract computer science POV, I would
1. sort the left table (the right one is already indexed)
2. "merge" the two tables, by walking them in-order in parallel and excluding the matches
3. stop when the left table is exhausted, which would happen very early.
Is this worth a bug report? I can file one if the issue is not known.
Or am I misunderstanding the implications of the SELECT-EXCEPT query?
In the meantime I have replaced the query with a LEFT OUTER JOIN which
performs much better, and I believe is equivalent. I find it less readable
than the query in question though. Plus, I have a bunch of SELECT-EXCEPT
queries (with smaller right-side tables) in my application that I would
hate to change them all to the ugliest equivalent. Under what conditions
would the above query plan perform well?
Thanks in advance,
Dimitris
From | Date | Subject | |
---|---|---|---|
Next Message | David G. Johnston | 2023-10-31 22:56:01 | Re: Inefficient query plan for SELECT ... EXCEPT ... |
Previous Message | veem v | 2023-10-31 21:49:29 | Question on Aurora postgres |