Quick Links

Inefficient query plan for SELECT ... EXCEPT ...

From:	Dimitrios Apostolou <jimis(at)gmx(dot)net>
To:	pgsql-general(at)lists(dot)postgresql(dot)org
Subject:	Inefficient query plan for SELECT ... EXCEPT ...
Date:	2023-10-31 22:41:33
Message-ID:	86c352b6-c0ec-8c4d-9506-297506b1b5da@gmx.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Hello list,

I'm getting an inefficient query plan for a SELECT ... EXCEPT ... query,
where the left side is a very short table (even zero-length sometimes, but
also also rarely can be as long as 200K rows), and the right side is a
table with 10M UNIQUE NOT NULL rows:

Follows a simplified and pathological case, that takes 1.7s even though
the left side is empty:

> BEGIN;
> CREATE TEMPORARY TABLE tmp_table(d1 TEXT) ON COMMIT DROP;
> ANALYZE VERBOSE tmp_table ;
INFO: analyzing "pg_temp_9.tmp_table"
INFO: "tmp_table": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
> EXPLAIN (ANALYSE,VERBOSE,BUFFERS,SETTINGS)
SELECT DISTINCT d1
FROM tmp_table
WHERE d1 IS NOT NULL
EXCEPT
SELECT test_datatag FROM test_datatags;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
HashSetOp Except (cost=0.00..299854.08 rows=1 width=36) (actual time=1726.470..1726.472 rows=0 loops=1)
Output: "*SELECT* 1".d1, (0)
Buffers: shared hit=1259 read=58800
I/O Timings: shared/local read=77.713
-> Append (cost=0.00..278054.53 rows=8719821 width=36) (actual time=3.754..1287.901 rows=8702840 loops=1)
Buffers: shared hit=1259 read=58800
I/O Timings: shared/local read=77.713
-> Subquery Scan on "*SELECT* 1" (cost=0.00..0.02 rows=1 width=36) (actual time=0.003..0.003 rows=0 loops=1)
Output: "*SELECT* 1".d1, 0
-> HashAggregate (cost=0.00..0.01 rows=1 width=32) (actual time=0.002..0.003 rows=0 loops=1)
Output: tmp_table.d1
Group Key: tmp_table.d1
Batches: 1 Memory Usage: 24kB
-> Seq Scan on pg_temp.tmp_table (cost=0.00..0.00 rows=1 width=32) (actual time=0.001..0.001 rows=0 loops=1)
Output: tmp_table.d1
Filter: (tmp_table.d1 IS NOT NULL)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..234455.40 rows=8719820 width=26) (actual time=3.751..943.850 rows=8702840 loops=1)
Output: "*SELECT* 2".test_datatag, 1
Buffers: shared hit=1259 read=58800
I/O Timings: shared/local read=77.713
-> Seq Scan on public.test_datatags (cost=0.00..147257.20 rows=8719820 width=22) (actual time=3.747..515.420 rows=8702840 loops=1)
Output: test_datatags.test_datatag
Buffers: shared hit=1259 read=58800
I/O Timings: shared/local read=77.713
Settings: effective_io_concurrency = '0', enable_partitionwise_aggregate = 'on', enable_partitionwise_join = 'on', hash_mem_multiplier = '1',
maintenance_io_concurrency = '0', max_parallel_workers = '12', max_parallel_workers_per_gather = '0', temp_buffers = '64MB'
Planning:
Buffers: shared hit=2
Planning Time: 0.055 ms
JIT:
Functions: 15
Options: Inlining false, Optimization false, Expressions true,
Deforming true
Timing: Generation 0.317 ms, Inlining 0.000 ms, Optimization 0.179 ms,
Emission 3.542 ms, Total 4.037 ms
Execution Time: 1726.835 ms
(33 rows)

I'm wondering why the planner doesn't see that the left table is very small and follow a different path.
From an abstract computer science POV, I would

1. sort the left table (the right one is already indexed)
2. "merge" the two tables, by walking them in-order in parallel and excluding the matches
3. stop when the left table is exhausted, which would happen very early.

Is this worth a bug report? I can file one if the issue is not known.
Or am I misunderstanding the implications of the SELECT-EXCEPT query?

In the meantime I have replaced the query with a LEFT OUTER JOIN which
performs much better, and I believe is equivalent. I find it less readable
than the query in question though. Plus, I have a bunch of SELECT-EXCEPT
queries (with smaller right-side tables) in my application that I would
hate to change them all to the ugliest equivalent. Under what conditions
would the above query plan perform well?

Thanks in advance,
Dimitris

Responses

Re: Inefficient query plan for SELECT ... EXCEPT ... at 2023-10-31 22:56:01 from David G. Johnston
Re: Inefficient query plan for SELECT ... EXCEPT ... at 2023-10-31 22:57:03 from David Rowley

Browse pgsql-general by date

	From	Date	Subject
Next Message	David G. Johnston	2023-10-31 22:56:01	Re: Inefficient query plan for SELECT ... EXCEPT ...
Previous Message	veem v	2023-10-31 21:49:29	Question on Aurora postgres