Planner picks n² query plan when available

From: Toto guyoyg <thomas(dot)bessou(at)hotmail(dot)fr>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Planner picks n² query plan when available
Date: 2024-11-20 10:11:04
Message-ID: PA4P189MB12798715260E08EC020EBA5985212@PA4P189MB1279.EURP189.PROD.OUTLOOK.COM
Lists: pgsql-hackers

Offending O(n²) query:

```sql
SELECT id FROM indexed_table WHERE indexed_value = ANY (ARRAY[1,2,...])
```

I'm not posting this on the `pgsql-performance` mailing list because this is about fixing the issue, not working around it.
I'm not posting this on the `pgsql-bugs` mailing list because documentation states that performance issues are not bugs.
Please let me know if I'm not posting to the appropriate place.

From what I have been able to investigate so far, it looks like:

1. When performing a bitmap heap scan, if recheck is activated, execution is O(n²): each fetched row is rechecked by scanning the array linearly.
2. A Hash index forces that recheck: hash indexes only point to entries whose hash matches, the values themselves are not stored in the index, and hash collisions are possible, so a bitmap index scan on a Hash index forces the parent bitmap heap scan to perform a recheck. That recheck triggers the O(n²) behavior of point 1.
3. When a hash index is available, Postgres tends to prefer it over a B-tree index, ignoring the two properties above.
4. The EXPLAIN ANALYZE output shows that the planner always assumes arrays are of size 10 (rows=10), instead of using the estimated size of the subquery they are created from, or the actual size of the array provided as an argument. This may be the cause of 3.
5. If using a B-tree index but work_mem is not sufficient for the heap read to stay exact (as in Heap Blocks: exact=510 lossy=43738 in the output of EXPLAIN (ANALYZE, BUFFERS)), recheck is also activated, again causing O(n²).
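To illustrate point 1, here is a minimal Python sketch (not PostgreSQL source code) of a recheck that probes the array linearly for every fetched row. The function name and workload are invented for the illustration; the point is only that the comparison count grows quadratically:

```python
# Sketch (not PostgreSQL internals): models a recheck in which every heap
# row returned by the bitmap scan is re-tested with a linear scan over the
# array of the = ANY (...) condition.
def recheck_linear(rows, array):
    """Return matching rows, counting element comparisons performed."""
    comparisons = 0
    matched = []
    for row in rows:
        for elem in array:          # linear probe: O(len(array)) per row
            comparisons += 1
            if row == elem:
                matched.append(row)
                break
    return matched, comparisons

# Same workload shape as the reproducer: every row matches an array element.
for n in (1_000, 2_000):
    rows = list(range(n))
    _, cmps = recheck_linear(rows, rows)
    print(n, cmps)
# Doubling n roughly quadruples the comparison count (O(n²)), matching the
# ~4x execution-time jump observed between the 50k and 100k samples below.
```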

It looks like this could be improved/fixed by one or more of:

1. Using a hash set (or sort + binary search) for the recheck instead of always searching linearly, past a certain array length or even always.
2. Fixing the planner's assumption that all arrays are of size 10, using actual or estimated sizes instead, and taking them into account in the recheck-condition cost estimate. (This wouldn't fix recheck performance itself, but it would at least make the planner aware that lossy heap scans and Hash-index bitmap heap scans are n times more expensive than exact B-tree bitmap heap scans, and hopefully make it avoid picking them.)
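Proposed fix 1 can be sketched in Python as follows. This is only an illustration of the data-structure swap (the real change would live in PostgreSQL's C executor code, and the function name here is invented):

```python
# Sketch (not PostgreSQL internals) of proposed fix 1: build a hash set
# from the array once, then recheck each row with an O(1) membership
# probe instead of a linear scan over the array.
def recheck_hashed(rows, array):
    elements = set(array)  # built once: O(len(array))
    return [row for row in rows if row in elements]  # O(1) probe per row

# Same workload shape as the reproducer: every row matches.
rows = list(range(100_000))
assert recheck_hashed(rows, rows) == rows
```

Total work becomes O(n) instead of O(n²); the sort + binary search variant would be O(n log n), trading hash-table memory for comparison cost.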

Is my understanding correct?

Thanks,

PG Version: PostgreSQL 16.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.2.0, 64-bit

Reproducer:

```sql
BEGIN;
CREATE TABLE tmp_benchmark(
    id int8 NOT NULL GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    value text NOT NULL
);
-- Insert ~1M distinct random values
INSERT INTO tmp_benchmark(value) SELECT DISTINCT (floor(random() * 1000000000000)::int8)::text FROM generate_series(1,1000000);

ALTER TABLE tmp_benchmark ADD CONSTRAINT tmp_benchmark_unique_value_and_corresponding_index UNIQUE(value);

CREATE INDEX tmp_benchmark_benchmark_value_hash ON tmp_benchmark USING HASH (value);

ANALYZE tmp_benchmark;

-- Randomly sample 50k of those values
CREATE TEMPORARY TABLE lookup_test ON COMMIT DROP AS
WITH sub AS (SELECT value FROM tmp_benchmark LIMIT 10000000)
SELECT value FROM sub ORDER BY RANDOM() LIMIT 50000;

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tmp_benchmark WHERE value = ANY (SELECT value FROM lookup_test);
/* Gives:
Nested Loop  (cost=864.00..2376.52 rows=43520 width=20) (actual time=6.890..51.473 rows=50000 loops=1)
  Buffers: shared hit=102085, local hit=320
  ->  HashAggregate  (cost=864.00..866.00 rows=200 width=32) (actual time=6.879..11.140 rows=50000 loops=1)
        Group Key: lookup_test.value
        Batches: 1  Memory Usage: 3617kB
        Buffers: local hit=320
        ->  Seq Scan on lookup_test  (cost=0.00..755.20 rows=43520 width=32) (actual time=0.003..1.702 rows=50000 loops=1)
              Buffers: local hit=320
  ->  Index Scan using tmp_benchmark_benchmark_value_hash on tmp_benchmark  (cost=0.00..7.88 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=50000)
        Index Cond: (value = lookup_test.value)
        Rows Removed by Index Recheck: 0
        Buffers: shared hit=102085
Planning:
  Buffers: shared hit=18 read=1
Planning Time: 0.196 ms
Execution Time: 52.520 ms
*/
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark  (cost=904.09..943.13 rows=10 width=20) (actual time=22.804..13931.788 rows=50000 loops=1)
  Recheck Cond: (value = ANY ($0))
  Rows Removed by Index Recheck: 15
  Heap Blocks: exact=6369
  Buffers: shared hit=68885, local hit=320
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=864.00..864.01 rows=1 width=32) (actual time=4.784..4.785 rows=1 loops=1)
          Buffers: local hit=320
          ->  Seq Scan on lookup_test  (cost=0.00..755.20 rows=43520 width=32) (actual time=0.007..2.040 rows=50000 loops=1)
                Buffers: local hit=320
  ->  Bitmap Index Scan on tmp_benchmark_benchmark_value_hash  (cost=0.00..40.08 rows=10 width=0) (actual time=22.166..22.166 rows=50015 loops=1)
        Index Cond: (value = ANY ($0))
        Buffers: shared hit=62516, local hit=320
Planning:
  Buffers: shared hit=9
Planning Time: 0.112 ms
Execution Time: 13933.270 ms
*/

-- Randomly sample 100k of those values

CREATE TEMPORARY TABLE lookup_test_2 ON COMMIT DROP AS
WITH sub AS (SELECT value FROM tmp_benchmark LIMIT 10000000)
SELECT value FROM sub ORDER BY RANDOM() LIMIT 100000;

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tmp_benchmark WHERE value = ANY (SELECT value FROM lookup_test_2);
/* Gives:
Nested Loop  (cost=1555.20..3025.00 rows=78336 width=20) (actual time=15.661..104.241 rows=100000 loops=1)
  Buffers: shared hit=204153, local hit=576, temp read=20 written=44
  ->  HashAggregate  (cost=1555.20..1557.20 rows=200 width=32) (actual time=15.653..29.110 rows=100000 loops=1)
        Group Key: lookup_test_2.value
        Batches: 5  Memory Usage: 6977kB  Disk Usage: 232kB
        Buffers: local hit=576, temp read=20 written=44
        ->  Seq Scan on lookup_test_2  (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.007..3.692 rows=100000 loops=1)
              Buffers: local hit=576
  ->  Index Scan using tmp_benchmark_benchmark_value_hash on tmp_benchmark  (cost=0.00..7.88 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=100000)
        Index Cond: (value = lookup_test_2.value)
        Rows Removed by Index Recheck: 0
        Buffers: shared hit=204153
Planning:
  Buffers: shared hit=4
Planning Time: 0.146 ms
Execution Time: 106.359 ms
*/
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test_2) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark  (cost=1595.29..1634.33 rows=10 width=20) (actual time=45.983..55338.490 rows=100000 loops=1)
  Recheck Cond: (value = ANY ($0))
  Rows Removed by Index Recheck: 21
  Heap Blocks: exact=6370
  Buffers: shared hit=129063 read=2197, local hit=576
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=1555.20..1555.21 rows=1 width=32) (actual time=8.863..8.864 rows=1 loops=1)
          Buffers: local hit=576
          ->  Seq Scan on lookup_test_2  (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.007..3.696 rows=100000 loops=1)
                Buffers: local hit=576
  ->  Bitmap Index Scan on tmp_benchmark_benchmark_value_hash  (cost=0.00..40.08 rows=10 width=0) (actual time=45.436..45.436 rows=100025 loops=1)
        Index Cond: (value = ANY ($0))
        Buffers: shared hit=124890, local hit=576
Planning Time: 0.089 ms
Execution Time: 55342.836 ms
*/

-- Note how with 2x the number of values, execution time is 4x the above. Monitoring CPU while this query runs shows 100% CPU, unlike the subselect-based variant.

DROP INDEX tmp_benchmark_benchmark_value_hash;
-- Once we remove the hash index, even with the same 100k sample and a bitmap heap scan plan, it no longer exhibits O(n²) behavior

-- On the same 100k sample:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test_2) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark  (cost=1599.54..1638.58 rows=10 width=20) (actual time=234.701..256.963 rows=100000 loops=1)
  Recheck Cond: (value = ANY ($0))
  Heap Blocks: exact=6370
  Buffers: shared hit=296148 read=10225 written=767, local hit=576
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=1555.20..1555.21 rows=1 width=32) (actual time=9.425..9.426 rows=1 loops=1)
          Buffers: local hit=576
          ->  Seq Scan on lookup_test_2  (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.007..4.001 rows=100000 loops=1)
                Buffers: local hit=576
  ->  Bitmap Index Scan on tmp_benchmark_unique_value_and_corresponding_index  (cost=0.00..44.33 rows=10 width=0) (actual time=234.264..234.264 rows=100000 loops=1)
        Index Cond: (value = ANY ($0))
        Buffers: shared hit=296148 read=3855 written=763, local hit=576
Planning:
  Buffers: shared hit=8 read=3
Planning Time: 0.169 ms
Execution Time: 259.087 ms
*/

-- So despite doing a bitmap heap scan on the same 100k-element array as above, this one seemingly isn't O(n²).

-- With less work_mem, so that even the B-tree index scan performs a recheck:
SET LOCAL work_mem TO '64kB';
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test_2) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark  (cost=1599.54..1638.58 rows=10 width=20) (actual time=248.154..974679.087 rows=100000 loops=1)
  Recheck Cond: (value = ANY ($0))
  Rows Removed by Index Recheck: 802166
  Heap Blocks: exact=683 lossy=5687
  Buffers: shared hit=300000 read=6370, local hit=576
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=1555.20..1555.21 rows=1 width=32) (actual time=9.274..9.276 rows=1 loops=1)
          Buffers: local hit=576
          ->  Seq Scan on lookup_test_2  (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.009..3.944 rows=100000 loops=1)
                Buffers: local hit=576
  ->  Bitmap Index Scan on tmp_benchmark_unique_value_and_corresponding_index  (cost=0.00..44.33 rows=10 width=0) (actual time=234.699..234.699 rows=100000 loops=1)
        Index Cond: (value = ANY ($0))
        Buffers: shared hit=300000, local hit=576
Planning Time: 0.099 ms
Execution Time: 974687.222 ms
*/
-- => It's clear that the recheck is O(n²) when it is performed

ROLLBACK;
```
