Quick Links

[PATCH] Improve selectivity estimation for OR clauses with equality conditions on the same column

From:	Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	[PATCH] Improve selectivity estimation for OR clauses with equality conditions on the same column
Date:	2025-03-03 15:57:17
Message-ID:	3da4338d-a1b2-4a84-a569-51b97e6e8ac3@tantorlabs.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

hi hackers,

When estimating the selectivity of an OR clause without extended
statistics, we assume that the conditions are independent. However, it
is quite common to see queries like (a = 1 OR a = 2). In such cases, the
assumption of independence can lead to a significant underestimation of
selectivity

Even when comparing OR conditions to IN, the results can differ
significantly. This issue is particularly noticeable when the number of
predicates in the OR clause is small, the table is large, and ndistinct
is low. Here’s an example of such a query on a large table with low
ndistinct:

CREATE TABLE test_table ( id SERIAL PRIMARY KEY, a INT );
INSERT INTO test_table (a) SELECT 1 FROM generate_series(1, 500000); --
500 000 rows with a = 1
INSERT INTO test_table (a) SELECT 2 FROM generate_series(1, 250000); --
250 000 rows with a = 2
INSERT INTO test_table (a) SELECT 3 FROM generate_series(1, 250000); --
250 000 rows with a = 3
ANALYZE test_table;

EXPLAIN ANALYZE SELECT * FROM test_table WHERE a IN (1, 2);
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Seq Scan on test_table (cost=0.00..16925.00 rows=746800 width=8)
(actual time=0.055..50.064 rows=750000.00 loops=1)
   Filter: (a = ANY ('{1,2}'::integer[]))
   Rows Removed by Filter: 250000
   Buffers: shared hit=32 read=4393
Planning:
   Buffers: shared hit=33 read=15
Planning Time: 1.056 ms
Execution Time: 64.800 ms
(8 rows)

EXPLAIN ANALYZE SELECT * FROM test_table WHERE a = 1 OR a = 2;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Seq Scan on test_table (cost=0.00..19425.00 rows=622674 width=8)
(actual time=0.029..40.451 rows=750000.00 loops=1)
   Filter: ((a = 1) OR (a = 2))
   Rows Removed by Filter: 250000
   Buffers: shared hit=4425
Planning:
   Buffers: shared hit=6
Planning Time: 0.222 ms
Execution Time: 54.214 ms
(8 rows)

Before applying my patch, the estimated row count is significantly lower
than the actual row count due to the independence assumption.

I propose a patch that improves selectivity estimation when the OR
clause contains only equality conditions on the same column. The patch
currently supports both simple equality expressions and IN. If all
predicates meet these criteria, the selectivity is computed as a sum
rather than using the default formula.

Here are the results of applying the patch to the same example:
EXPLAIN ANALYZE SELECT * FROM test_table WHERE a = 1 OR a = 2;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Seq Scan on test_table (cost=0.00..19425.00 rows=746800 width=8)
(actual time=0.062..46.515 rows=750000.00 loops=1)
   Filter: ((a = 1) OR (a = 2))
   Rows Removed by Filter: 250000
   Buffers: shared read=4425
Planning:
   Buffers: shared hit=41 read=11
Planning Time: 1.019 ms
Execution Time: 61.013 ms
(8 rows)

This change improves cardinality estimation for such queries.

I am open to any suggestions, feedback, or improvements.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachment	Content-Type	Size
v1-0001-Improve-selectivity-estimation-for-OR-clauses.patch	text/x-patch	7.5 KB

Responses

Re: [PATCH] Improve selectivity estimation for OR clauses with equality conditions on the same column at 2025-03-06 10:43:37 from Ilia Evdokimov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Robert Haas	2025-03-03 16:05:50	Re: making EXPLAIN extensible
Previous Message	Tom Lane	2025-03-03 15:49:30	Re: Options to control remote transactions’ access/deferrable modes in postgres_fdw