Re: [PATCH] Improve selectivity estimation for OR clauses with equality conditions on the same column

From: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH] Improve selectivity estimation for OR clauses with equality conditions on the same column
Date: 2025-03-06 10:43:37
Message-ID: 720a76a3-59ad-4a88-ba5b-f47786056326@tantorlabs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 03.03.2025 18:57, Ilia Evdokimov wrote:
>
> hi hackers,
>
> When estimating the selectivity of an OR clause without extended
> statistics, we assume that the conditions are independent. However, it
> is quite common to see queries like (a = 1 OR a = 2). In such cases,
> the assumption of independence can lead to a significant
> underestimation of selectivity
>
> Even when comparing OR conditions to IN, the results can differ
> significantly. This issue is particularly noticeable when the number
> of predicates in the OR clause is small, the table is large, and
> ndistinct is low. Here’s an example of such a query on a large table
> with low ndistinct:
>
> CREATE TABLE test_table ( id SERIAL PRIMARY KEY, a INT );
> INSERT INTO test_table (a) SELECT 1 FROM generate_series(1, 500000);
> -- 500 000 rows with a = 1
> INSERT INTO test_table (a) SELECT 2 FROM generate_series(1, 250000);
> -- 250 000 rows with a = 2
> INSERT INTO test_table (a) SELECT 3 FROM generate_series(1, 250000);
> -- 250 000 rows with a = 3
> ANALYZE test_table;
>
> EXPLAIN ANALYZE SELECT * FROM test_table WHERE a IN (1, 2);
>                                                       QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
>  Seq Scan on test_table  (cost=0.00..16925.00 rows=746800 width=8)
> (actual time=0.055..50.064 rows=750000.00 loops=1)
>    Filter: (a = ANY ('{1,2}'::integer[]))
>    Rows Removed by Filter: 250000
>    Buffers: shared hit=32 read=4393
>  Planning:
>    Buffers: shared hit=33 read=15
>  Planning Time: 1.056 ms
>  Execution Time: 64.800 ms
> (8 rows)
>
> EXPLAIN ANALYZE SELECT * FROM test_table WHERE a = 1 OR a = 2;
>                                                       QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
>  Seq Scan on test_table  (cost=0.00..19425.00 rows=622674 width=8)
> (actual time=0.029..40.451 rows=750000.00 loops=1)
>    Filter: ((a = 1) OR (a = 2))
>    Rows Removed by Filter: 250000
>    Buffers: shared hit=4425
>  Planning:
>    Buffers: shared hit=6
>  Planning Time: 0.222 ms
>  Execution Time: 54.214 ms
> (8 rows)
>
> Before applying my patch, the estimated row count is significantly
> lower than the actual row count due to the independence assumption.
>
> I propose a patch that improves selectivity estimation when the OR
> clause contains only equality conditions on the same column. The patch
> currently supports both simple equality expressions and IN. If all
> predicates meet these criteria, the selectivity is computed as a sum
> rather than using the default formula.
>
> Here are the results of applying the patch to the same example:
> EXPLAIN ANALYZE SELECT * FROM test_table WHERE a = 1 OR a = 2;
>                                                       QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
>  Seq Scan on test_table  (cost=0.00..19425.00 rows=746800 width=8)
> (actual time=0.062..46.515 rows=750000.00 loops=1)
>    Filter: ((a = 1) OR (a = 2))
>    Rows Removed by Filter: 250000
>    Buffers: shared read=4425
>  Planning:
>    Buffers: shared hit=41 read=11
>  Planning Time: 1.019 ms
>  Execution Time: 61.013 ms
> (8 rows)
>
> This change improves cardinality estimation for such queries.
>
> I am open to any suggestions, feedback, or improvements.
>
> --
> Best regards,
> Ilia Evdokimov,
> Tantor Labs LLC.
>

After reviewing real query patterns, I have not encountered cases where
IN appears in the same column in OR-clauses. Therefore, I've decided to
to focus only on 'Var = Const' cases for now. I attached v2 patch with
changes.

A concrete example where this issue arises is TPC-DS query 41. This
query contains a lot of OR conditions on a single column, and the
current estimation method may lead to inaccurate rows count estimates.
While I have not yet benchmarked the full impact, the query structure
suggests that improving OR clause estimation could lead to better plan
choices.

If there interest, I can provide EXPLAIN ANALYZE comparisors before and
after applying v2 patch to demonstrate the impact on row estimates and
execution plans.

Any thoughts?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachment Content-Type Size
v2-0001-Improve-selectivity-estimation-for-OR-clauses.patch text/x-patch 6.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Shay Rojansky 2025-03-06 10:55:23 Re: JSON_VALUE() behavior when RETURNING bytea (expected base64 decoding)
Previous Message Bertrand Drouvot 2025-03-06 10:42:40 Re: Monitoring gaps in XLogWalRcvWrite() for the WAL receiver