From: | Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [PATCH] Improve selectivity estimation for OR clauses with equality conditions on the same column |
Date: | 2025-03-06 10:43:37 |
Message-ID: | 720a76a3-59ad-4a88-ba5b-f47786056326@tantorlabs.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 03.03.2025 18:57, Ilia Evdokimov wrote:
>
> hi hackers,
>
> When estimating the selectivity of an OR clause without extended
> statistics, we assume that the conditions are independent. However, it
> is quite common to see queries like (a = 1 OR a = 2). In such cases,
> the assumption of independence can lead to a significant
> underestimation of selectivity
>
> Even when comparing OR conditions to IN, the results can differ
> significantly. This issue is particularly noticeable when the number
> of predicates in the OR clause is small, the table is large, and
> ndistinct is low. Here’s an example of such a query on a large table
> with low ndistinct:
>
> CREATE TABLE test_table ( id SERIAL PRIMARY KEY, a INT );
> INSERT INTO test_table (a) SELECT 1 FROM generate_series(1, 500000);
> -- 500 000 rows with a = 1
> INSERT INTO test_table (a) SELECT 2 FROM generate_series(1, 250000);
> -- 250 000 rows with a = 2
> INSERT INTO test_table (a) SELECT 3 FROM generate_series(1, 250000);
> -- 250 000 rows with a = 3
> ANALYZE test_table;
>
> EXPLAIN ANALYZE SELECT * FROM test_table WHERE a IN (1, 2);
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
> Seq Scan on test_table (cost=0.00..16925.00 rows=746800 width=8)
> (actual time=0.055..50.064 rows=750000.00 loops=1)
> Filter: (a = ANY ('{1,2}'::integer[]))
> Rows Removed by Filter: 250000
> Buffers: shared hit=32 read=4393
> Planning:
> Buffers: shared hit=33 read=15
> Planning Time: 1.056 ms
> Execution Time: 64.800 ms
> (8 rows)
>
> EXPLAIN ANALYZE SELECT * FROM test_table WHERE a = 1 OR a = 2;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
> Seq Scan on test_table (cost=0.00..19425.00 rows=622674 width=8)
> (actual time=0.029..40.451 rows=750000.00 loops=1)
> Filter: ((a = 1) OR (a = 2))
> Rows Removed by Filter: 250000
> Buffers: shared hit=4425
> Planning:
> Buffers: shared hit=6
> Planning Time: 0.222 ms
> Execution Time: 54.214 ms
> (8 rows)
>
> Before applying my patch, the estimated row count is significantly
> lower than the actual row count due to the independence assumption.
>
> I propose a patch that improves selectivity estimation when the OR
> clause contains only equality conditions on the same column. The patch
> currently supports both simple equality expressions and IN. If all
> predicates meet these criteria, the selectivity is computed as a sum
> rather than using the default formula.
>
> Here are the results of applying the patch to the same example:
> EXPLAIN ANALYZE SELECT * FROM test_table WHERE a = 1 OR a = 2;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
> Seq Scan on test_table (cost=0.00..19425.00 rows=746800 width=8)
> (actual time=0.062..46.515 rows=750000.00 loops=1)
> Filter: ((a = 1) OR (a = 2))
> Rows Removed by Filter: 250000
> Buffers: shared read=4425
> Planning:
> Buffers: shared hit=41 read=11
> Planning Time: 1.019 ms
> Execution Time: 61.013 ms
> (8 rows)
>
> This change improves cardinality estimation for such queries.
>
> I am open to any suggestions, feedback, or improvements.
>
> --
> Best regards,
> Ilia Evdokimov,
> Tantor Labs LLC.
>
After reviewing real query patterns, I have not encountered cases where
IN appears in the same column in OR-clauses. Therefore, I've decided to
to focus only on 'Var = Const' cases for now. I attached v2 patch with
changes.
A concrete example where this issue arises is TPC-DS query 41. This
query contains a lot of OR conditions on a single column, and the
current estimation method may lead to inaccurate rows count estimates.
While I have not yet benchmarked the full impact, the query structure
suggests that improving OR clause estimation could lead to better plan
choices.
If there interest, I can provide EXPLAIN ANALYZE comparisors before and
after applying v2 patch to demonstrate the impact on row estimates and
execution plans.
Any thoughts?
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
Attachment | Content-Type | Size |
---|---|---|
v2-0001-Improve-selectivity-estimation-for-OR-clauses.patch | text/x-patch | 6.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Shay Rojansky | 2025-03-06 10:55:23 | Re: JSON_VALUE() behavior when RETURNING bytea (expected base64 decoding) |
Previous Message | Bertrand Drouvot | 2025-03-06 10:42:40 | Re: Monitoring gaps in XLogWalRcvWrite() for the WAL receiver |