From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Problems with estimating OR conditions, IS NULL on LEFT JOINs
Date: 2023-06-23 19:23:43
Message-ID: 52b55b53-420a-722c-adbd-706922fc059b@enterprisedb.com

Hi,

I ran into a pretty terrible case of a LEFT JOIN estimate, resulting in a
fairly arbitrary underestimate. The query is trivial, nothing overly
complex, and the more I think about it the more I think this is a fairly
fundamental flaw in how we estimate this type of join.

Imagine you have two trivial tables:

CREATE TABLE large (id INT, a INT);
INSERT INTO large SELECT i, i FROM generate_series(1,1000000) s(i);

CREATE TABLE small (id INT, b INT);
INSERT INTO small SELECT i, i FROM generate_series(1,100) s(i);

The business meaning may be that "large" stores orders and "small" stores
events related to a tiny fraction of the large table (e.g. returns). Now
let's run a couple of simple LEFT JOIN queries, gradually adding
conditions to them.

Let's start with no condition at all:

EXPLAIN ANALYZE
SELECT * FROM large LEFT JOIN small ON (large.id = small.id);

QUERY PLAN
----------------------------------------------------------------------
Hash Left Join (cost=3.25..18179.25 rows=1000000 width=16)
(actual time=0.069..550.290 rows=1000000 loops=1)
Hash Cond: (large.id = small.id)
-> Seq Scan on large (cost=0.00..14425.00 rows=1000000 width=8)
(actual time=0.010..174.056 rows=1000000 loops=1)
-> Hash (cost=2.00..2.00 rows=100 width=8) (actual time=0.052...
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on small (cost=0.00..2.00 rows=100 width=8) ...
Planning Time: 0.291 ms
Execution Time: 663.551 ms
(8 rows)

So great, this estimate is perfect. Now, let's add an IS NULL condition
on the small table, to find rows without a match (e.g. orders that were
not returned):

EXPLAIN ANALYZE
SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
WHERE (small.id IS NULL);

QUERY PLAN
----------------------------------------------------------------------
Hash Anti Join (cost=3.25..27052.36 rows=999900 width=16)
(actual time=0.071..544.568 rows=999900 loops=1)
Hash Cond: (large.id = small.id)
-> Seq Scan on large (cost=0.00..14425.00 rows=1000000 width=8)
(actual time=0.015..166.019 rows=1000000 loops=1)
-> Hash (cost=2.00..2.00 rows=100 width=8) (actual time=0.051...
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on small (cost=0.00..2.00 rows=100 width=8) ...
Planning Time: 0.260 ms
Execution Time: 658.379 ms
(8 rows)

Also very accurate, great! Now let's add a condition on the large table
instead, filtering out some of the rows:

EXPLAIN ANALYZE
SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
WHERE (large.a IN (1000, 2000, 3000, 4000, 5000));

QUERY PLAN
----------------------------------------------------------------------
Nested Loop Left Join (cost=0.00..20684.75 rows=5 width=16)
(actual time=0.957..127.376 rows=5 loops=1)
Join Filter: (large.id = small.id)
Rows Removed by Join Filter: 500
-> Seq Scan on large (cost=0.00..20675.00 rows=5 width=8)
(actual time=0.878..127.171 rows=5 loops=1)
Filter: (a = ANY ('{1000,2000,3000,4000,5000}'::integer[]))
Rows Removed by Filter: 999995
-> Materialize (cost=0.00..2.50 rows=100 width=8) ...
-> Seq Scan on small (cost=0.00..2.00 rows=100 width=8) ...
Planning Time: 0.223 ms
Execution Time: 127.407 ms
(10 rows)

Also a great estimate! Surely, if we combine both conditions with OR,
we'll get a decent estimate too?

EXPLAIN ANALYZE
SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
WHERE (small.id IS NULL)
OR (large.a IN (1000, 2000, 3000, 4000, 5000));

QUERY PLAN
----------------------------------------------------------------------
Hash Left Join (cost=3.25..18179.88 rows=5 width=16)
(actual time=0.073..580.827 rows=999900 loops=1)
Hash Cond: (large.id = small.id)
Filter: ((small.id IS NULL) OR
(large.a = ANY ('{1000,2000,3000,4000,5000}'::integer[])))
Rows Removed by Filter: 100
-> Seq Scan on large (cost=0.00..14425.00 rows=1000000 width=8)
(actual time=0.015..166.809 rows=1000000 loops=1)
-> Hash (cost=2.00..2.00 rows=100 width=8) (actual time=0.052...
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on small (cost=0.00..2.00 rows=100 width=8) ...
Planning Time: 0.309 ms
Execution Time: 694.427 ms
(10 rows)

Well, bummer! This is pretty surprising, because if we know that clause A
alone is estimated at 1M rows and clause B alone at 5 rows, then we'd
expect (A OR B) to be estimated as something >= max(1M, 5). For users
running into this, it has to be really confusing.

It's also quite serious, because with underestimates like this the
planner is likely to pick nestloops for additional joins, and we all
know how that performs for millions of rows ...

So, how does this happen? Well, the simple reason is that joins are
estimated by applying the selectivities of all the clauses to the
cartesian product. So we calculate the product (100 * 1M rows), and then
apply the selectivity of the join condition and of the two single-table
clauses.
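
If I'm reading calc_joinrel_size_estimate() right, the arithmetic behind
the rows=5 estimate in the OR query goes roughly like this (a hand-written
SQL reproduction of the numbers, not the planner code; the 1/1M is just
the usual 1/max(ndistinct) guess for the equijoin clause):

SELECT ceil(greatest(1000000 * 100          -- cartesian product
                       * (1.0 / 1000000),   -- selectivity of the join clause
                     1000000)               -- clamped to at least the outer rows
            * (0 + 5.0 / 1000000))          -- OR clause: null_frac + "IN" selectivity
       AS estimated_rows;
-- => 5, matching the rows=5 in the plan above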

The problem is that the selectivity for "IS NULL" is estimated using the
table-level statistics. But the LEFT JOIN entirely breaks the idea that
null_frac has anything to do with NULLs in the join result, because the
join result is not a subset of the cartesian product - it contains
NULL-extended rows that don't exist in the cartesian product at all. Even
if the small table has no NULLs, the join can produce them - that's the
whole point of outer joins.
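
It's easy to see where the "no NULLs" assumption comes from - after
ANALYZE, the table-level statistics for the small table show no NULLs at
all, even though ~99.99% of the join result has its columns NULL-extended:

SELECT attname, null_frac
  FROM pg_stats
 WHERE tablename = 'small' AND attname IN ('id', 'b');
-- both columns show null_frac = 0 here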

When there's only the IS NULL condition, we actually recognize this as a
special case and treat the join as an antijoin (Hash Anti Join), and that
also fixes the estimate - antijoins do handle this fine. But as soon as
you change the condition a bit and do the IS NULL check on the other
column of the small table, it goes wrong too:

EXPLAIN ANALYZE
SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
WHERE (small.b IS NULL);

QUERY PLAN
----------------------------------------------------------------------
Hash Left Join (cost=3.25..18179.25 rows=1 width=16)
(actual time=0.311..3110.298 rows=999900 loops=1)
Hash Cond: (large.id = small.id)
Filter: (small.b IS NULL)
Rows Removed by Filter: 100
-> Seq Scan on large (cost=0.00..14425.00 rows=1000000 width=8)
(actual time=0.014..1032.497 rows=1000000 loops=1)
-> Hash (cost=2.00..2.00 rows=100 width=8) (actual time=0.287...
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Seq Scan on small (cost=0.00..2.00 rows=100 width=8) ...
Planning Time: 0.105 ms
Execution Time: 4083.634 ms
(10 rows)

I'd bet most users would be rather surprised to learn this subtle change
makes a difference.

I wonder how to improve this, say by adjusting the IS NULL selectivity
when we know the column comes from the nullable side of an outer join.
We're able to do this for antijoins, so maybe we could do something
similar here, somehow?

Unfortunately the important things (join type detection, IS NULL clause
estimation) happen pretty far apart in the code, but maybe we could pass
the info about the fraction of NULLs introduced by the join down to
nulltestsel(), and use it there (essentially by doing null_frac +
join_null_frac). And maybe we could do something like that for NULL tests
on the other columns from the nullable side too ...
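
Just to illustrate what I mean by join_null_frac on this example - every
row of "small" matches exactly one row of "large", so the fraction of
NULL-extended rows in the outer join is easy to compute by hand (the
planner obviously doesn't expose anything like this):

SELECT 1 - (SELECT count(*) FROM small)::float8
         / (SELECT count(*) FROM large) AS join_null_frac;
-- => 0.9999, so null_frac + join_null_frac = 0 + 0.9999, and 1M * 0.9999
-- gives the 999900 rows the (small.b IS NULL) query actually returns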

The other thing that might be beneficial is calculating boundaries for
the estimate. In this case we're capable of estimating the join size for
each individual condition, and then we could ensure the final estimate
for the "a OR b" join is >= max of those cardinalities.

Of course, that might be expensive if we have to redo some of the join
planning/estimates for each individual condition (otherwise we might not
notice the antijoin transformation and stuff like that).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
