Cardinality estimate of the inner relation

From: Frédéric Yhuel <frederic(dot)yhuel(at)dalibo(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Cc: Christophe Courtois <christophe(dot)courtois(at)dalibo(dot)com>
Subject: Cardinality estimate of the inner relation
Date: 2024-11-22 15:53:39
Message-ID: b860c71a-7cab-4d88-ad87-8c1f2eea9ae8@dalibo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

My colleague Christophe Courtois and I have been trying to fix a bad
plan for one of Dalibo's clients. It is a (probably well-known) problem
with skewed data and a parameterized Nested Loop with an underestimation
of the cardinality of the inner relation.

Here is a test case (the script to create and populate the two tables is
at the end):

EXPLAIN (ANALYZE, SETTINGS)
SELECT name, sum(quantity)
FROM products p
JOIN orders o ON (p.id = o.product_id)
WHERE p.name = 'babar' /* 28% of orders */
GROUP BY name;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=5.50..1022.27 rows=1 width=40) (actual
time=66.213..66.215 rows=1 loops=1)
-> Nested Loop (cost=5.50..1019.29 rows=1189 width=36) (actual
time=0.083..56.991 rows=320000 loops=1)
-> Bitmap Heap Scan on products p (cost=5.08..250.17 rows=85
width=40) (actual time=0.058..0.153 rows=80 loops=1)
Recheck Cond: (name = 'babar'::text)
Heap Blocks: exact=80
-> Bitmap Index Scan on products_name_idx
(cost=0.00..5.06 rows=85 width=0) (actual time=0.030..0.031 rows=80 loops=1)
Index Cond: (name = 'babar'::text)
-> Index Scan using orders_product_id_idx on orders o
(cost=0.43..8.79 rows=26 width=12) (actual time=0.006..0.440 rows=4000
loops=80)
Index Cond: (product_id = p.id)
Planning Time: 0.560 ms
Execution Time: 66.268 ms

The cardinaly estimate on 'orders' is wrong (26 rows against 4000),
hence the Nested Loop.
Any other value for product.name gives less rows and a good plan (i.e.
NL makes more sense).

If I'm not wrong, the row count estimate for the inner table is computed
as follow:

SELECT n_distinct AS product_id_n_distinct FROM pg_stats
WHERE tablename = 'orders' AND attname = 'product_id' \gset

SELECT reltuples AS orders_tuples FROM pg_class
WHERE relname = 'orders' \gset

SELECT :orders_tuples / :product_id_n_distinct AS inner_estimation;
inner_estimation
---------------------
26.4049450290190157

I have two questions:

1) Is there a known workaround for this problem (besides unatural
partitioning of the inner table)?

2) Would it be possible to add cross-table statistics to fix this? Has
this been discussed?

Here is the script to create and populate the two tables:

SELECT 80000 AS nbp \gset

DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS products;

CREATE UNLOGGED TABLE products(
id bigint PRIMARY KEY,
name TEXT,
price INT
);

CREATE UNLOGGED TABLE orders(
id bigint generated always as identity,
product_id BIGINT, -- REFERENCES products(id),
quantity INT
);

INSERT INTO products (id, name, price)
SELECT
i,
CASE WHEN i % 1000 = 0 THEN 'babar' ELSE md5(i::text) END,
random() * 1000
FROM generate_series(1, :nbp) AS T(i);

INSERT INTO orders (product_id, quantity)
SELECT product_id, quantity FROM (
SELECT
id AS product_id,
random() * 10 AS quantity,
name
FROM products,
LATERAL (SELECT * FROM generate_series (1, CASE WHEN name = 'babar'
THEN 4000 ELSE 10 END)) T) U;

CREATE INDEX ON orders (product_id);
VACUUM ANALYZE products, orders;

ALTER TABLE orders ADD FOREIGN KEY (product_id) REFERENCES products(id);

CREATE INDEX ON products (name);

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tomas Vondra 2024-11-23 01:38:13 Re: could not send data to client: Connection reset by peer
Previous Message Andrei Lepikhov 2024-11-22 14:32:57 Re: Performance of Query 60 on TPC-DS Benchmark