Inaccurate (sometimes wildly so) row estimates for simple join

From: Greg Nolle <greg(dot)nolle(at)voidbridge(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Inaccurate (sometimes wildly so) row estimates for simple join
Date: 2020-05-22 13:27:03
Message-ID: CA+wHfn-KEfEfMFWFWxpWnROX3NQrQJWn86Jm6h-4A_C7gLCYyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

I’m working on diagnosing an issue with a complex query on a large PG11
database we have. The planner is choosing a plan that takes 60+ seconds but
if we force it to use a particular index then it takes only a couple of
seconds. I’ve narrowed down what I think is the cause to a very simple join
for which PG is underestimating the rows by a factor of 20. It then chooses
to do a nested loop which is actually much slower than it thinks it would
be based on the estimate.

I’ve managed to reproduce this underestimation issue using a simple data
set that is very similar in distribution to what we have in our real
database:

CREATE TABLE test_a (
a_id VARCHAR(255) PRIMARY KEY,
group_val VARCHAR(255) NOT NULL
);
INSERT INTO test_a (a_id, group_val) SELECT 'a_id'||s, 'group'||s FROM
generate_series(1, 100) s;
INSERT INTO test_a (a_id, group_val) VALUES ('a_id101', 'group1');

CREATE TABLE test_b (
b_id VARCHAR(255) PRIMARY KEY,
a_id VARCHAR(255) NOT NULL,
CONSTRAINT test_bk_fk01 FOREIGN KEY (a_id) REFERENCES test_a (a_id)
);
INSERT INTO test_b (b_id, a_id) SELECT 'b_id'||s, 'a_id1' FROM
generate_series(1, 1500) s;
INSERT INTO test_b (b_id, a_id) SELECT 'b_id'||s, 'a_id2' FROM
generate_series(1501, 2000) s;

ANALYZE test_a;
ANALYZE test_b;

EXPLAIN ANALYZE SELECT * FROM test_b b JOIN test_a a ON a.a_id = b.a_id
WHERE a.group_val = 'group1';

This gives an estimated 40 rows for the join when in fact there are 1500
rows. The crux seems to be that test_b does not have an even distribution
for a_id values: it only has records for two of the values in the
referenced table. This is how our real dataset is too and isn’t something
we can really change.

I’ve read through the row estimation information here:
https://www.postgresql.org/docs/11/row-estimation-examples.html and also
read through the source code for the eqjoinsel_inner function. I can see
how it’s calculating the estimate but I have no clue how it can be made
accurate for this (seemingly) simple scenario.

Any hints or tips would be greatly appreciated.

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2020-05-22 13:27:32 Re: pg_dump crashes
Previous Message Ted Toth 2020-05-22 13:02:32 FDW and RLS