Complex filters -> Bad row estimates -> bad query plan

From: Mathieu Fenniak <mathieu(dot)fenniak(at)replicon(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Complex filters -> Bad row estimates -> bad query plan
Date: 2019-08-21 15:53:22
Message-ID: CAHoiPjzcv4T+5a7giM4925Ltj+VegQkmG6+_1PUtGX3-dZ141A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi all,

I have a database query where I have a number of "simple" where clauses, a
number of "complex" subquery based where clauses, and one NOT EXISTS where
clause; it looks something like this:

SELECT ...some fields... FROM Table1
WHERE
Field1 IN (1, 2, 3, 4, 21, 24) AND -- simple filter
Field2 <> 1 AND -- simple filter
Field3 >= '2019-07-08' AND -- simple filter
Field3 <= '2019-08-18' AND -- simple filter
NOT EXISTS (SELECT 1 FROM Table2 WHERE Table2.Status = 2 AND
Table2.SomeId = Table1.Id) AND -- anti-join
COALESCE((SELECT Status FROM Table3 WHERE Table3.SomeId = Table1.Id),
(SELECT Status FROM Table4 WHERE Table4.SomeId = Table1.Id)) = 2 --
"complex" condition

The problem I'm encountering is that I've observed degraded performance in
some cases where the Anti Join merge for the NOT EXISTS clause is planned
based upon poor row estimates for Table1. All of the other filters, and
the SubPlan filter(s) for the complex clauses, result in the query planner
estimating that only 1 row from Table1 will be resulting, so a Nested Loop
Anti Join is used and the RHS of that nested loop in a seqscan on Table2.
The reality is that many thousands of records match all the conditions; a
Merge Anti Join or Hash Anti Join would be a better query plan.

I've tested the query planner with just the simpler conditions, and it
makes pretty reasonable estimates about the row count (+/- 10%). Adding
the NOT EXISTS results in a Merge Anti Join, and performance is great.

Adding the more "complex" conditions (there are potentially multiple of
these subquery plan searches) results in the estimated row count dropping
to 1, and, performance dives.

I know there are no "query hints" in PostgreSQL... any thoughts on
alternative approaches here? The only option I've used in the past for
this is creating specialized indexes, which can provide more targeted
statistics; but it's not applicable here since the "complex" conditions use
data from another table in a subquery.

Appreciate any thoughts, theories, or directions. :-) Thanks,

Mathieu

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Rob Sargent 2019-08-21 17:54:14 Re: SELECT all the rows where id is children of other node.
Previous Message Adrian Klaver 2019-08-21 15:47:05 Re: Rename a column if not already renamed.?