Re: Unclamped row estimates whith OR-ed subplans

From: "Benjamin Coutu" <ben(dot)coutu(at)zeyos(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Unclamped row estimates whith OR-ed subplans
Date: 2020-06-19 17:33:31
Message-ID: 20200619173332.8A83D5FB26@mx.zeyos.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

> No. The subplan estimates are for the number of rows produced by one
> execution of the subplan, ie the numbers of "accounts" or "contracts"
> rows that match those inner WHERE conditions. This has very little
> a-priori relationship to the number of "transactions" rows that will
> satisfy the outer WHERE condition. If we knew that transactions.account
> and transactions.contract were unique columns, then we might be able
> to say that there shouldn't be more than one outer match per subplan
> result row ... but you didn't say that, and it seems unlikely.

Thanks Tom, I understand your point.

I don't want to waste your time but maybe there is room for improvement as both "account" and "contract" are highly distinct and the individual subplan estimates are quite accurate:

Seq Scan on transactions (cost=67.81..171458.63 rows=1301316 width=1206) (actual time=69.418..917.594 rows=112 loops=1)
Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
Rows Removed by Filter: 1792937
SubPlan 1
-> Bitmap Heap Scan on accounts (cost=33.96..61.76 rows=46 width=4) (actual time=3.053..3.292 rows=111 loops=1)
Recheck Cond: ((name)::text ~~* '%test%'::text)
Rows Removed by Index Recheck: 4
Heap Blocks: exact=104
-> Bitmap Index Scan on s_accounts (cost=0.00..33.95 rows=46 width=0) (actual time=0.505..0.505 rows=118 loops=1)
Index Cond: ((name)::text ~~* '%test%'::text)
SubPlan 2
-> Seq Scan on contracts (cost=0.00..5.93 rows=5 width=4) (actual time=2.531..2.836 rows=4 loops=1)
Filter: ((name)::text ~~* '%test%'::text)
Rows Removed by Filter: 272

For comparison here are the plans for the queries with the individual where clauses:

SELECT * FROM "transactions" WHERE "account" IN (SELECT "ID" FROM "accounts" WHERE "name" ~~* '%test%')

Nested Loop (cost=34.38..488.93 rows=155 width=1206) (actual time=0.599..1.393 rows=112 loops=1)
-> Bitmap Heap Scan on accounts (cost=33.96..61.76 rows=46 width=4) (actual time=0.541..0.796 rows=111 loops=1)
Recheck Cond: ((name)::text ~~* '%test%'::text)
Rows Removed by Index Recheck: 4
Heap Blocks: exact=104
-> Bitmap Index Scan on s_accounts (cost=0.00..33.95 rows=46 width=0) (actual time=0.521..0.521 rows=118 loops=1)
Index Cond: ((name)::text ~~* '%test%'::text)
-> Index Scan using fk_transactions_account on transactions (cost=0.43..9.08 rows=21 width=1206) (actual time=0.004..0.005 rows=1 loops=111)
Index Cond: (account = accounts."ID")

SELECT * FROM "transactions" WHERE "contract" IN (SELECT "ID" FROM "contracts" WHERE "name" ~~* '%test%')

Nested Loop (cost=3.76..10.10 rows=31662 width=1206) (actual time=0.082..0.082 rows=0 loops=1)
-> Bitmap Heap Scan on contracts (cost=3.64..5.74 rows=5 width=4) (actual time=0.069..0.075 rows=4 loops=1)
Recheck Cond: ((name)::text ~~* '%test%'::text)
Heap Blocks: exact=2
-> Bitmap Index Scan on s_contracts (cost=0.00..3.64 rows=5 width=0) (actual time=0.060..0.060 rows=4 loops=1)
Index Cond: ((name)::text ~~* '%test%'::text)
-> Index Scan using fk_transactions_contract on transactions (cost=0.12..0.86 rows=1 width=1206) (actual time=0.001..0.001 rows=0 loops=4)
Index Cond: (contract = contracts."ID")

The statistics for the columns are:

SELECT attname, null_frac, n_distinct from pg_stats WHERE tablename = 'transactions' AND attname IN ('account', 'contract')

transactions.account: null_frac=0.025 n_distinct=80277
transactions.contract: null_frac=1 n_distinct=0 (there are basically no non-null values for field "contract" in transactions)

According to pg_class.reltuples the table "transactions" has 1735088 rows.

I'd naively expect the selectivity for an OR of those two hashed subplans given uniform distribution to be:

rows_total =
((rows_transactions * (1 - null_frac_account)) / n_distinct_account) * expected_rows_from_subplan1 +
((rows_transactions * (1 - null_frac_contract)) / n_distinct_contract) * expected_rows_from_subplan2

=> rows_total =
((1735088 * (1 - 0.025)) / 80277) * 46 +
((1735088 * (1 - 1)) / 0) * 5

=> rows_total = 969 + 0 /* no non-null values for contract field */

Please forgive the sloppy math but something along this line could be promising.

Btw, I don't quite understand why the nested loop on contract only is expected to yield 31662 rows, when the null_frac of field transactions.contract is 1. Shouldn't that indicate zero rows or some kind of default minimum estimate for that query?

Thanks again!

Benjamin Coutu

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2020-06-19 18:16:49 Re: Unclamped row estimates whith OR-ed subplans
Previous Message David G. Johnston 2020-06-19 16:14:57 Re: Unclamped row estimates whith OR-ed subplans