From: | "Benjamin Coutu" <ben(dot)coutu(at)zeyos(dot)com> |
---|---|
To: | "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Unclamped row estimates whith OR-ed subplans |
Date: | 2020-06-19 17:33:31 |
Message-ID: | 20200619173332.8A83D5FB26@mx.zeyos.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
> No. The subplan estimates are for the number of rows produced by one
> execution of the subplan, ie the numbers of "accounts" or "contracts"
> rows that match those inner WHERE conditions. This has very little
> a-priori relationship to the number of "transactions" rows that will
> satisfy the outer WHERE condition. If we knew that transactions.account
> and transactions.contract were unique columns, then we might be able
> to say that there shouldn't be more than one outer match per subplan
> result row ... but you didn't say that, and it seems unlikely.
Thanks Tom, I understand your point.
I don't want to waste your time but maybe there is room for improvement as both "account" and "contract" are highly distinct and the individual subplan estimates are quite accurate:
Seq Scan on transactions (cost=67.81..171458.63 rows=1301316 width=1206) (actual time=69.418..917.594 rows=112 loops=1)
Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
Rows Removed by Filter: 1792937
SubPlan 1
-> Bitmap Heap Scan on accounts (cost=33.96..61.76 rows=46 width=4) (actual time=3.053..3.292 rows=111 loops=1)
Recheck Cond: ((name)::text ~~* '%test%'::text)
Rows Removed by Index Recheck: 4
Heap Blocks: exact=104
-> Bitmap Index Scan on s_accounts (cost=0.00..33.95 rows=46 width=0) (actual time=0.505..0.505 rows=118 loops=1)
Index Cond: ((name)::text ~~* '%test%'::text)
SubPlan 2
-> Seq Scan on contracts (cost=0.00..5.93 rows=5 width=4) (actual time=2.531..2.836 rows=4 loops=1)
Filter: ((name)::text ~~* '%test%'::text)
Rows Removed by Filter: 272
For comparison here are the plans for the queries with the individual where clauses:
SELECT * FROM "transactions" WHERE "account" IN (SELECT "ID" FROM "accounts" WHERE "name" ~~* '%test%')
Nested Loop (cost=34.38..488.93 rows=155 width=1206) (actual time=0.599..1.393 rows=112 loops=1)
-> Bitmap Heap Scan on accounts (cost=33.96..61.76 rows=46 width=4) (actual time=0.541..0.796 rows=111 loops=1)
Recheck Cond: ((name)::text ~~* '%test%'::text)
Rows Removed by Index Recheck: 4
Heap Blocks: exact=104
-> Bitmap Index Scan on s_accounts (cost=0.00..33.95 rows=46 width=0) (actual time=0.521..0.521 rows=118 loops=1)
Index Cond: ((name)::text ~~* '%test%'::text)
-> Index Scan using fk_transactions_account on transactions (cost=0.43..9.08 rows=21 width=1206) (actual time=0.004..0.005 rows=1 loops=111)
Index Cond: (account = accounts."ID")
SELECT * FROM "transactions" WHERE "contract" IN (SELECT "ID" FROM "contracts" WHERE "name" ~~* '%test%')
Nested Loop (cost=3.76..10.10 rows=31662 width=1206) (actual time=0.082..0.082 rows=0 loops=1)
-> Bitmap Heap Scan on contracts (cost=3.64..5.74 rows=5 width=4) (actual time=0.069..0.075 rows=4 loops=1)
Recheck Cond: ((name)::text ~~* '%test%'::text)
Heap Blocks: exact=2
-> Bitmap Index Scan on s_contracts (cost=0.00..3.64 rows=5 width=0) (actual time=0.060..0.060 rows=4 loops=1)
Index Cond: ((name)::text ~~* '%test%'::text)
-> Index Scan using fk_transactions_contract on transactions (cost=0.12..0.86 rows=1 width=1206) (actual time=0.001..0.001 rows=0 loops=4)
Index Cond: (contract = contracts."ID")
The statistics for the columns are:
SELECT attname, null_frac, n_distinct from pg_stats WHERE tablename = 'transactions' AND attname IN ('account', 'contract')
transactions.account: null_frac=0.025 n_distinct=80277
transactions.contract: null_frac=1 n_distinct=0 (there are basically no non-null values for field "contract" in transactions)
According to pg_class.reltuples the table "transactions" has 1735088 rows.
I'd naively expect the selectivity for an OR of those two hashed subplans given uniform distribution to be:
rows_total =
((rows_transactions * (1 - null_frac_account)) / n_distinct_account) * expected_rows_from_subplan1 +
((rows_transactions * (1 - null_frac_contract)) / n_distinct_contract) * expected_rows_from_subplan2
=> rows_total =
((1735088 * (1 - 0.025)) / 80277) * 46 +
((1735088 * (1 - 1)) / 0) * 5
=> rows_total = 969 + 0 /* no non-null values for contract field */
Please forgive the sloppy math but something along this line could be promising.
Btw, I don't quite understand why the nested loop on contract only is expected to yield 31662 rows, when the null_frac of field transactions.contract is 1. Shouldn't that indicate zero rows or some kind of default minimum estimate for that query?
Thanks again!
Benjamin Coutu
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2020-06-19 18:16:49 | Re: Unclamped row estimates whith OR-ed subplans |
Previous Message | David G. Johnston | 2020-06-19 16:14:57 | Re: Unclamped row estimates whith OR-ed subplans |