From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Additional improvements to extended statistics |
Date: | 2020-03-09 18:19:15 |
Message-ID: | 20200309181915.5lxhuw2qxoihfoqo@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Mar 09, 2020 at 08:35:48AM +0000, Dean Rasheed wrote:
>On Mon, 9 Mar 2020 at 00:02, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>> Speaking of which, would you take a look at [1]? I think supporting SAOP
>> is fine, but I wonder if you agree with my conclusion we can't really
>> support inclusion @> as explained in [2].
>>
>
>Hmm, I'm not sure. However, thinking about your example in [2] reminds
>me of a thought I had a while ago, but then forgot about --- there is
>a flaw in the formula used for computing probabilities with functional
>dependencies:
>
> P(a,b) = P(a) * [f + (1-f)*P(b)]
>
>because it might return a value that is larger that P(b), which
>obviously should not be possible.
Hmmm, yeah. It took me a while to reproduce this - the trick is in
"inverting" the dependency on a subset of data, e.g. like this:
create table t (a int, b int);
insert into t select mod(i,50), mod(i,25)
from generate_series(1,5000) s(i);
update t set a = 0 where a < 3;
create statistics s (dependencies) on a,b from t;
which then does this:
test=# explain select * from t where a = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.50 rows=300 width=8)
Filter: (a = 0)
(2 rows)
test=# explain select * from t where b = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..86.50 rows=200 width=8)
Filter: (b = 0)
(2 rows)
test=# explain select * from t where a = 0 and b = 0;
QUERY PLAN
----------------------------------------------------
Seq Scan on t (cost=0.00..99.00 rows=283 width=8)
Filter: ((a = 0) AND (b = 0))
(2 rows)
Which I think is the issue you've described ...
>We should amend that formula to prevent a result larger than P(b). The
>obvious way to do that would be to use:
>
> P(a,b) = Min(P(a) * [f + (1-f)*P(b)], P(b))
>
>but actually I think it would be better and more principled to use:
>
> P(a,b) = f*Min(P(a),P(b)) + (1-f)*P(a)*P(b)
>
>I.e., for those rows believed to be functionally dependent, we use the
>minimum probability, and for the rows believed to be independent, we
>use the product.
>
Hmmm, yeah. The trouble is that we currently don't really have both
selectivities in dependencies_clauselist_selectivity :-(
We get both clauses, but we only compute selectivity of the "implied"
clause, and we leave the other one as not estimated (possibly up to
clauselist_selectivity).
It's also not clear to me how would this work for more than two clauses,
that are all functionally dependent. Like (a => b => c), for example.
But I haven't thought about this very much yet.
>I think that would solve the problem with the example you gave at the
>end of [2], but I'm not sure if it helps with the general case.
>
I don't follow. I think the issue with [2] is that we can't really apply
stats about the array values to queries on individual array elements.
Can you explain how would the proposed changes deal with this?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2020-03-09 18:30:18 | Re: logical replication empty transactions |
Previous Message | Robert Haas | 2020-03-09 17:53:18 | Re: range_agg |