Re: planner chooses incremental but not the best one

From: ywgrit <yw987194828(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: planner chooses incremental but not the best one
Date: 2023-12-26 09:25:05
Message-ID: CAPt2h2b683yMzVDVtCWiC_FNEym19tt4cdcxhUsQVCVEZ8z2dQ@mail.gmail.com
Lists: pgsql-hackers

Hi, Tomas

Recently, I looked at papers related to cardinality estimation under
selection predicates. I am inclined towards the scheme proposed in the paper
"Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries
and Event Reports". This paper uses distinct sampling as opposed to the
current uniform sampling, and the main differences between the two are as
follows.

1) It not only counts the distinct values on each column (or combination of
columns), but also saves some rows corresponding to each distinct value,
i.e., it retains part of the rows of the original relation as a sample. The
purpose of saving these rows is to accommodate restrictions on the queries,
such as WHERE clauses.

2) Queries are executed directly on the sample, and the scaled result is
used as the cardinality estimate.
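Sketched below is my understanding of how distinct sampling maintains its sample and answers a predicated distinct-count query. This is a simplified Python sketch under my own assumptions; the paper's actual algorithm additionally bounds the number of rows stored per distinct value, which I omit here.

```python
import hashlib

def level(value, max_level=32):
    # Hash each distinct value to a geometric "level": a value reaches
    # level >= l with probability 2**-l. All rows sharing a value land
    # on the same level, so sampling by level keeps whole value classes.
    h = int(hashlib.sha256(str(value).encode()).hexdigest(), 16)
    l = 0
    while h & 1 and l < max_level:
        h >>= 1
        l += 1
    return l

def distinct_sample(rows, key, capacity=100):
    # Keep rows whose key reaches at least the current level; raise the
    # level (halving the expected sample) whenever capacity is exceeded.
    cur, sample = 0, []
    for row in rows:
        if level(row[key]) >= cur:
            sample.append(row)
        while len(sample) > capacity:
            cur += 1
            sample = [r for r in sample if level(r[key]) >= cur]
    return sample, cur

def estimate_distinct(sample, cur, key, pred):
    # Run the predicate on the sample, count surviving distinct keys,
    # and scale by the inverse sampling rate 2**cur.
    return len({r[key] for r in sample if pred(r)}) * (2 ** cur)

rows = [{"b": i} for i in range(1000)]              # 1000 distinct values
sample, cur = distinct_sample(rows, "b", capacity=100)
est_all = estimate_distinct(sample, cur, "b", lambda r: True)
est_even = estimate_distinct(sample, cur, "b", lambda r: r["b"] % 2 == 0)
```

On this toy input the estimates should come out close to the true counts (1000 and 500) while storing at most 100 rows, including for the predicated query, which plain uniform sampling handles poorly.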

The advantages of this approach over the existing practice are as follows.

1) The collected sample can be applied to arbitrary predicates, e.g.
predicates that are correlated with the columns of the GROUP BY clause.

2) The accuracy is very high; in some scenarios the estimation error is
reduced by a factor of several hundred compared to uniform sampling.

However, the scheme also has a drawback: as mentioned above, it relies on
the retained sample rows, which significantly increases the storage
overhead of the statistics.
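As a sanity check of the num_groups formula from my earlier message (quoted below), here is a toy Python experiment where the predicate column and the grouping column are independent and uniformly distributed. The data and column names are hypothetical, chosen only for illustration.

```python
import random

def ndistinct(rows, cols):
    # Count distinct combinations of the given columns.
    return len({tuple(r[c] for c in cols) for r in rows})

def estimate_num_groups(rows, pred_cols, sort_cols, selectivity):
    # num_groups ~= ndistinct(pred_cols + sort_cols) * selectivity,
    # valid when the predicate is independent of the grouping columns.
    return ndistinct(rows, pred_cols + sort_cols) * selectivity

random.seed(42)
# Toy table: b uniform over 10 values, c uniform over 100 values.
rows = [{"b": random.randrange(10), "c": random.randrange(100)}
        for _ in range(10000)]

pred = lambda r: r["b"] == 3                      # WHERE b = 3
selectivity = sum(pred(r) for r in rows) / len(rows)

est = estimate_num_groups(rows, ["b"], ["c"], selectivity)
true_groups = ndistinct([r for r in rows if pred(r)], ["b", "c"])
```

Here the estimate should land close to the true group count (about 100); with correlated b and c it would drift, which is exactly the case the distinct-sampling approach is meant to handle.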

I'd like to hear your opinions.

ywgrit.

ywgrit <yw987194828(at)gmail(dot)com> wrote on Fri, Dec 22, 2023 at 16:20:

> The one scenario I can offer a possible solution for so far is when the
> query's predicate columns and grouping columns belong to the same table.
>
> For a query that contains a WHERE clause, perhaps num_groups could be
> estimated according to the following formula:
>
> num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where clause
>              * ndistinct(pred_col_1, pred_col_2, ... pred_col_n,
>                          sort_col_1, sort_col_2, ... sort_col_m)
>              / ndistinct(pred_col_1, pred_col_2, ... pred_col_n)
>
> ndistinct(pred_col_1, pred_col_2, ... pred_col_n) with where clause =
> ndistinct(pred_col_1, pred_col_2, ... pred_col_n) * selectivity of rel.
>
> Here pred_col_1 ... pred_col_n are the columns involved in the WHERE
> clause and sort_col_1 ... sort_col_m are the columns involved in the
> GROUP BY clause.
>
> The reason for multiplying directly by the selectivity of rel is that the
> selectivity depends only on pred_col, not on sort_col. So the above
> formula can be simplified as follows:
>
> num_groups = ndistinct(pred_col_1, pred_col_2, ... pred_col_n,
> sort_col_1, sort_col_2, ... sort_col_m) * selectivity of rel.
>
> The correctness of the above formula depends on the following conditions:
>
> - The ndistinct(pred_col_1, pred_col_2, ... pred_col_n) and
>   ndistinct(pred_col_1, pred_col_2, ... pred_col_n, sort_col_1,
>   sort_col_2, ... sort_col_m) statistics already exist and are accurate.
>
> - Both pred_col_n and sort_col_m are uniformly distributed; if not,
>   statistics such as MCV lists are needed for correction.
>
> - The tuple count of rel is the total number of tuples of the table,
>   not the number of filtered tuples.
>
> After experimentation in the scenario mentioned in the previous thread,
> the estimated num_groups is 3. The accuracy of the result relies strongly
> on the uniform distribution of b, which allows ndistinct(pred_col_1,
> pred_col_2, ... pred_col_n) with where clause to be estimated accurately.
>
> I'd like to hear your opinions.
>
> Regards.
>
> ywgrit.
>
> Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> wrote on Mon, Dec 18, 2023 at 20:53:
>
>>
>>
>> On 12/18/23 11:40, Richard Guo wrote:
>> >
>> > On Mon, Dec 18, 2023 at 7:31 AM Tomas Vondra
>> > <tomas(dot)vondra(at)enterprisedb(dot)com <mailto:tomas(dot)vondra(at)enterprisedb(dot)com>>
>> > wrote:
>> >
>> > Oh! Now I see what you meant by using the new formula in 84f9a35e3
>> > depending on how we sum tuples. I agree that seems like the right
>> thing.
>> >
>> > I'm not sure it'll actually help with the issue, though - if I
>> apply the
>> > patch, the plan does not actually change (and the cost changes just
>> a
>> > little bit).
>> >
>> > I looked at this only very briefly, but I believe it's due to the
>> > assumption of independence I mentioned earlier - we end up using
>> the new
>> > formula introduced in 84f9a35e3, but it assumes the
>> > selectivity and number of groups are independent. But that's not the
>> > case here, because the groups are very clearly correlated (with the
>> > condition on "b").
>> >
>> >
>> > You're right. The patch allows us to adjust the estimate of distinct
>> > values for appendrels using the new formula introduced in 84f9a35e3.
>> > But if the restrictions are correlated with the grouping expressions,
>> > the new formula does not behave well. That's why the patch does not
>> > help in case [1], where 'b' and 'c' are correlated.
>> >
>> > OTOH, if the restrictions are not correlated with the grouping
>> > expressions, the new formula would perform quite well. And in this case
>> > the patch would help a lot, as shown in [2] where estimate_num_groups()
>> > gives a much more accurate estimate with the help of this patch.
>> >
>> > So this patch could be useful in certain situations. I'm wondering if
>> > we should at least have this patch (if it is right).
>> >
>>
>> I do agree the patch seems to do the right thing, and it's worth pushing
>> on its own.
>>
>> >
>> > If that's the case, I'm not sure how to fix this :-(
>> >
>> >
>> > The commit message of 84f9a35e3 says
>> >
>> > This could possibly be improved upon in the future by identifying
>> > correlated restrictions and using a hybrid of the old and new
>> > formulae.
>> >
>> > Maybe this is something we can consider trying. But anyhow this is not
>> > an easy task I suppose.
>>
>> Yeah, if it was easy, it'd have been done in 84f9a35e3 already ;-)
>>
>> The challenge is where to get usable information about correlation
>> between columns. I only have a couple of very rough ideas of what we
>> might try. For example, if we have multi-column ndistinct statistics, we might
>> look at ndistinct(b,c) and ndistinct(b,c,d) and deduce something from
>>
>> ndistinct(b,c,d) / ndistinct(b,c)
>>
>> If we know how many distinct values we have for the predicate column, we
>> could then estimate the number of groups. I mean, we know that for the
>> restriction "WHERE b = 3" we only have 1 distinct value, so we could
>> estimate the number of groups as
>>
>> 1 * ndistinct(b,c)
>>
>> I'm well aware this is only a very trivial example, and for more complex
>> examples it's likely way more complicated. But hopefully it illustrates
>> the general idea.
>>
>> The other idea would be to maybe look at multi-column MCV, and try using
>> it to deduce cross-column correlation too (it could be more accurate for
>> arbitrary predicates).
>>
>> And finally, we might try being a bit more pessimistic and look at what
>> the "worst case" behavior could be. So for example instead of trying to
>> estimate the real number of groups, we'd ask "What's the minimum number
>> of groups we're likely to get?". And use that to cost the incremental
>> sort. But I don't think we do this elsewhere, and I'm not sure we want
>> to start with this risk-based approach here. It might be OK, because we
>> usually expect the incremental sort to be much cheaper, ...
>>
>> Is this something you would be interested in exploring? I don't have
>> capacity to work on this myself, but I'd be available for reviews,
>> feedback and so on.
>>
>> regards
>>
>> --
>> Tomas Vondra
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>>
>>
