Re: UniqueKey v2

From: Andy Fan <zhihuifan1213(at)163(dot)com>
To: Antonin Houska <ah(at)cybertec(dot)at>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>
Subject: Re: UniqueKey v2
Date: 2024-05-14 03:15:58
Message-ID: 87h6f1kqr6.fsf@163.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Antonin Houska <ah(at)cybertec(dot)at> writes:

>> Could you make the reason clearer for adding 'List *opfamily_lists;'
>> into UniqueKey? You said "This is needed to create ECs in the parent
>> query if the upper relation represents a subquery." but I didn't get the
>> it. Since we need to maintain the UniqueKey in the many places, I'd like
>> to keep it as simple as possbile. Of course, anything essentical should
>> be added for sure.
>
> If unique keys are generated for a subquery output, they also need to be
> created for the corresponding relation in the upper query ("sub" in the
> following example):

OK.
>
> select * from tab1 left join (select * from tab2) sub;
>
> However, to create an unique key for "sub", you need an EC for each expression
> of the key.

OK.
> And to create an EC, you in turn need the list of operator
> families.

I'm thinking if we need to "create" any EC. Can you find out a user case
where the outer EC is missed and the UniqueKey is still interesting? I
don't have an example now.

convert_subquery_pathkeys has a similar sistuation and has the following
codes:

outer_ec =
get_eclass_for_sort_expr(root,
(Expr *) outer_var,
sub_eclass->ec_opfamilies,
sub_member->em_datatype,
sub_eclass->ec_collation,
0,
rel->relids,
NULL,
false);

/*
* If we don't find a matching EC, sub-pathkey isn't
* interesting to the outer query
*/
if (outer_ec)
best_pathkey =
make_canonical_pathkey(root,
outer_ec,
sub_pathkey->pk_opfamily,
sub_pathkey->pk_strategy,
sub_pathkey->pk_nulls_first);
}

> Even if the parent query already had ECs for the columns of "sub" which are
> contained in the unique key, you need to make sure that those ECs are
> "compatible" with the ECs of the subquery which generated the unique key. That
> is, if an EC of the subquery considers certain input values equal, the EC of
> the parent query must also be able to determine if they are equal or not.
>
>> > * RelOptInfo.notnullattrs
>> >
>> > My understanding is that this field contains the set attributes whose
>> > uniqueness is guaranteed by the unique key. They are acceptable because they
>> > are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL
>> > value makes the containing row filtered out, so the row cannot break
>> > uniqueness of the output. Am I right?
>> >
>> > If so, I suggest to change the field name to something less generic, maybe
>> > 'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that
>> > more checks are needed before particular attribute can actually be used in
>> > UniqueKey.
>>
>> I don't think so, UniqueKey is just one of the places to use this
>> not-null property, see 3af704098 for the another user case of it.
>>
>> (Because of 3af704098, we should leverage notnullattnums somehow in this
>> patch, which will be included in the next version as well).
>
> In your patch you modify 'notnullattrs' in add_base_clause_to_rel(), but that
> does not happen to 'notnullattnums' in the current master branch. Thus I think
> that 'notnullattrs' is specific to the unique keys feature, so the field name
> should be less generic.

OK.

>> >
>> > * uniquekey_useful_for_merging()
>> >
>> > How does uniqueness relate to merge join? In README.uniquekey you seem to
>> > point out that a single row is always sorted, but I don't think this
>> > function is related to that fact. (Instead, I'd expect that pathkeys are
>> > added to all paths for a single-row relation, but I'm not sure you do that
>> > in the current version of the patch.)
>>
>> The merging is for "mergejoinable join clauses", see function
>> eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a";
>
> My question is: why is the uniqueness important specifically to merge join? I
> understand that join evaluation can be more efficient if we know that one
> input relation is unique (i.e. we only scan that relation until we find the
> first match), but this is not specific to merge join.

So the answer is the "merging" in uniquekey_useful_for_merging() has
nothing with merge join.

>> > * is_uniquekey_useful_afterjoin()
>> >
>> > Now that my patch (0004) allows propagation of the unique keys from a
>> > subquery to the upper query, I was wondering if the UniqueKey structure
>> > needs the 'use_for_distinct field' I mean we should now propagate the unique
>> > keys to the parent query whether the subquery has DISTINCT clause or not. I
>> > noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I
>> > changed the function to always returned true. However nothing changed in the
>> > output of regression tests (installcheck). Do you insist that the
>> > 'use_for_distinct' field is needed?
>
> I miss your answer to this comment.

After we considers the uniquekey from subquery, 'use_for_distinct' field
is not needed.

>> > ** What does the 'multi' word in the function name mean?
>>
>> multi means multiple, I thought we use this short name in the many
>> places, for ex bt_multi_page_stats after a quick search.
>
> Why not simply uniquekey_contains_nulls() ?

> Actually I wouldn't say that an instance of UniqueKey contains any value (NULL
> or NOT NULL) because it describes the whole relation rather than particular
> row. I consider UniqueKey to be a set of expressions. How about
> uniquekey_expression_nullable() ?

uniquekey_expression_nullable() is a better name, I will use it in the
next version.

IIUC, we have reached to the agreement based on your latest response for
the most of the questions. Please point me if I missed anything.

>> > Of course one problem is that the number of combinations can grow
>> > exponentially as new relations are joined.
>>
>> Yes, that's why "rule 1" needed and "How to reduce the overhead" in
>> UniqueKey.README is introduced.
>
> What if we are interested in unique keys of a subquery, but the subquery has
> no DISTINCT clause?

I agree we should remove the prerequisite of "use_for_distinct".

--
Best Regards
Andy Fan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Erik Wienhold 2024-05-14 03:18:24 Underscore in positional parameters?
Previous Message Andy Fan 2024-05-14 02:32:14 Re: First draft of PG 17 release notes