Re: UniqueKey v2

From: Andy Fan <zhihuifan1213(at)163(dot)com>
To: Antonin Houska <ah(at)cybertec(dot)at>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>
Subject: Re: UniqueKey v2
Date: 2024-05-06 07:48:10
Message-ID: 875xvrnzga.fsf@163.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


Hello Antonin,

Thanks for interesting with this topic!

> * I think that, before EC is considered suitable for an UK, its ec_opfamilies
> field needs to be checked. I try to do that in
> find_ec_position_matching_expr(), see 0004.

Could you make the reason clearer for adding 'List *opfamily_lists;'
into UniqueKey? You said "This is needed to create ECs in the parent
query if the upper relation represents a subquery." but I didn't get the
it. Since we need to maintain the UniqueKey in the many places, I'd like
to keep it as simple as possbile. Of course, anything essentical should
be added for sure.

>
> * Set-returning functions (SRFs) can make a "distinct path" necessary even if
> the join output is unique.

You are right at this point, I will fix it in the coming version.

>
> * RelOptInfo.notnullattrs
>
> My understanding is that this field contains the set attributes whose
> uniqueness is guaranteed by the unique key. They are acceptable because they
> are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL
> value makes the containing row filtered out, so the row cannot break
> uniqueness of the output. Am I right?
>
> If so, I suggest to change the field name to something less generic, maybe
> 'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that
> more checks are needed before particular attribute can actually be used in
> UniqueKey.

I don't think so, UniqueKey is just one of the places to use this
not-null property, see 3af704098 for the another user case of it.

(Because of 3af704098, we should leverage notnullattnums somehow in this
patch, which will be included in the next version as well).

> * add_uniquekey_for_uniqueindex()
>
> I'd appreciate an explanation why the "single-row UK" is created. I think
> the reason for unique_exprs==NIL is that a restriction clause VAR=CONST
> exists for each column of the unique index. Whether I'm right or not, a
> comment should state clearly what the reason is.

You are understanding it correctly. I will add comments in the next
version.

>
> * uniquekey_useful_for_merging()
>
> How does uniqueness relate to merge join? In README.uniquekey you seem to
> point out that a single row is always sorted, but I don't think this
> function is related to that fact. (Instead, I'd expect that pathkeys are
> added to all paths for a single-row relation, but I'm not sure you do that
> in the current version of the patch.)

The merging is for "mergejoinable join clauses", see function
eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a";

> * is_uniquekey_useful_afterjoin()
>
> Now that my patch (0004) allows propagation of the unique keys from a
> subquery to the upper query, I was wondering if the UniqueKey structure
> needs the 'use_for_distinct field' I mean we should now propagate the unique
> keys to the parent query whether the subquery has DISTINCT clause or not. I
> noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I
> changed the function to always returned true. However nothing changed in the
> output of regression tests (installcheck). Do you insist that the
> 'use_for_distinct' field is needed?
>
>
> * uniquekey_contains_multinulls()
>
> ** Instead of calling the function when trying to use the UK, how about
> checking the ECs when considering creation of the UK? If the tests fail,
> just don't create the UK.

I don't think so since we maintain the UniqueKey from bottom to top, you
can double check if my reason is appropriate.

CREATE TABLE t1(a int);
CREATE INDEX ON t1(a);

SELECT distinct t1.a FROM t1 JOIN t2 using(a);

We need to create the UniqueKey on the baserel for t1 and the NULL
values is filtered out in the joinrel. so we have to creating it with
allowing NULL values first.

> ** What does the 'multi' word in the function name mean?

multi means multiple, I thought we use this short name in the many
places, for ex bt_multi_page_stats after a quick search.

> * relation_is_distinct_for()
>
> The function name is too similar to rel_is_distinct_for(). I think the name
> should indicate that you are checking the relation against a set of
> pathkeys. Maybe rel_is_distinct_for_pathkeys() (and remove 'distinct' from
> the argument name)? At the same time, it might be good to rename
> rel_is_distinct_for() to rel_is_distinct_for_clauses().

OK.

> * uniquekey_contains_in()
>
> Shouldn't this be uniquekey_contained_in()? And likewise, shouldn't the
> comment be " ... if UniqueKey is contained in the list of EquivalenceClass"
> ?

OK.
>
> (In general, even though I'm not English native speaker, I think I see quite
> a few grammar issues, which often make reading of the comments/documentation

Your English is really good:)

>
>
> * Combining the UKs
>
> IMO this is the most problematic part of the patch. You call
> populate_joinrel_uniquekeys() for the same join multiple times,

Why do you think so? The below code is called in "make_join_rel"

populate_joinrel_uniquekeys(root, joinrel, rel1, rel2, ...);

so it should be only called once per joinrel.

Is your original question is about populate_joinrel_uniquekey_for_rel
rather than populate_joinrel_uniquekeys? We have the below codes:

outeruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, outerrel,
innerrel, restrictlist);
inneruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, innerrel,
outerrel, restrictlist);

This is mainly because of the following theory. Quoted from
README.uniquekey. Let's called this as "rule 1".

"""
How to maintain the uniquekey?
-------------------------------
.. From the join relation, it is maintained with two rules:

- the uniquekey in one side is still unique if it can't be duplicated
after the join. for example:

SELECT t1.pk FROM t1 JOIN t2 ON t1.a = t2.pk;
UniqueKey on t1: t1.pk
UniqueKey on t1 Join t2: t1.pk
"""

AND the blow codes:

if (outeruk_still_valid || inneruk_still_valid)

/*
* the uniquekey on outers or inners have been added into joinrel so
* the combined uniuqekey from both sides is not needed.
*/
return;

We don't create the component uniquekey if any one side of the boths
sides is unique already. For example:

"(t1.id) in joinrel(t1, t2) is unique" OR "(t2.id) in joinrel is
unique", there is no need to create component UniqueKey (t1.id, t2.id);

> each time
> with a different 'restrictlist', and you try to do two things at the same
> time: 1) combine the UKs of the input relations into the UKs of the join
> relation, 2) check if the join relation can be marked single-row.
>
> I think that both 1) and 2) should be independent from join order, and thus
> both computations should only take place once for given set of input
> relations. And I think they should be done separately:
>
> 1) Compute the join UKs
>
> As you admit in a comment in populate_joinrel_uniquekeys(), neither join
> method nor clauses should matter. So I think you only need to pick the
> "component UKs" (i.e. UKs of the input relations) which are usable above
> that join (i.e. neither the join itself nor any join below sets any column
> of the UK to NULL) and combine them.

We need to do this only after the "if (!outeruk_still_valid &&
!inneruk_still_valid)" check, as explained above.

>
> Of course one problem is that the number of combinations can grow
> exponentially as new relations are joined.

Yes, that's why "rule 1" needed and "How to reduce the overhead" in
UniqueKey.README is introduced.

>
> 2) Check if the join relation is single-row
>
> I in order to get rid of the dependency on 'restrictlist', I think you can
> use ECs. Consider a query from your regression tests:
>
> CREATE TABLE uk_t (id int primary key, a int not null, b int not null, c int, d int, e int);
>
> SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and t1.id = 1;
>
> The idea here seems to be that no more than one row of t1 matches the query
> clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id ensures that
> no more than one row of t2 matches the query (because t1 cannot provide the
> clause with more than one input value of 'e'). And therefore, the join also
> produces at most one row.

You are correct and IMO my current code are able to tell it is a single
row as well.

1. Since t1.id = 1, so t1 is single row, so t1.d is unqiuekey as a
consequence.
2. Given t2.id is unique, t1.e = t2.id so t1's unqiuekey can be kept
after the join because of rule 1 on joinrel. and t1 is singlerow, so the
joinrel is singlerow as well.

I'm interested with "get rid of the dependency on 'restrictlist', I
think you can use ECs.", let's see what we can improve.
>
> My theory is that relation is single-row if it has an UK such that each of
> its ECs meets at least one of the following conditions:
>
> a) contains a constant

True.
>
> b) contains a column of a relation which has already been proven single-row.

True, not sure if it is easy to tell.

> b) is referenced by an UK of a relation which has already been proven
> single-row.

I can't follow here...

>
> I think that in the example above, an EC {t1.e, t2.id} should exist. So when
> checking whether 't2' is single-row, the condition b) cam be ised: the UK of
> 't2' should reference the EC {t1.e, t2.id}, which in turn contains the
> column t1.e. And 't1' is unique because its EC meets the condition a). (Of
> course you need to check em_jdomain before you use particular EM.)

I think the existing rule 1 for joinrel works well with the singlerow
case naturally, what can be improved if we add the theory you suggested
here?

--
Best Regards
Andy Fan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Lakhin 2024-05-06 09:00:00 Test equivclass interferes with tests tsearch and advisory_lock
Previous Message Anton A. Melnikov 2024-05-06 06:05:38 Re: psql: fix variable existence tab completion