Re: UniqueKey v2

From: Antonin Houska <ah(at)cybertec(dot)at>
To: zhihuifan1213(at)163(dot)com
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, jian he <jian(dot)universality(at)gmail(dot)com>
Subject: Re: UniqueKey v2
Date: 2024-04-19 11:39:18
Message-ID: 7971.1713526758@antos
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

zhihuifan1213(at)163(dot)com wrote:

> Here is the v3, ...

I'm trying to enhance the join removal functionality (will post my patch in a
separate thread soon) and I consider your patch very helpful in this
area.

Following is my review. Attached are also some fixes and one enhancement:
propagation of the unique keys (UK) from a subquery to the parent query
(0004). (Note that 0001 is just your patch rebased).

* I think that, before EC is considered suitable for an UK, its ec_opfamilies
field needs to be checked. I try to do that in
find_ec_position_matching_expr(), see 0004.

* Set-returning functions (SRFs) can make a "distinct path" necessary even if
the join output is unique.

* RelOptInfo.notnullattrs

My understanding is that this field contains the set attributes whose
uniqueness is guaranteed by the unique key. They are acceptable because they
are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL
value makes the containing row filtered out, so the row cannot break
uniqueness of the output. Am I right?

If so, I suggest to change the field name to something less generic, maybe
'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that
more checks are needed before particular attribute can actually be used in
UniqueKey.

* add_uniquekey_for_uniqueindex()

I'd appreciate an explanation why the "single-row UK" is created. I think
the reason for unique_exprs==NIL is that a restriction clause VAR=CONST
exists for each column of the unique index. Whether I'm right or not, a
comment should state clearly what the reason is.

* uniquekey_useful_for_merging()

How does uniqueness relate to merge join? In README.uniquekey you seem to
point out that a single row is always sorted, but I don't think this
function is related to that fact. (Instead, I'd expect that pathkeys are
added to all paths for a single-row relation, but I'm not sure you do that
in the current version of the patch.)

* is_uniquekey_useful_afterjoin()

Now that my patch (0004) allows propagation of the unique keys from a
subquery to the upper query, I was wondering if the UniqueKey structure
needs the 'use_for_distinct field' I mean we should now propagate the unique
keys to the parent query whether the subquery has DISTINCT clause or not. I
noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I
changed the function to always returned true. However nothing changed in the
output of regression tests (installcheck). Do you insist that the
'use_for_distinct' field is needed?

* uniquekey_contains_multinulls()

** Instead of calling the function when trying to use the UK, how about
checking the ECs when considering creation of the UK? If the tests fail,
just don't create the UK.

** What does the 'multi' word in the function name mean?

* relation_is_distinct_for()

The function name is too similar to rel_is_distinct_for(). I think the name
should indicate that you are checking the relation against a set of
pathkeys. Maybe rel_is_distinct_for_pathkeys() (and remove 'distinct' from
the argument name)? At the same time, it might be good to rename
rel_is_distinct_for() to rel_is_distinct_for_clauses().

* uniquekey_contains_in()

Shouldn't this be uniquekey_contained_in()? And likewise, shouldn't the
comment be " ... if UniqueKey is contained in the list of EquivalenceClass"
?

(In general, even though I'm not English native speaker, I think I see quite
a few grammar issues, which often make reading of the comments/documentation
a bit difficult.)

* Combining the UKs

IMO this is the most problematic part of the patch. You call
populate_joinrel_uniquekeys() for the same join multiple times, each time
with a different 'restrictlist', and you try to do two things at the same
time: 1) combine the UKs of the input relations into the UKs of the join
relation, 2) check if the join relation can be marked single-row.

I think that both 1) and 2) should be independent from join order, and thus
both computations should only take place once for given set of input
relations. And I think they should be done separately:

1) Compute the join UKs

As you admit in a comment in populate_joinrel_uniquekeys(), neither join
method nor clauses should matter. So I think you only need to pick the
"component UKs" (i.e. UKs of the input relations) which are usable above
that join (i.e. neither the join itself nor any join below sets any column
of the UK to NULL) and combine them.

Of course one problem is that the number of combinations can grow
exponentially as new relations are joined. I'm not sure it's necessary to
combine the UKs (and to discard some of them) immediately. Instead, maybe we
can keep lists of UKs only for base relations, and postpone picking the
suitable UKs and combining them until we actually need to check the relation
uniqueness.

2) Check if the join relation is single-row

I in order to get rid of the dependency on 'restrictlist', I think you can
use ECs. Consider a query from your regression tests:

CREATE TABLE uk_t (id int primary key, a int not null, b int not null, c int, d int, e int);

SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and t1.id = 1;

The idea here seems to be that no more than one row of t1 matches the query
clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id ensures that
no more than one row of t2 matches the query (because t1 cannot provide the
clause with more than one input value of 'e'). And therefore, the join also
produces at most one row.

My theory is that relation is single-row if it has an UK such that each of
its ECs meets at least one of the following conditions:

a) contains a constant

b) contains a column of a relation which has already been proven single-row.

b) is referenced by an UK of a relation which has already been proven
single-row.

I think that in the example above, an EC {t1.e, t2.id} should exist. So when
checking whether 't2' is single-row, the condition b) cam be ised: the UK of
't2' should reference the EC {t1.e, t2.id}, which in turn contains the
column t1.e. And 't1' is unique because its EC meets the condition a). (Of
course you need to check em_jdomain before you use particular EM.)

Are you going to submit the patch to the first CF of PG 18?

Please let me know if I can contribute to the effort by reviewing or writing
some code.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment Content-Type Size
0001-Unique-keys-rebased.patch text/x-diff 50.2 KB
0002-Moved-functions-to-uniquekey.c.patch text/x-diff 4.9 KB
0003-Do-not-use-expression-indexes-to-create-unique-keys.patch text/x-diff 1.2 KB
0004-Teach-the-planner-to-pass-UniqueKey-nodes-from-a-sub.patch text/x-diff 40.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2024-04-19 11:49:45 Re: Okay to remove mention of mystery @ and ~ operators?
Previous Message Justin Pryzby 2024-04-19 11:34:46 Re: Add SPLIT PARTITION/MERGE PARTITIONS commands