Re: Removing unneeded self joins

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, Richard Guo <guofenglinux(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, "Gregory Stark (as CFM)" <stark(dot)cfm(at)gmail(dot)com>, Michał Kłeczek <michal(at)kleczek(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Removing unneeded self joins
Date: 2024-07-24 11:07:47
Message-ID: CAEZATCUwwvoaRuK+UuFF+S49KzKA3uMibjj7qUCOw_iSdtrZnA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, 20 Jul 2024 at 12:39, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>
> > The new function replace_relid() looks to be the same as adjust_relid_set().
>
> They are similar, not the same. replace_relid() has handling for
> negative newId, while adjust_relid_set() hasn't. One thing I'd like
> to borrow from adjust_relid_set() to replace_relid() is the usage of
> IS_SPECIAL_VARNO() macro.

Ah, that makes sense. In that case, I'd say that replace_relid()
should go in analyzejoins.c (and be a local function there), since
that's the only place that requires this special negative newId
handling.

> It would be probably nice to move this logic into bms_replace_member()
> residing at bitmapset.c. What do you think?

Maybe. It feels a little specialised though, so maybe it's not worth the effort.

I have been reviewing more of the patch, mainly focusing on the logic
in analyzejoins.c that decides when to apply SJE.

I understand broadly what the code is doing, but I still find it
somewhat hard to follow. One thing that makes it hard is that in
analyzejoins.c, "inner" and "outer" get swapped round at various
points. For example generate_join_implied_equalities() is defined like
this:

List *
generate_join_implied_equalities(PlannerInfo *root,
Relids join_relids,
Relids outer_relids,
RelOptInfo *inner_rel,
SpecialJoinInfo *sjinfo);

but remove_self_joins_one_group() calls it like this:

restrictlist = generate_join_implied_equalities(root, joinrelids,
inner->relids,
outer, NULL);

So you have to remember that "inner" is "outer" and "outer" is "inner"
when going into generate_join_implied_equalities() from
remove_self_joins_one_group(). And the same thing happens when calling
innerrel_is_unique_ext() and match_unique_clauses(). I think all that
could be resolved by swapping "inner" and "outer" in the variable
names and comments in remove_self_joins_one_group().

Another thing I noticed in remove_self_joins_one_group() was this:

/*
* To enable SJE for the only degenerate case without any self
* join clauses at all, add baserestrictinfo to this list. The
* degenerate case works only if both sides have the same clause.
* So doesn't matter which side to add.
*/
selfjoinquals = list_concat(selfjoinquals, outer->baserestrictinfo);

That appears to be pointless, because is_innerrel_unique_for() will
filter the restrictlist it is given, removing those baserestrictinfo
clauses (because I think they'll always have can_join = false). And
then relation_has_unique_index_ext() will re-add them:

/*
* Examine the rel's restriction clauses for usable var = const clauses
* that we can add to the restrictlist.
*/
foreach(ic, rel->baserestrictinfo)
{
... add suitable clauses
}

where "rel" is "innerrel" from is_innerrel_unique_for(), which is
"outer" from remove_self_joins_one_group(), so it's the same set of
baserestrictinfo clauses.

Something else that looks a little messy is this in innerrel_is_unique_ext():

/*
* innerrel_is_unique_ext
* Do the same as innerrel_is_unique(), but also set to '*extra_clauses'
* additional clauses from a baserestrictinfo list that were used to prove
* uniqueness. A non NULL 'extra_clauses' indicates that we're checking
* for self-join and correspondingly dealing with filtered clauses.
*/
bool
innerrel_is_unique_ext(PlannerInfo *root,
...
List **extra_clauses)
{
bool self_join = (extra_clauses != NULL);

[logic depending on self_join]
}

This presumes that any caller interested in knowing the extra
baserestrictinfo clauses used to prove uniqueness must be looking at a
self join. That may be true today, but it doesn't seem like a good API
design choice. I think it would be better to just add "self_join" as
an extra parameter, and also maybe have the function return the
UniqueRelInfo containing the "extra_clauses", or NULL if it's not
unique. That way, it would be more extensible, if we wanted it to
return more information in the future.

Instead of adding relation_has_unique_index_ext(), maybe it would be
OK to just change the signature of relation_has_unique_index_for(). It
looks like it's only called from one other place outside
analyzejoins.c. Perhaps the same is true for innerrel_is_unique_ext().

Should match_unique_clauses() be comparing mergeopfamilies or opnos to
ensure that the clauses are using the same equality operator?

Regards,
Dean

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrey M. Borodin 2024-07-24 11:08:01 Re: [PATCH] GROUP BY ALL
Previous Message Alvaro Herrera 2024-07-24 10:57:41 PG buildfarm member cisticola