Re: [PATCH] Erase the distinctClause if the result is unique by definition

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Erase the distinctClause if the result is unique by definition
Date: 2020-02-08 07:22:47
Message-ID: CAKU4AWpe=mNMy_xs0JMQ=3_SJqAku4RiSZiGEuUwsThHPmUqKw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Ashutosh:
Thanks for your time.

On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
wrote:

> Hi Andy,
> What might help is to add more description to your email message like
> giving examples to explain your idea.
>
> Anyway, I looked at the testcases you added for examples.
> +create table select_distinct_a(a int, b char(20), c char(20) not null,
> d int, e int, primary key(a, b));
> +set enable_mergejoin to off;
> +set enable_hashjoin to off;
> +-- no node for distinct.
> +explain (costs off) select distinct * from select_distinct_a;
> + QUERY PLAN
> +-------------------------------
> + Seq Scan on select_distinct_a
> +(1 row)
>
> From this example, it seems that the distinct operation can be dropped
> because (a, b) is a primary key. Is my understanding correct?
>

Yes, you are correct. Actually I added then to commit message,
but it's true that I should have copied them in this email body as
well. so copy it now.

[PATCH] Erase the distinctClause if the result is unique by
definition

For a single relation, we can tell it by any one of the following
is true:
1. The pk is in the target list.
2. The uk is in the target list and the columns is not null
3. The columns in group-by clause is also in the target list

for relation join, we can tell it by:
if every relation in the jointree yields a unique result set, then
the final result is unique as well regardless the join method.
for semi/anti join, we will ignore the righttable.

I like the idea since it eliminates one expensive operation.
>
> However the patch as presented has some problems
> 1. What happens if the primary key constraint or NOT NULL constraint gets
> dropped between a prepare and execute? The plan will no more be valid and
> thus execution may produce non-distinct results.
>

Will this still be an issue if user use doesn't use a "read uncommitted"
isolation level? I suppose it should be ok for this case. But even though
I should add an isolation level check for this. Just added that in the
patch
to continue discussing of this issue.

> PostgreSQL has similar concept of allowing non-grouping expression as part
> of targetlist when those expressions can be proved to be functionally
> dependent on the GROUP BY clause. See check_functional_grouping() and its
> caller. I think, DISTINCT elimination should work on similar lines.
>
2. For the same reason described in check_functional_grouping(), using
> unique indexes for eliminating DISTINCT should be discouraged.
>

I checked the comments of check_functional_grouping, the reason is

* Currently we only check to see if the rel has a primary key that is a
* subset of the grouping_columns. We could also use plain unique
constraints
* if all their columns are known not null, but there's a problem: we need
* to be able to represent the not-null-ness as part of the constraints
added
* to *constraintDeps. FIXME whenever not-null constraints get represented
* in pg_constraint.

Actually I am doubtful the reason for pg_constraint since we still be able
to get the not null information from relation->rd_attr->attrs[n].attnotnull
which
is just what this patch did.

3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY
> as well
>

This is a good point. The rules may have some different for join, so I
prefer
to to focus on the current one so far.

> 4. The patch works only at the query level, but that functionality can be
> expanded generally to other places which add Unique/HashAggregate/Group
> nodes if the underlying relation can be proved to produce distinct rows.
> But that's probably more work since we will have to label paths with unique
> keys similar to pathkeys.
>

Do you mean adding some information into PlannerInfo, and when we create
a node for Unique/HashAggregate/Group, we can just create a dummy node?

> 5. Have you tested this OUTER joins, which can render inner side nullable?
>

Yes, that part was missed in the test case. I just added them.

On Thu, Feb 6, 2020 at 11:31 AM Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
>
>> update the patch with considering the semi/anti join.
>>
>> Can anyone help to review this patch?
>>
>> Thanks
>>
>>
>> On Fri, Jan 31, 2020 at 8:39 PM Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
>> wrote:
>>
>>> Hi:
>>>
>>> I wrote a patch to erase the distinctClause if the result is unique by
>>> definition, I find this because a user switch this code from oracle
>>> to PG and find the performance is bad due to this, so I adapt pg for
>>> this as well.
>>>
>>> This patch doesn't work for a well-written SQL, but some drawback
>>> of a SQL may be not very obvious, since the cost of checking is pretty
>>> low as well, so I think it would be ok to add..
>>>
>>> Please see the patch for details.
>>>
>>> Thank you.
>>>
>>
>
> --
> --
> Best Wishes,
> Ashutosh Bapat
>

Attachment Content-Type Size
0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch application/octet-stream 27.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message legrand legrand 2020-02-08 09:54:49 Re: POC: rational number type (fractions)
Previous Message Masahiko Sawada 2020-02-08 05:48:54 Re: Internal key management system