Re: BUG #18634: Wrong varnullingrels with merge ... when not matched by source

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18634: Wrong varnullingrels with merge ... when not matched by source
Date: 2024-09-29 07:49:39
Message-ID: CAEZATCXAJgiKRGpwWuWEyeHCyaGQgyf0UQXxnu0496k5ppzrYw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Sat, 28 Sept 2024 at 01:40, Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
>
> On Fri, Sep 27, 2024 at 11:43 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Could use some comments ... but actually, now I'm confused about why
> > any of this is the right thing at all:
> >
> > + * Similarly, any non-target Vars in the join condition will be added to
> > + * the targetlist by preprocess_targetlist(), and so must be marked as
> > + * nullable by the join, for LEFT and FULL joins.
> >
> > Why do we need these Vars in the tlist? If they're for re-evaluating
> > the join condition, isn't the already-nulled form of them the wrong
> > thing?
>

Doh. Yes, that's broken. A previous version of the WHEN NOT MATCHED BY
SOURCE patch used a qual of the form "src IS [NOT] NULL" to
distinguish between the MATCHED and NOT MATCHED BY SOURCE cases, but I
changed that to use the merge join condition as part of a larger
refactoring [1], which was a mistake.

[1] https://www.postgresql.org/message-id/CAEZATCV-7j0wq6T2AGsyRbaVY5muZ8KJ07L-%3D8Pvt%3Da3w1V5vA%40mail.gmail.com

A simple case to demonstrate the problem is this:

CREATE TABLE src (a int, b text);
INSERT INTO src VALUES (1, 'src row');

CREATE TABLE tgt (a int, b text);
INSERT INTO tgt VALUES (NULL, 'tgt row');

MERGE INTO tgt USING src ON tgt.a IS NOT DISTINCT FROM src.a
WHEN MATCHED THEN UPDATE SET a = src.a, b = src.b
WHEN NOT MATCHED BY SOURCE THEN DELETE;

SELECT * FROM tgt;
a | b
---+---
|
(1 row)

which is wrong (it executes the MATCHED UPDATE action instead of the
NOT MATCHED BY SOURCE DELETE action because src.a is NULL above the
join).

The simplest fix is to add a "src IS NOT NULL" wholerow check to the
executor recheck condition, similar to the prior implementation, which
I've done in the v2 patch, attached.

A slightly better fix would be to *replace* the executor check with
that IS NOT NULL test, and not actually recheck the original join
condition at all. That's already sufficient for the initial check at
the top of ExecMergeMatched(), and it would only require a minor
change to the EPQ rechecking further down to make it work fully. It
would, however, make the field names wrong/misleading, so probably
they would need updating if we did that, making it unsuitable for
back-patching. Since this is really only an optimisation, and it's not
clear how much difference it would actually make anyway, I'm inclined
to leave it as a possible future enhancement.

> I have the same concern. I think we should NOT mark the vars in
> mergeJoinCondition as nullable, as mergeJoinCondition acts as join
> quals rather than filter quals at that outer join. Instead, we should
> mark them nullable when they are pulled out and ready to be added to
> the targetlist, if they are really needed in the targetlist.
>

Actually, the marking is done after building the join node, so it's
only marking a copy of the join condition, for use above the join. The
original condition inside the join node remains unmarked, so I think
it's right.

For the sake of the archives, this "wrong varnullingrels" error is not
limited to the join condition. With the same table/view definitions as
in the original report, it can also be reproduced with queries like

MERGE INTO v USING (SELECT * FROM generate_series(1,1)) AS s(a) ON true
WHEN MATCHED THEN UPDATE SET a = s.a
WHEN NOT MATCHED BY SOURCE THEN DELETE;

and

MERGE INTO v USING (SELECT * FROM generate_series(1,1)) AS s(a) ON true
WHEN NOT MATCHED BY SOURCE THEN DELETE
RETURNING s.a;

Basically, anything in the ModifyTable node that accesses source
relation Vars, if the source relation is on the outer side of the
join.

I spent some time trying to figure out why none of the existing tests
hit this error, and I think the reason is that all the previous tests
involved a plan where the ModifyTable node is directly on top of the
join node, so the top targetlist was the join node's targetlist, and
therefore wasn't marked. But in the example here, there is a one-time
filter Result node between the ModifyTable node and the join node,
which means the ModifyTable node pulls from the Result node, whose
output is marked as nullable, because it's above the join. That makes
the error somewhat rare, though maybe there are other cases that can
lead to a plan node being inserted between the ModifyTable node and
the join node.

It feels a bit unsatisfactory that this wasn't detectable with a
ModifyTable node directly on top of the join node, making the bug hard
to spot, but I don't know whether it would be feasible to do anything
about that.

I have split this into 2 commits, since it's really 2 separate bugs,
and tried to improve the commentary.

Regards,
Dean

Attachment Content-Type Size
v2-0001-Fix-incorrect-non-strict-join-recheck-in-MERGE-WH.patch text/x-patch 6.5 KB
v2-0002-Fix-varnullingrels-markings-for-MERGE-WHEN-NOT-MA.patch text/x-patch 5.7 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Craig Milhiser 2024-09-29 23:03:12 Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Previous Message David G. Johnston 2024-09-28 13:43:21 Re: BUG #18639: Unexpected behavior with SET ROLE and CREATE ROLE interaction