Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, Richard Guo <guofenglinux(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Peter Geoghegan <pg(at)bowt(dot)ie>
Subject: Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger
Date: 2023-07-09 15:00:01
Message-ID: e989408c-1838-61cd-c968-1fcf47c6fbba@gmail.com
Lists: pgsql-bugs

Hello Tom,

I've found enough time this week to investigate the issue more deeply, and
now I can hopefully answer your questions.

30.04.2023 01:24, Tom Lane wrote:
> I am not entirely comfortable with blaming this on 86dc90056 though,
> even though "git bisect" seems to finger that. The previous coding
> there with ExecFilterJunk() also produced a virtual tuple, as you
> can see by examining ExecFilterJunk, so why didn't the problem
> manifest before? I think the answer may be that before 86dc90056,
> we were effectively relying on the underlying SeqScan node to keep
> the buffer pinned, but now we're re-fetching the tuple and no pin
> is held by lower plan nodes. I'm not entirely sure about that though;
> ISTM a SeqScan ought to hold pin on its current tuple in any case.
> However, if the UPDATE plan were more complicated (involving a join)
> then it's quite believable that we'd get here with no relevant pin
> being held.

I couldn't confirm that the issue is specific to a SeqScan plan; it also
reproduces (and is fixed by the proposed patch) with a more complex query:
cat << "EOF" | psql
CREATE TABLE bruttest(id int, cnt int DEFAULT 0, t text);
CREATE FUNCTION noop_tfunc() RETURNS trigger LANGUAGE plpgsql AS $$ BEGIN
RETURN NEW; END$$;
CREATE TRIGGER brupdate_trigger BEFORE UPDATE ON bruttest FOR EACH ROW
EXECUTE FUNCTION noop_tfunc();
INSERT INTO bruttest(id, t) VALUES (1, repeat('x', 1000));

CREATE TABLE t2(id int, delta int);
INSERT INTO t2 VALUES(1, 1);
EOF

for ((i=1;i<=4;i++)); do
  echo "iteration $i"
  psql -c "UPDATE bruttest SET cnt = cnt + delta FROM t2 WHERE bruttest.id = t2.id" &
  psql -c "UPDATE bruttest SET cnt = cnt + delta FROM t2 WHERE bruttest.id = t2.id" &
  wait
done

---
psql -c "EXPLAIN UPDATE bruttest SET cnt = cnt + delta FROM t2 WHERE bruttest.id = t2.id"
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Update on bruttest  (cost=241.88..485.18 rows=0 width=0)
   ->  Merge Join  (cost=241.88..485.18 rows=13560 width=16)
         Merge Cond: (bruttest.id = t2.id)
         ->  Sort  (cost=83.37..86.37 rows=1200 width=14)
               Sort Key: bruttest.id
               ->  Seq Scan on bruttest  (cost=0.00..22.00 rows=1200 width=14)
         ->  Sort  (cost=158.51..164.16 rows=2260 width=14)
               Sort Key: t2.id
               ->  Seq Scan on t2  (cost=0.00..32.60 rows=2260 width=14)
(9 rows)

IIUC, the buffer is pinned by oldslot only, and newslot uses it without
explicit pinning because of the following call chain:
epqslot_clean = ExecGetUpdateNewTuple(relinfo=relinfo, planSlot=epqslot_candidate, oldSlot=oldslot):
    ProjectionInfo *newProj = relinfo->ri_projectNew;
    ...
    econtext->ecxt_scantuple = oldSlot;
    ...
    ExecProject(projInfo=newProj):
        ExprState  *state = &projInfo->pi_state;
        ...
        ExecEvalExprSwitchContext(state=state, econtext=projInfo->pi_exprContext):
            state->evalfunc(state, econtext, isNull) -> ExecInterpExpr(state, econtext, isNull):
                resultslot = state->resultslot;
                ...
                scanslot = econtext->ecxt_scantuple;
                ...
                EEOP_SCAN_FETCHSOME:
                    slot_getsomeattrs(scanslot, op->d.fetch.last_var);
                ...
                EEOP_ASSIGN_SCAN_VAR:
                    resultslot->tts_values[resultnum] = scanslot->tts_values[attnum];

Note that ExecGetUpdateNewTuple() returns relinfo->ri_projectNew->pi_state.resultslot,
so the only place where the buffer could be pinned correctly on behalf of
newslot is EEOP_ASSIGN_SCAN_VAR, but I don't think changing anything there is
an option.
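
To make the hazard concrete: for a pass-by-reference column (like the text
column in the repro above), the assignment in EEOP_ASSIGN_SCAN_VAR copies only
a pointer; roughly (the comment is my annotation, not a proposed change):

    /* EEOP_ASSIGN_SCAN_VAR: for a pass-by-reference attribute this stores
     * a pointer into the heap tuple that still resides in the shared
     * buffer pinned via oldslot; resultslot itself acquires no pin */
    resultslot->tts_values[resultnum] = scanslot->tts_values[attnum];
    resultslot->tts_isnull[resultnum] = scanslot->tts_isnull[attnum];

So once that pin is dropped, the datum stored in the result slot (i.e. in
newslot) may point into a buffer page that can be reused.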

> The practical impact of that concern is that I'm not convinced that
> the proposed patch fixes all variants of the issue. Maybe we need
> to materialize even if newslot != epqslot_clean.

I investigated all the code paths and came to the conclusion that the case
newslot != epqslot_clean is impossible.

1) When ExecBRUpdateTriggers() is called via ExecUpdatePrologue(), ...,
ExecModifyTable(), we have:
ExecModifyTable():
    /* Initialize projection info if first time for this table */
    if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
        ExecInitUpdateProjection(node, resultRelInfo);

    slot = ExecGetUpdateNewTuple(resultRelInfo, context.planSlot, oldSlot);
...
    ExecUpdate(..., resultRelInfo=resultRelInfo, ..., slot=slot, ...):
        ExecUpdatePrologue(..., resultRelInfo=resultRelInfo, ... , ..., slot=slot, ...):
            ExecMaterializeSlot(slot);
            ...
            ExecBRUpdateTriggers(..., relinfo=resultRelInfo, ..., newslot=slot, ...)
                epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);
(so epqslot_clean and newslot are both equal to
relinfo->ri_projectNew->pi_state.resultslot)

2) When ExecBRUpdateTriggers() is called via ExecMergeMatched(), ...,
ExecMerge(), epqslot_candidate is always NULL, because of
                    /*
                     * Recheck the tuple using EPQ. For MERGE, we leave this
                     * to the caller (it must do additional rechecking, and
                     * might end up executing a different action entirely).
                     */
inside GetTupleForTrigger().

3) When ExecBRUpdateTriggers() is called via ExecSimpleRelationUpdate(), ...,
apply_handle_tuple_routing():
FindReplTupleInLocalRel()
...
ExecSimpleRelationUpdate()
A concurrent update of the target tuple is prevented by table_tuple_lock() in
RelationFindReplTupleSeq()/RelationFindReplTupleByIndex(), called from
FindReplTupleInLocalRel().

Moreover, if epqslot_candidate != NULL were possible when ExecBRUpdateTriggers()
is called from ExecSimpleRelationUpdate()/ExecMergeMatched(), then we would
get a crash inside
epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);
because of
ExecGetUpdateNewTuple(relinfo, ...)
        ProjectionInfo *newProj = relinfo->ri_projectNew;
...
        Assert(relinfo->ri_projectNewInfoValid);
...
        econtext = newProj->pi_exprContext;

as ExecInitUpdateProjection(), which initializes ri_projectNew, is called only
from ExecModifyTable().

So I think it's safe to assume that newslot == epqslot_clean whenever
epqslot_candidate != NULL, and thus the ExecCopySlot() call may be removed.
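
If that reading is correct, the copy in ExecBRUpdateTriggers() could roughly
become an assertion instead (just a sketch of what I mean, not the patch text):

    epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);

    /* the caller's newslot was produced by the same projection, so it
     * must be the very same slot */
    Assert(newslot == epqslot_clean);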

> Maybe we need to
> do it even if there wasn't a concurrent update.

Without a concurrent update, i.e. when epqslot_candidate == NULL,
newslot is filled earlier, in ExecModifyTable():
    oldSlot = resultRelInfo->ri_oldTupleSlot;
...
    slot = ExecGetUpdateNewTuple(resultRelInfo, context.planSlot, oldSlot);
so newslot uses the oldSlot existing in ExecModifyTable(), not the oldslot
created in ExecBRUpdateTriggers().

> Maybe we need to
> do it in pre-v14 branches, too.

As I noted in my previous message, before 86dc90056 a different slot was
used when getting "some attrs", so I see no need for a fix in the pre-v14
branches.

> It's certainly not obvious that this
> function ought to assume anything about which slots are pointing at
> what.

I think the problem is in the sequence of calls:

        epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);
        trigtuple = ExecFetchSlotHeapTuple(oldslot, true, &should_free_trig);

Since ExecGetUpdateNewTuple() can fill attributes in oldslot, and
ExecFetchSlotHeapTuple() can release oldslot's underlying storage (as the
comment for that function says), ExecMaterializeSlot() should be called
in between to save the new tuple's contents.
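
That is, the ordering should become roughly the following (a sketch of the
idea, not the exact patch text):

    epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);

    /* newslot (== epqslot_clean) may hold pass-by-reference values that
     * point into oldslot's buffer; copy them into newslot's own memory
     * before the next call can release that storage */
    ExecMaterializeSlot(newslot);

    trigtuple = ExecFetchSlotHeapTuple(oldslot, true, &should_free_trig);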

> Also, if we do materialize here, does this code a little further down
> become redundant?
>
> if (should_free_trig && newtuple == trigtuple)
> ExecMaterializeSlot(newslot);

That is another interesting case, and I think it's a separate one. I could
not reproduce that issue (commit 75e03eabea from 2019 mentions zheap), but I
suppose it was real at the time of that commit, which came two years before
86dc90056. So I think that an ExecMaterializeSlot() call inside the condition
"if (epqslot_candidate != NULL)" would not cover that case.

As an outcome, I propose an improved patch that:
1) removes the unreachable ExecCopySlot() call;
2) adds one ExecMaterializeSlot() call that covers both cases;
3) removes the separate ExecMaterializeSlot() call, which becomes redundant.

This simplifies ExecBRUpdateTriggers() a little, but if it could affect
performance significantly (I don't see how), we could instead leave the
previous ExecMaterializeSlot() call intact and add another one just after
epqslot_clean = ExecGetUpdateNewTuple(...);

> A completely different way of looking at it is that we should not
> treat this as ExecBRUpdateTriggers's bug at all. The slot mechanisms
> are supposed to protect the data referenced by a slot, so why is that
> failing to happen in this example? The correct fix might involve
> newslot acquiring a buffer pin, for example.

Unfortunately, I could not find a place where such a pin could be acquired
(except for EEOP_ASSIGN_SCAN_VAR at the lower level).

> Bottom line is that I'm afraid the present patch is band-aiding a
> specific problem without solving the general case.

The next patch version is more generic, but still limited to the bounds of
ExecBRUpdateTriggers(), as I still see no problem outside them.

Best regards,
Alexander

Attachment Content-Type Size
v3-01-Fix-memory-access-in-BRU-trigger.patch text/x-patch 2.0 KB
