Re: aggregate crash

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: aggregate crash
Date: 2020-01-15 20:47:47
Message-ID: 20200115204747.yhbvkngltgebrrxz@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-01-14 23:27:02 -0800, Andres Freund wrote:
> On 2020-01-14 17:54:16 -0500, Tom Lane wrote:
> > Andres Freund <andres(at)anarazel(dot)de> writes:
> > > I'm still not sure I actually fully understand the bug. It's obvious how
> > > returning the input value again could lead to memory not being freed (so
> > > that leak seems to go all the way back). And similarly, since the
> > > introduction of expanded objects, it can also lead to the expanded
> > > object not being deleted.
> > > But that's not the problem causing the crash here. What I think must
> > > instead be the problem is that pergroupstate->transValueIsNull, but
> > > pergroupstate->transValue is set to something looking like a
> > > pointer. Which caused us not to datumCopy() a new transition value into
> > > a long lived context. and then a later transition causes us to free the
> > > short-lived value?
> >
> > Yeah, I was kind of wondering that too. While formally the Datum value
> > for a null is undefined, I'm not aware offhand of any functions that
> > wouldn't return zero --- and this would have to be an aggregate transition
> > function doing so, which reduces the universe of candidates quite a lot.
> > Plus there's the question of how often a transition function would return
> > null for non-null input at all.
> >
> > Could we see a test case that provokes this crash, even if it doesn't
> > do so reliably?
>
> There's a larger reproducer referenced in the first message. I had hoped
> that Teodor could narrow it down - I guess I'll try to do that tomorrow...

FWIW, I'm working on narrowing it down to something small. I can
reliably trigger the bug, and I understand the mechanics, I
think. Interestingly enough the reproducer currently only triggers on
v12, not on v11 and before.

As you say, this requires a transition function returning a NULL that
has the datum part set - the reproducer here defines a non-strict
aggregate transition function that can indirectly do so:

CREATE FUNCTION public.state_max_bytea(st bytea, inp bytea) RETURNS bytea
LANGUAGE plpgsql
AS $$
BEGIN
if st is null
then
return inp;
elseif st<inp then
return inp;
else
return st;
end if;
END;$$;

CREATE AGGREGATE public.max(bytea) (
SFUNC = public.state_max_bytea,
STYPE = bytea
);

I.e. when the current transition is null (e.g. for the first tuple), the
transition is always set to new input value. Even if that is null.

Then the question in turn is, how the input datum is != 0, but has
isnull set. And that's caused by:

EEO_CASE(EEOP_FUNCEXPR_STRICT)
{
FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
NullableDatum *args = fcinfo->args;
int argno;
Datum d;

/* strict function, so check for NULL args */
for (argno = 0; argno < op->d.func.nargs; argno++)
{
if (args[argno].isnull)
{
*op->resnull = true;
goto strictfail;
}
}
fcinfo->isnull = false;
d = op->d.func.fn_addr(fcinfo);
*op->resvalue = d;
*op->resnull = fcinfo->isnull;

strictfail:
EEO_NEXT();
}

I.e. if the transitions argument is a strict function, and that strict
function is not evaluated because of a NULL input, we set op->resnull =
true, but do *not* touch op->resvalue. If there was a previous row that
actually set resvalue to something meaningful, we get an input to the
transition function consisting out of the old resvalue (!= 0), but the
new resnull = true. If the transition function returns that unchanged,
ExecAggTransReparent() doesn't do anything, because the new value is
null. Afterwards pergroup->transValue is set != 0, even though
transValueIsNull = true.

The somewhat tricky bit is arranging this to happen with pointers that
are the same. I think I'm on the way to narrow that down, but it'll take
me a bit longer.

To fix this I think we should set newVal = 0 in
ExecAggTransReparent()'s, as a new else to !newValueIsNull. That should
not add any additional branches, I think. I contrast to always doing so
when checking whether ExecAggTransReparent() ought to be called.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alex Adriaanse 2020-01-15 20:48:08 Re: Corruption with duplicate primary key
Previous Message Bruce Momjian 2020-01-15 20:41:04 Re: Append with naive multiplexing of FDWs