Re: BUG #15032: Segmentation fault when running a particular query

From: Guo Xiang Tan <gxtan1990(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: BUG #15032: Segmentation fault when running a particular query
Date: 2018-01-26 23:46:23
Message-ID: CAEL+R-C0N+hM2OtE_s9xezy15w+18LOxgnYkjHh63SAi+3NxRA@mail.gmail.com
Lists: pgsql-bugs

> Unsurprisingly, the given info is not enough to reproduce the crash.

We can only reproduce this on our production PostgreSQL cluster. I took a
full pg_dump and restored it into a local PostgreSQL 10.1 cluster, but
could not reproduce the segfault there. If it helps, the data in the
production cluster was originally restored via pg_dump (10.1) from a
PostgreSQL 9.5 cluster.

On Sat, Jan 27, 2018 at 2:45 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> > ## Query that results in segmentation fault
>
> Unsurprisingly, the given info is not enough to reproduce the crash.
> However, looking at the stack trace:
>
> > (gdb) bt
> > #0 index_markpos (scan=0x0) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/index/indexam.c:373
> > #1 0x000055a812746c68 in ExecMergeJoin (pstate=0x55a8131bc778) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeMergejoin.c:1188
> > #2 0x000055a81272cf3f in ExecProcNode (node=0x55a8131bc778) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
> > #3 EvalPlanQualNext (epqstate=epqstate(at)entry=0x55a81318c518) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execMain.c:3005
> > #4 0x000055a81272d342 in EvalPlanQual (estate=estate(at)entry=0x55a81318c018, epqstate=epqstate(at)entry=0x55a81318c518, relation=relation(at)entry=0x7f4e4e25ab68, rti=1, lockmode=<optimized out>, tid=tid(at)entry=0x7ffd54492330, priorXmax=8959603) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execMain.c:2521
> > #5 0x000055a812747af7 in ExecUpdate (mtstate=mtstate(at)entry=0x55a81318c468, tupleid=tupleid(at)entry=0x7ffd54492450, oldtuple=oldtuple(at)entry=0x0, slot=<optimized out>, slot(at)entry=0x55a8131a2f08, planSlot=planSlot(at)entry=0x55a81319db60, epqstate=epqstate(at)entry=0x55a81318c518, estate=0x55a81318c018, canSetTag=1 '\001') at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeModifyTable.c:1113
>
> it seems fairly clear that somebody passed a NULL scandesc pointer
> to index_markpos. Looking at the only two callers of that function,
> this must mean that either an IndexScan's node->iss_ScanDesc or an
> IndexOnlyScan's node->ioss_ScanDesc was null. (We don't see
> ExecIndexMarkPos in the trace because the compiler optimized the tail
> call.) And that leads me to commit 09529a70b, which changed the logic
> in those node types to allow initialization of the index scandesc to be
> delayed to the first tuple fetch, rather than necessarily performed during
> ExecInitNode.
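
If I'm reading that right, the hazard can be modelled with a small standalone sketch (plain C, not actual PostgreSQL code; `ScanState`/`mark_pos` are stand-in names for `IndexScanState`/`index_markpos`, and the `-1` return stands in for the real code's null-pointer dereference):

```c
#include <stddef.h>

/* Stand-in for an index scan descriptor (IndexScanDesc in PostgreSQL). */
typedef struct ScanDesc { int marked; } ScanDesc;

/* Stand-in for IndexScanState: after commit 09529a70b the descriptor
 * starts out NULL and is only created on the first real tuple fetch. */
typedef struct ScanState {
    ScanDesc *scandesc;   /* NULL until the first fetch */
    ScanDesc  storage;    /* stands in for index_beginscan()'s allocation */
} ScanState;

/* An ordinary first fetch lazily opens the descriptor. */
static void first_fetch(ScanState *node)
{
    if (node->scandesc == NULL)
        node->scandesc = &node->storage;
}

/* Mimics index_markpos(): the real function dereferences the descriptor
 * unconditionally, so reaching it before any fetch dereferences NULL.
 * Here we return -1 where the real code would segfault. */
static int mark_pos(ScanState *node)
{
    if (node->scandesc == NULL)
        return -1;
    node->scandesc->marked = 1;
    return 0;
}
```

In the EPQ case the merge join calls the mark before the scan node has ever done a real fetch, which is exactly the `scandesc == NULL` branch above.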
>
> Because this is happening inside an EvalPlanQual, it's unsurprising
> that we'd be taking an unusual code path. I believe what happened
> was that the IndexScan node returned a jammed-in EPQ tuple on its
> first call, and so hadn't opened the scandesc at all, while
> ExecMergeJoin would do an ExecMarkPos if the tuple matched (which
> it typically would if we'd gotten to EPQ), whereupon kaboom.
>
> It's tempting to think that this is an oversight in commit 09529a70b
> and we need to rectify it by something along the lines of teaching
> ExecIndexMarkPos and ExecIndexOnlyMarkPos to initialize the scandesc
> if needed before calling index_markpos.
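
As a rough standalone sketch of that band-aid (hypothetical names, not a proposed patch against the real tree):

```c
#include <stddef.h>

typedef struct ScanDesc { int marked; } ScanDesc;

typedef struct IndexScanStateSketch {
    ScanDesc *iss_ScanDesc;   /* may still be NULL before the first fetch */
    ScanDesc  storage;
} IndexScanStateSketch;

/* Stands in for the node's deferred index_beginscan() call. */
static void open_scandesc(IndexScanStateSketch *node)
{
    if (node->iss_ScanDesc == NULL)
        node->iss_ScanDesc = &node->storage;
}

/* "Initialize if needed" variant of ExecIndexMarkPos: open the scan
 * descriptor on demand before marking, instead of crashing on NULL. */
static void exec_index_mark_pos(IndexScanStateSketch *node)
{
    if (node->iss_ScanDesc == NULL)
        open_scandesc(node);
    node->iss_ScanDesc->marked = 1;
}
```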
>
> However, on further reflection, it seems like this is a bug of far
> older standing, to wit that ExecIndexMarkPos/ExecIndexRestrPos
> are doing entirely the wrong thing when EPQ is active. It's not
> meaningful, or at least not correct, to be messing with the index
> scan state at all in that case. Rather, what the scan is supposed
> to do is return the single jammed-in EPQ tuple, and what "restore"
> ought to mean is "clear my es_epqScanDone flag so that that tuple
> can be returned again".
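
A toy model of that intended semantics (es_epqScanDone is the real executor field being modelled; everything else here is made up for illustration):

```c
#include <stdbool.h>

/* Toy model of EPQ scan state: one jammed-in test tuple per relation,
 * plus a flag modelled on es_epqScanDone recording whether that tuple
 * has already been returned. */
typedef struct EpqScan {
    int  test_tuple;     /* the single substituted tuple (toy payload) */
    bool scan_done;      /* models es_epqScanDone for this rel */
} EpqScan;

/* Under EPQ the scan returns the jammed-in tuple exactly once. */
static int epq_fetch(EpqScan *epq)
{
    if (epq->scan_done)
        return -1;       /* end of scan */
    epq->scan_done = true;
    return epq->test_tuple;
}

/* Under EPQ, "restore" should not touch the index scan at all; it
 * should just clear the done flag so the same tuple comes back. */
static void epq_restore_pos(EpqScan *epq)
{
    epq->scan_done = false;
}
```

With that semantics a restore followed by a re-fetch yields the same EPQ tuple again, which is what the merge join expects of a restored position.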
>
> It's not clear to me whether the failure to do that has any real
> consequences though. It would only matter if there's more than
> one tuple available on the outer side of the mergejoin, which
> I think there never would be in an EPQ situation. Still, if there
> ever were more outer tuples, the mergejoin would misbehave and maybe
> even crash itself (because it probably assumes that restoring to
> a point where there had been a tuple would allow it to re-fetch
> that tuple successfully).
>
> So what I'm inclined to do is teach the mark/restore infrastructure
> to do the right thing with EPQ state when EPQ is active. But I'm not
> clear on whether that needs to be back-patched earlier than v10.
>
> regards, tom lane
>
