Re: BUG #18559: Crash after detaching a partition concurrently from another session

From: Alvaro Herrera from 2ndQuadrant <alvherre(at)alvh(dot)no-ip(dot)org>
To: Junwang Zhao <zhjwpku(at)gmail(dot)com>
Cc: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, Tender Wang <tndrwang(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18559: Crash after detaching a partition concurrently from another session
Date: 2024-08-12 22:33:09
Message-ID: 202408122233.bo4adt3vh5bi@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 2024-Aug-12, Alvaro Herrera from 2ndQuadrant wrote:

> On 2024-Aug-10, Junwang Zhao wrote:
>
> > > IIUC, a simple fix would be to retry if an entry is not found. Attached a patch.
> >
> > I can reproduce the issue, and the patch LGTM.
>
> Interesting issue, thanks for reporting and putting together a
> reproducer. I have added some comments to the proposed patch, so here's
> a v2 for it. I'm going to write a commit message for it and push to all
> branches since 14.

Dept. of second thoughts. I couldn't find any reason why it's okay to
dereference the return value from systable_getnext() without verifying
that it's not NULL, so I added the check to the older branches too. As
far as I know we've never had a crash report that could be traced to
lack of that check, but that code still looks like it's assuming a
little too much.

One more thing here. Applying the test scripts that I used for the
previous bug with addition of DROP after the DETACH CONCURRENTLY, I get
a different failure during planning, which reports this error:

ERROR: could not open relation with OID 457639
STATEMENT: select * from p where a = $1;

and changing that error (in relation_open) from ERROR to PANIC would
result in this backtrace:

#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo(at)entry=6, no_tid=no_tid(at)entry=0) at ./nptl/pthread_kill.c:44
#1 0x00007f149c3cae8f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2 0x00007f149c37bfb2 in __GI_raise (sig=sig(at)entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007f149c366472 in __GI_abort () at ./stdlib/abort.c:79
#4 0x0000557db57b8611 in errfinish (filename=<optimized out>, lineno=61, funcname=0x557db5816128 <__func__.1> "relation_open")
at ../../../../../../pgsql/source/master/src/backend/utils/error/elog.c:599
#5 0x0000557db52441ad in relation_open (relationId=17466, lockmode=lockmode(at)entry=1)
at ../../../../../../pgsql/source/master/src/backend/access/common/relation.c:61
#6 0x0000557db537d1d9 in table_open (relationId=<optimized out>, lockmode=lockmode(at)entry=1)
at ../../../../../../pgsql/source/master/src/backend/access/table/table.c:44
#7 0x0000557db55b0cf8 in expand_partitioned_rtentry (root=root(at)entry=0x557db9c8b530, relinfo=relinfo(at)entry=0x557db9c8c2d8,
parentrte=parentrte(at)entry=0x557db9c8c178, parentRTindex=parentRTindex(at)entry=1, parentrel=parentrel(at)entry=0x7f149bc46060, parent_updatedCols=0x0,
top_parentrc=0x0, lockmode=1) at ../../../../../../pgsql/source/master/src/backend/optimizer/util/inherit.c:390
#8 0x0000557db55b121f in expand_inherited_rtentry (root=root(at)entry=0x557db9c8b530, rel=0x557db9c8c2d8, rte=0x557db9c8c178, rti=rti(at)entry=1)
at ../../../../../../pgsql/source/master/src/backend/optimizer/util/inherit.c:154
#9 0x0000557db558e11a in add_other_rels_to_query (root=root(at)entry=0x557db9c8b530)
at ../../../../../../pgsql/source/master/src/backend/optimizer/plan/initsplan.c:214
#10 0x0000557db5591983 in query_planner (root=root(at)entry=0x557db9c8b530, qp_callback=qp_callback(at)entry=0x557db5593470 <standard_qp_callback>,
qp_extra=qp_extra(at)entry=0x7ffd987c39e0) at ../../../../../../pgsql/source/master/src/backend/optimizer/plan/planmain.c:268
#11 0x0000557db5597968 in grouping_planner (root=root(at)entry=0x557db9c8b530, tuple_fraction=<optimized out>, tuple_fraction(at)entry=0, setops=setops(at)entry=0x0)
at ../../../../../../pgsql/source/master/src/backend/optimizer/plan/planner.c:1520
#12 0x0000557db559a8f3 in subquery_planner (glob=glob(at)entry=0x557db9c8c758, parse=parse(at)entry=0x557db9c8bf68, parent_root=parent_root(at)entry=0x0,
hasRecursion=hasRecursion(at)entry=false, tuple_fraction=tuple_fraction(at)entry=0, setops=setops(at)entry=0x0)
at ../../../../../../pgsql/source/master/src/backend/optimizer/plan/planner.c:1089
#13 0x0000557db559acea in standard_planner (parse=0x557db9c8bf68, query_string=<optimized out>, cursorOptions=2048, boundParams=0x0)
at ../../../../../../pgsql/source/master/src/backend/optimizer/plan/planner.c:415
#14 0x0000557db5677117 in pg_plan_query (querytree=querytree(at)entry=0x557db9c8bf68, query_string=query_string(at)entry=0x557db9cb34c8 "select * from p where a = $1;",
cursorOptions=cursorOptions(at)entry=2048, boundParams=boundParams(at)entry=0x0) at ../../../../../pgsql/source/master/src/backend/tcop/postgres.c:912
#15 0x0000557db5677273 in pg_plan_queries (querytrees=querytrees(at)entry=0x557db9d69268, query_string=0x557db9cb34c8 "select * from p where a = $1;",
cursorOptions=2048, boundParams=boundParams(at)entry=0x0) at ../../../../../pgsql/source/master/src/backend/tcop/postgres.c:1006
#16 0x0000557db57a39b3 in BuildCachedPlan (plansource=plansource(at)entry=0x557db9c5ff10, qlist=qlist(at)entry=0x557db9d69268, boundParams=boundParams(at)entry=0x0,
queryEnv=queryEnv(at)entry=0x0) at ../../../../../../pgsql/source/master/src/backend/utils/cache/plancache.c:962
#17 0x0000557db57a4204 in GetCachedPlan (plansource=plansource(at)entry=0x557db9c5ff10, boundParams=boundParams(at)entry=0x557db9cc0778, owner=owner(at)entry=0x0,
queryEnv=queryEnv(at)entry=0x0) at ../../../../../../pgsql/source/master/src/backend/utils/cache/plancache.c:1199
#18 0x0000557db56786c5 in exec_bind_message (input_message=0x7ffd987c3e70) at ../../../../../pgsql/source/master/src/backend/tcop/postgres.c:2017
#19 PostgresMain (dbname=<optimized out>, username=<optimized out>) at ../../../../../pgsql/source/master/src/backend/tcop/postgres.c:4814

Clearly this is undesirable, but I'm not sure how strongly we need to
pursue a fix for it. It seems hard to forbid dropping the table after
the detach. Is it enough to advise users to not drop partitions
immediately after detaching them? Is this a sign of a more fundamental
problem in the mechanism for DETACH CONCURRENTLY when used together with
prepared statements?

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees." (E. Dijkstra)

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Jason Kim 2024-08-13 04:17:51 a row is not inserted in nested INSERT ON CONFLICT
Previous Message Cameron Vogt 2024-08-12 21:19:22 TLS session tickets disabled?