Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943

From: Tender Wang <tndrwang(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: 1026592243(at)qq(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943
Date: 2024-04-18 12:13:10
Message-ID: CAHewXNkaKgVmT+OkVA9UHrEYm+b8J6o_8+-84Qey6V5tM-+z9A@mail.gmail.com
Lists: pgsql-bugs

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote on Tue, Apr 9, 2024 at 01:57:

> On 2024-Mar-05, PG Bug reporting form wrote:
>
> > #2 0x0000000000b8748d in ExceptionalCondition (conditionName=0xd25358
> > "partdesc->nparts >= pinfo->nparts", fileName=0xd24cfc "execPartition.c",
> > lineNumber=1943) at assert.c:66
> > #3 0x0000000000748bf1 in CreatePartitionPruneState (planstate=0x1898ad0,
> > pruneinfo=0x1884188) at execPartition.c:1943
> > #4 0x00000000007488cb in ExecInitPartitionPruning (planstate=0x1898ad0,
> > n_total_subplans=2, pruneinfo=0x1884188,
> > initially_valid_subplans=0x7ffdca29f7d0) at execPartition.c:1803
>
> I had been digging into this crash in late March and seeing if I could
> find a reliable fix, but it seems devilish and had to put it aside. The
> problem is that DETACH CONCURRENTLY does a wait for snapshots to
> disappear before doing the next detach phase; but since pgbench is using
> prepared mode, the wait is already long done by the time EXECUTE wants
> to run the plan. Now, we have relcache invalidations at the point where
> the wait ends, and those relcache invalidations should in turn cause the
> prepared plan to be invalidated, so we would get a new plan that
> excludes the partition being detached. But this doesn't happen for some
> reason that I haven't yet been able to understand.
>
> Still trying to find a proper fix. In the meantime, not using prepared
> plans should serve to work around the problem.
>
> --
> Álvaro Herrera PostgreSQL Developer —
> https://www.EnterpriseDB.com/
> "The ability of users to misuse tools is, of course, legendary" (David
> Steele)
> https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net
>
>
>
I have been analyzing this crash over the past few days, and I added a lot
of debug output to the code.
Finally, I found an execution sequence that triggers this assert, and I can
reproduce the crash with gdb instead of pgbench.

For example:
./psql postgres # session1, which does the detach; start it first
In another terminal, start gdb (call it gdb1):
gdb -p session1_pid
b ATExecDetachPartition

In session1, run: alter table p detach partition p1 concurrently;
Now session1 is stalled by gdb.

In the gdb1 terminal, step with next (n) until the first transaction calls
CommitTransactionCommand(), and stop there.

Start another session, session2, to do the select.
In session2, run: prepare p1 as select * from p where a = $1;

In a new terminal, start gdb (call it gdb2):
gdb -p session2_pid
b exec_simple_query
In session2, run: execute p1(1);
Now session2 is stalled by gdb.

In the gdb2 terminal, step into PortalRunUtility() and stop right after it
gets a snapshot.
For session2, the transaction updating pg_inherits is not yet committed.
Switch to the gdb1 terminal and continue stepping with next up to the call
to DetachPartitionFinalize().
Because session2 has not yet taken a lock on relation p, gdb1 can get past
WaitForLockersMultiple().

Now switch to gdb2 and continue. If we set a breakpoint at
find_inheritance_children_extended(),
we see a tuple whose inhdetachpending is true, but whose xmin is
in-progress for the session2 snapshot.
So this tuple is added to the output according to the logic, and we end up
with two partitions.
After returning from add_base_rels_to_query() in query_planner(), switch to
gdb1.
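
The inclusion rule described above can be sketched as a toy Python model.
This is only an illustration, not PostgreSQL code: InheritsTuple, Snapshot
and visible_children are hypothetical names, the second partition p2 and the
xid values are assumptions, and a snapshot is reduced to the set of
transaction IDs it considers in-progress.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InheritsTuple:
    """Toy stand-in for one pg_inherits row (hypothetical type)."""
    child: str
    xmin: int                     # xid that last wrote this row
    detach_pending: bool = False  # models inhdetachpending


@dataclass(frozen=True)
class Snapshot:
    """Toy snapshot: only the xids it sees as still in progress."""
    in_progress: frozenset


def visible_children(tuples, snapshot):
    """Return child names, omitting a detach-pending row only when the
    transaction that marked it is no longer in progress for the snapshot."""
    children = []
    for t in tuples:
        if t.detach_pending and t.xmin not in snapshot.in_progress:
            continue  # the detach is visible to this snapshot: omit the row
        children.append(t.child)
    return children


# Assumed data: xid 100 marked p1 as detach-pending; p2 is untouched.
rows = [
    InheritsTuple("p1", xmin=100, detach_pending=True),
    InheritsTuple("p2", xmin=50),
]

# session2's snapshot was taken before xid 100 committed.
planning_snapshot = Snapshot(in_progress=frozenset({100}))
# A snapshot taken after the commit no longer lists xid 100.
later_snapshot = Snapshot(in_progress=frozenset())

print(visible_children(rows, planning_snapshot))  # ['p1', 'p2']
print(visible_children(rows, later_snapshot))     # ['p2']
```

With the snapshot taken before the first detach transaction committed, the
detach-pending row still counts, which matches the two partitions seen while
planning in gdb2.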

In gdb1, enter DetachPartitionFinalize() and call RemoveInheritance() to
remove the tuple.
Then enter the command "continue" to finish the rest of the detach.

Now switch to gdb2 and set a breakpoint at RelationCacheInvalidateEntry().
Continue gdb2, and it will
stop at RelationCacheInvalidateEntry(), where we can see that the relcache
entry for p is cleared.
The backtrace is attached at the end of this email.

On entering ExecInitAppend(), part_prune_info is not null, so we
enter CreatePartitionPruneState().
We enter find_inheritance_children_extended() again to get the partdesc, but
by now gdb1 has finished DetachPartitionFinalize()
and the detach has committed. So we get only one tuple and nparts is 1.

Finally, we trigger the Assert: (partdesc->nparts >= pinfo->nparts).
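
The whole sequence boils down to the plan and the executor counting the
partitions at different times. A minimal Python sketch of that timeline
follows. It is a toy model under stated assumptions, not PostgreSQL code:
the catalog is a plain list, the second partition p2 is assumed, and
assert_condition_holds is a hypothetical helper that only mirrors the
condition of the failing Assert.

```python
def assert_condition_holds(partdesc_nparts, pinfo_nparts):
    """Mirror the Assert condition: partdesc->nparts >= pinfo->nparts."""
    return partdesc_nparts >= pinfo_nparts


# Toy catalog: children of p (p2 is an assumed second partition).
catalog = ["p1", "p2"]

# Planning in session2: the detach-pending row for p1 is still counted,
# so the pruning info built for the plan records two partitions.
pinfo_nparts = len(catalog)

# session1 finishes the detach: the row for p1 is removed and committed.
catalog.remove("p1")

# Executor startup in session2: the partition descriptor is rebuilt after
# the relcache invalidation and now contains a single partition.
partdesc_nparts = len(catalog)

print(pinfo_nparts, partdesc_nparts)
print(assert_condition_holds(partdesc_nparts, pinfo_nparts))
```

In this model the check fails because the plan was built against two
partitions while the descriptor rebuilt at executor startup has only one.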

--
Tender Wang
OpenPie: https://en.openpie.com/
