Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943

From: Tender Wang <tndrwang(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: 1026592243(at)qq(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18377: Assert false in "partdesc->nparts >= pinfo->nparts", fileName="execPartition.c", lineNumber=1943
Date: 2024-04-23 11:14:04
Message-ID: CAHewXNnpxy6rMNvBGZpTdgLosNTpEmZOzth6_m57kcU3kE4kTA@mail.gmail.com
Lists: pgsql-bugs

Tender Wang <tndrwang(at)gmail(dot)com> wrote on Thursday, April 18, 2024 at 20:13:

>
>
> Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote on Tuesday, April 9, 2024 at 01:57:
>
>> On 2024-Mar-05, PG Bug reporting form wrote:
>>
>> > #2 0x0000000000b8748d in ExceptionalCondition (conditionName=0xd25358
>> > "partdesc->nparts >= pinfo->nparts", fileName=0xd24cfc
>> "execPartition.c",
>> > lineNumber=1943) at assert.c:66
>> > #3 0x0000000000748bf1 in CreatePartitionPruneState
>> (planstate=0x1898ad0,
>> > pruneinfo=0x1884188) at execPartition.c:1943
>> > #4 0x00000000007488cb in ExecInitPartitionPruning (planstate=0x1898ad0,
>> > n_total_subplans=2, pruneinfo=0x1884188,
>> > initially_valid_subplans=0x7ffdca29f7d0) at execPartition.c:1803
>>
>> I had been digging into this crash in late March and seeing if I could
>> find a reliable fix, but it seems devilish and had to put it aside. The
>> problem is that DETACH CONCURRENTLY does a wait for snapshots to
>> disappear before doing the next detach phase; but since pgbench is using
>> prepared mode, the wait is already long done by the time EXECUTE wants
>> to run the plan. Now, we have relcache invalidations at the point where
>> the wait ends, and those relcache invalidations should in turn cause the
>> prepared plan to be invalidated, so we would get a new plan that
>> excludes the partition being detached. But this doesn't happen for some
>> reason that I haven't yet been able to understand.
>>
>> Still trying to find a proper fix. In the meantime, not using prepared
>> plans should serve to work around the problem.
>>
>> --
>> Álvaro Herrera PostgreSQL Developer —
>> https://www.EnterpriseDB.com/
>> "The ability of users to misuse tools is, of course, legendary" (David
>> Steele)
>> https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net
>>
>>
>>
> I have been analyzing this crash over the past few days and added a lot
> of debug output to the code.
> I finally found a code execution sequence that triggers this assert, and
> I can reproduce the crash with gdb instead of pgbench.
>
> For example:
> ./psql postgres    # session1, which will do the detach; start it first
> In another terminal, start gdb (call it gdb1):
> gdb -p session1_pid
> b ATExecDetachPartition
>
> In session1, enter: alter table p detach partition p1 concurrently;
> Session1 is now stopped by gdb.
>
> In the gdb1 terminal, step (e.g. with "n") until the first transaction
> calls CommitTransactionCommand(); we stop there.
>
> We start another session (session2) to do the select and
> enter: prepare p1 as select * from p where a = $1;
>
> We start a new terminal with gdb (call it gdb2):
> gdb -p session2_pid
> b exec_simple_query
> In session2, enter: execute p1(1);
> Session2 is now stopped by gdb.
>
> In the gdb2 terminal, we step into PortalRunUtility() and stop right
> after a snapshot has been taken.
> From session2's point of view, the transaction updating pg_inherits has
> not yet committed.
> We switch to the gdb1 terminal and continue stepping until
> DetachPartitionFinalize() is about to be called.
> Because session2 has not yet acquired a lock on relation p, gdb1 can get
> past WaitForLockersMultiple().
>
> Now we switch to gdb2 and continue. If we break on
> find_inheritance_children_extended(),
> we see a tuple whose inhdetachpending is true but whose xmin is still
> in progress with respect to session2's snapshot.
> According to the visibility logic, this tuple is added to the output, so
> we end up with two partitions.
> After returning from add_base_rels_to_query() in query_planner(), we
> switch to gdb1.
>
> In gdb1, we enter DetachPartitionFinalize(), which calls
> RemoveInheritance() to remove the tuple.
> We then enter "continue" to let the rest of the detach finish.
>
> Now we switch to gdb2 and set a breakpoint at
> RelationCacheInvalidateEntry(). When we continue gdb2, we stop at
> RelationCacheInvalidateEntry() and can see that the relcache entry for p
> is cleared.
> The backtrace is attached at the end of this email.
>
> Entering ExecInitAppend(), part_prune_info is not NULL, so we enter
> CreatePartitionPruneState().
> There we call find_inheritance_children_extended() again to build the
> partdesc, but by now gdb1 has finished DetachPartitionFinalize() and the
> detach has committed, so we get only one tuple and nparts is 1.
>
> Finally, we trigger the Assert: (partdesc->nparts >= pinfo->nparts).
>
>
> --
> Tender Wang
> OpenPie: https://en.openpie.com/
>

Sorry, I forgot to attach the backtrace showing the
RelationCacheInvalidateEntry() call during the planning phase in my last
email.

I found a self-contradictory comment in CreatePartitionPruneState():

/* For data reading, executor always omits detached partitions */
if (estate->es_partition_directory == NULL)
    estate->es_partition_directory =
        CreatePartitionDirectory(estate->es_query_cxt, false);

Shouldn't it say "does not omit", if I haven't misunderstood? We pass
false to the function.

I think we could rewrite the logic of CreatePartitionPruneState() as below:

if (partdesc->nparts == pinfo->nparts)
{
    /* no new partitions and no detached partitions */
}
else if (partdesc->nparts > pinfo->nparts)
{
    /* new partitions */
}
else
{
    /* detached partitions */
}

I haven't figured out a fix for the scenario I described in my last email.
--
Tender Wang
OpenPie: https://en.openpie.com/
