Re: BUG #18559: Crash after detaching a partition concurrently from another session

From: Junwang Zhao <zhjwpku(at)gmail(dot)com>
To: Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org, Alvaro Herrera from 2ndQuadrant <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: BUG #18559: Crash after detaching a partition concurrently from another session
Date: 2024-08-10 13:54:58
Message-ID: CAEG8a3+qEsMW_R502GB9szReSGnhUvoNuvLci1wvw07+Lbegiw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Tue, Jul 30, 2024 at 9:52 PM Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com> wrote:
>
> On Tue, Jul 30, 2024 at 7:18 PM PG Bug reporting form
> <noreply(at)postgresql(dot)org> wrote:
> Adding Alvaro.
> >
> > The following code assumes that an pg_class entry for the detached partition
> > will always be available which is wrong.
> >
> Referring to the following code.
> /*
> * Two problems are possible here. First, a concurrent ATTACH
> * PARTITION might be in the process of adding a new partition, but
> * the syscache doesn't have it, or its copy of it does not yet have
> * its relpartbound set. We cannot just AcceptInvalidationMessages(),
> * because the other process might have already removed itself from
> * the ProcArray but not yet added its invalidation messages to the
> * shared queue. We solve this problem by reading pg_class directly
> * for the desired tuple.
> *
> * The other problem is that DETACH CONCURRENTLY is in the process of
> * removing a partition, which happens in two steps: first it marks it
> * as "detach pending", commits, then unsets relpartbound. If
> * find_inheritance_children_extended included that partition but we
> * below we see that DETACH CONCURRENTLY has reset relpartbound for
> * it, we'd see an inconsistent view. (The inconsistency is seen
> * because table_open below reads invalidation messages.) We protect
> * against this by retrying find_inheritance_children_extended().
> */
> if (boundspec == NULL)
> {
> Relation pg_class;
> SysScanDesc scan;
> ScanKeyData key[1];
> Datum datum;
> bool isnull;
>
> pg_class = table_open(RelationRelationId, AccessShareLock);
> ScanKeyInit(&key[0],
> Anum_pg_class_oid,
> BTEqualStrategyNumber, F_OIDEQ,
> ObjectIdGetDatum(inhrelid));
> scan = systable_beginscan(pg_class, ClassOidIndexId, true,
> NULL, 1, key);
> tuple = systable_getnext(scan);
> datum = heap_getattr(tuple, Anum_pg_class_relpartbound,
> RelationGetDescr(pg_class), &isnull);
> if (!isnull)
> boundspec = stringToNode(TextDatumGetCString(datum));
> systable_endscan(scan);
> table_close(pg_class, AccessShareLock);
>
> IIUC, a simple fix would be to retry if an entry is not found. Attached a patch.

I can reproduce the issue, and the patch LGTM.

>
> --
> Thanks & Regards,
> Kuntal Ghosh

--
Regards
Junwang Zhao

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Sahu, Abhisek Kumar 2024-08-12 05:37:50 RE: BUG #18569: Memory leak in Postgres Enterprise server
Previous Message Tomas Vondra 2024-08-09 19:55:00 Re: FDW INSERT batching can change behavior