Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-16 20:00:01
Message-ID: b2dcd258-82f1-dfe3-498e-9f652a0c0c4b@gmail.com
Lists: pgsql-bugs

04.07.2023 14:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference: 18014
> Logged by: Alexander Lakhin
> Email address: exclusion(at)gmail(dot)com
> PostgreSQL version: 16beta2
> Operating system: Ubuntu 22.04
> Description:
>
> Yesterday's test failure on prion:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2023-07-03%2010%3A13%3A03
> made me wonder, what's going on there and whether it's yet another issue
> with invalidating relcache (bug #17994).
> (
> SELECT schema_to_xmlschema('testxmlschema', false, true, '');
> ERROR: relation with OID 29598 does not exist
> CONTEXT: SQL statement "SELECT oid FROM pg_catalog.pg_class WHERE
> relnamespace = 29597 AND relkind IN ('r','m','v') AND
> pg_catalog.has_table_privilege (oid, 'SELECT') ORDER BY relname;"

I investigated this case and would like to share my findings.
In has_table_privilege_id(), just below
     if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
         PG_RETURN_NULL();

I added the following loop:
for (int i = 0; i < 100; i++)
{
    bool    sce = SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid));

    if (!sce)
    {
        elog(LOG, "has_table_privilege_id(): no syscache entry on iteration %d", i);
        break;
    }
}
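
For reference, the surrounding code in has_table_privilege_id() looks
roughly like this (a from-memory sketch, not verbatim acl.c); the existence
check is immediately followed by pg_class_aclcheck(), which does its own
syscache lookup, so presumably that is where the "relation with OID ... does
not exist" error is raised when the entry vanishes in between:

Datum
has_table_privilege_id(PG_FUNCTION_ARGS)
{
    Oid         tableoid = PG_GETARG_OID(0);
    text       *priv_type_text = PG_GETARG_TEXT_PP(1);
    Oid         roleid = GetUserId();
    AclMode     mode = convert_table_priv_string(priv_type_text);
    AclResult   aclresult;

    /* the syscache entry for tableoid exists here... */
    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
        PG_RETURN_NULL();

    /* ...but can already be gone by the time this second lookup runs */
    aclresult = pg_class_aclcheck(tableoid, roleid, mode);

    PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
}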

I discovered that when the reproducing script uses parallel worker(s), the
syscache entry disappears during execution of this loop. That doesn't happen
when the query "SELECT oid FROM pg_catalog.pg_class WHERE ..." is executed
in a regular backend.
AFAICS, the difference is in LockRelationOid():
     res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);

    /*
     * Now that we have the lock, check for invalidation messages, so that we
     * will update or flush any stale relcache entry before we try to use it.
     * RangeVarGetRelid() specifically relies on us for this.  We can skip
     * this in the not-uncommon case that we already had the same type of lock
     * being requested, since then no one else could have modified the
     * relcache entry in an undesirable way.  (In the case where our own xact
     * modifies the rel, the relcache update happens via
     * CommandCounterIncrement, not here.)
     *
     * However, in corner cases where code acts on tables (usually catalogs)
     * recursively, we might get here while still processing invalidation
     * messages in some outer execution of this function or a sibling.  The
     * "cleared" status of the lock tells us whether we really are done
     * absorbing relevant inval messages.
     */
    if (res != LOCKACQUIRE_ALREADY_CLEAR)
    {
        AcceptInvalidationMessages();
        MarkLockClear(locallock);
    }
when LockRelationOid() is called for pg_class_oid_index inside
SearchCatCacheMiss() -> systable_beginscan() -> index_open() -> relation_open().
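
For clarity, that miss path looks roughly like this (a condensed sketch,
not verbatim catcache.c); the index is opened and closed again on every
cache miss, so the lock/unlock and the invalidation processing are repeated
each time:

    /* condensed sketch of SearchCatCacheMiss() */
    relation = table_open(cache->cc_reloid, AccessShareLock);
    scandesc = systable_beginscan(relation, cache->cc_indexoid, true,
                                  NULL, nkeys, cur_skey);
    /* -> index_open() -> relation_open() -> LockRelationOid(), which may
     *    call AcceptInvalidationMessages() as quoted above */
    while (HeapTupleIsValid(ntp = systable_getnext(scandesc)))
    {
        /* build the catcache entry from ntp */
    }
    systable_endscan(scandesc);   /* -> index_close() -> UnlockRelationId() */
    table_close(relation, AccessShareLock);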

The parallel worker doesn't have a lock on pg_class_oid_index before
executing the query, so it acquires the lock with res == LOCKACQUIRE_OK (not
LOCKACQUIRE_ALREADY_CLEAR, as in the regular-backend case). After that it
processes invalidation messages (which can make the backend use a newer
catalog snapshot), and at the end it does systable_endscan() ->
index_close() -> UnlockRelationId() -> LockRelease()...
Thus, on the next iteration it acquires the lock anew, again with res ==
LOCKACQUIRE_OK, and the whole ceremony repeats.
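
From memory, the relevant fast path at the top of LockAcquireExtended()
looks roughly like this (condensed, not verbatim lock.c); a lock that is
still held locally and already marked clear short-circuits the acquisition,
so AcceptInvalidationMessages() is skipped, whereas a lock that was released
after the previous scan (as in the worker) has no locallock left and goes
through the full acquisition, returning LOCKACQUIRE_OK:

    if (locallock->nLocks > 0)
    {
        GrantLockLocal(locallock, owner);
        if (locallock->lockCleared)
            return LOCKACQUIRE_ALREADY_CLEAR; /* caller skips inval processing */
        else
            return LOCKACQUIRE_ALREADY_HELD;
    }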

It's not clear to me whether this parallel-worker behavior is expected and,
if so, what should be fixed to avoid the test failure.

Best regards,
Alexander
