Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-21 19:00:01
Message-ID: 00e24aa6-5d03-99b6-ab41-495a35fc93f6@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hello Tom,

Thank you for your considerations!

21.07.2023 20:20, Tom Lane wrote:
> Alexander Lakhin <exclusion(at)gmail(dot)com> writes:
>> I think that we need to determine the level where the problem that should
>> be fixed is:
>> 1) test xmlmap fails sporadically due to the catalog changes caused by
>>  parallel tests activity
>> 2) schema_to_xmlschemaX() can fail when parallel workers are used
>> 3) has_table_privilegeX() can fail sporadically when executed within a
>>  parallel worker
>> 4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
>>  when repeated in a parallel worker
> Yeah, that's not immediately obvious. IIUC, the situation we are
> looking at is that SearchSysCacheExists can succeed even though the
> tuple we found is already dead at the instant that the function
> exits (thanks to absorption of inval messages during relation_close).
> The fact that that only happens in parallel workers is pure chance
> really. It is not okay for has_table_privilegeX to depend on the
> fact that the surrounding query already has some lock on pg_class.
> So this means that the approach has_table_privilegeX uses of
> assuming that successful SearchSysCacheExists means it can call
> pg_class_aclcheck without fear is just broken.
>
> If we suppose that that assumption is only being made in the
> has_foo_privilege functions, then one way we could fix it is to extend
> the API of pg_class_aclcheck etc to add a no-error-on-not-found flag,
> and get rid of the separate SearchSysCacheExists check. However,
> I can't avoid the suspicion that we have other places assuming the
> same thing.

Running through SearchSysCacheExistsX() calls, I've found an interesting
(and optimistic) comment in src/backend/catalog/namespace.c:

 * ...  The underlying IsVisible functions
 * always use an up-to-date snapshot and so might see the object as already
 * gone when it's still visible to the transaction snapshot.  (There is no race
 * condition in the current coding because we don't accept sinval messages
 * between the SearchSysCacheExists test and the subsequent lookup.)
 */

Datum
pg_table_is_visible(PG_FUNCTION_ARGS)
{
    Oid         oid = PG_GETARG_OID(0);

    if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(oid)))
        PG_RETURN_NULL();

    PG_RETURN_BOOL(RelationIsVisible(oid));
}

...

bool
RelationIsVisible(Oid relid)
{
    HeapTuple   reltup;
    Form_pg_class relform;
    Oid         relnamespace;
    bool        visible;

    reltup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
    if (!HeapTupleIsValid(reltup))
        elog(ERROR, "cache lookup failed for relation %u", relid);

So that's exactly the same coding pattern, thus fixing pg_class_aclcheck
is not an option, apparently.

> So I think what we really ought to be doing is one
> of two things:
>
> 1. Hack SearchSysCacheExists to account for this issue, by making it
> loop if it finds a syscache entry but sees that the entry is already
> dead. (We have to loop, not just return false, in case the row was
> updated rather than deleted.) Maybe all the syscache lookup
> functions need to do likewise; it's certainly not intuitively
> reasonable for them to return already-known-stale entries.
>
> 2. Figure out how come we are executing a cache inval on the way
> out of syscache entry creation, and stop that from happening.

I wrote about LockRelationOid() before, and I still think that invalidation
messages are processed in that call (reached via SearchCatCacheMiss() ->
systable_beginscan() -> index_open() -> relation_open(), not via
relation_close()). So it seems that SearchSysCacheX calls find an entry,
that is not dead, but that entry can be dead (not found) for the next call
if invalidation messages are processed.

> I like #2 better if it's not hard to do cleanly. However, I'm not
> quite sure how we are getting to an inval during relation close;
> maybe that's not something we want to prevent.

Yes, there is a detailed comment in LockRelationOid(), that explains why
AcceptInvalidationMessages() is called. (I've tried to remove that call
now, just for testing, and get 6 tests failed during `make check`.)

Best regards,
Alexander

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2023-07-21 19:21:46 Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Previous Message Tom Lane 2023-07-21 17:20:55 Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used