Quick Links

Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used

From:	Alexander Lakhin <exclusion(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject:	Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date:	2023-07-21 19:00:01
Message-ID:	00e24aa6-5d03-99b6-ab41-495a35fc93f6@gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

Hello Tom,

Thank you for your considerations!

21.07.2023 20:20, Tom Lane wrote:
> Alexander Lakhin <exclusion(at)gmail(dot)com> writes:
>> I think that we need to determine the level where the problem that should
>> be fixed is:
>> 1) test xmlmap fails sporadically due to the catalog changes caused by
>> parallel tests activity
>> 2) schema_to_xmlschemaX() can fail when parallel workers are used
>> 3) has_table_privilegeX() can fail sporadically when executed within a
>> parallel worker
>> 4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
>> when repeated in a parallel worker
> Yeah, that's not immediately obvious. IIUC, the situation we are
> looking at is that SearchSysCacheExists can succeed even though the
> tuple we found is already dead at the instant that the function
> exits (thanks to absorption of inval messages during relation_close).
> The fact that that only happens in parallel workers is pure chance
> really. It is not okay for has_table_privilegeX to depend on the
> fact that the surrounding query already has some lock on pg_class.
> So this means that the approach has_table_privilegeX uses of
> assuming that successful SearchSysCacheExists means it can call
> pg_class_aclcheck without fear is just broken.
>
> If we suppose that that assumption is only being made in the
> has_foo_privilege functions, then one way we could fix it is to extend
> the API of pg_class_aclcheck etc to add a no-error-on-not-found flag,
> and get rid of the separate SearchSysCacheExists check. However,
> I can't avoid the suspicion that we have other places assuming the
> same thing.

Running through SearchSysCacheExistsX() calls, I've found an interesting
(and optimistic) comment in src/backend/catalog/namespace.c:

* ... The underlying IsVisible functions
* always use an up-to-date snapshot and so might see the object as already
* gone when it's still visible to the transaction snapshot. (There is no race
* condition in the current coding because we don't accept sinval messages
* between the SearchSysCacheExists test and the subsequent lookup.)
*/

Datum
pg_table_is_visible(PG_FUNCTION_ARGS)
{
Oid oid = PG_GETARG_OID(0);

if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(oid)))
PG_RETURN_NULL();

PG_RETURN_BOOL(RelationIsVisible(oid));
}

...

bool
RelationIsVisible(Oid relid)
{
    HeapTuple   reltup;
    Form_pg_class relform;
    Oid         relnamespace;
    bool        visible;

    reltup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
    if (!HeapTupleIsValid(reltup))
        elog(ERROR, "cache lookup failed for relation %u", relid);

So that's exactly the same coding pattern, thus fixing pg_class_aclcheck
is not an option, apparently.

> So I think what we really ought to be doing is one
> of two things:
>
> 1. Hack SearchSysCacheExists to account for this issue, by making it
> loop if it finds a syscache entry but sees that the entry is already
> dead. (We have to loop, not just return false, in case the row was
> updated rather than deleted.) Maybe all the syscache lookup
> functions need to do likewise; it's certainly not intuitively
> reasonable for them to return already-known-stale entries.
>
> 2. Figure out how come we are executing a cache inval on the way
> out of syscache entry creation, and stop that from happening.

I wrote about LockRelationOid() before, and I still think that invalidation
messages are processed in that call (reached via SearchCatCacheMiss() ->
systable_beginscan() -> index_open() -> relation_open(), not via
relation_close()). So it seems that SearchSysCacheX calls find an entry,
that is not dead, but that entry can be dead (not found) for the next call
if invalidation messages are processed.

> I like #2 better if it's not hard to do cleanly. However, I'm not
> quite sure how we are getting to an inval during relation close;
> maybe that's not something we want to prevent.

Yes, there is a detailed comment in LockRelationOid(), that explains why
AcceptInvalidationMessages() is called. (I've tried to remove that call
now, just for testing, and get 6 tests failed during `make check`.)

Best regards,
Alexander

In response to

Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used at 2023-07-21 17:20:55 from Tom Lane

Responses

Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used at 2023-07-21 19:21:46 from Tom Lane

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Tom Lane	2023-07-21 19:21:46	Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Previous Message	Tom Lane	2023-07-21 17:20:55	Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used