From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-07-21 19:00:01
Message-ID: 00e24aa6-5d03-99b6-ab41-495a35fc93f6@gmail.com
Lists: pgsql-bugs
Hello Tom,

Thank you for your thoughts!
21.07.2023 20:20, Tom Lane wrote:
> Alexander Lakhin <exclusion(at)gmail(dot)com> writes:
>> I think that we need to determine the level where the problem that should
>> be fixed is:
>> 1) test xmlmap fails sporadically due to the catalog changes caused by
>> parallel tests activity
>> 2) schema_to_xmlschemaX() can fail when parallel workers are used
>> 3) has_table_privilegeX() can fail sporadically when executed within a
>> parallel worker
>> 4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
>> when repeated in a parallel worker
> Yeah, that's not immediately obvious. IIUC, the situation we are
> looking at is that SearchSysCacheExists can succeed even though the
> tuple we found is already dead at the instant that the function
> exits (thanks to absorption of inval messages during relation_close).
> The fact that that only happens in parallel workers is pure chance
> really. It is not okay for has_table_privilegeX to depend on the
> fact that the surrounding query already has some lock on pg_class.
> So this means that the approach has_table_privilegeX uses of
> assuming that successful SearchSysCacheExists means it can call
> pg_class_aclcheck without fear is just broken.
>
> If we suppose that that assumption is only being made in the
> has_foo_privilege functions, then one way we could fix it is to extend
> the API of pg_class_aclcheck etc to add a no-error-on-not-found flag,
> and get rid of the separate SearchSysCacheExists check. However,
> I can't avoid the suspicion that we have other places assuming the
> same thing.
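For reference, the pattern in question looks roughly like this (condensed
from has_table_privilege_id() in src/backend/utils/adt/acl.c; quoted from
memory, so not verbatim):

Datum
has_table_privilege_id(PG_FUNCTION_ARGS)
{
	Oid			tableoid = PG_GETARG_OID(0);
	text	   *priv_type_text = PG_GETARG_TEXT_PP(1);
	Oid			roleid = GetUserId();
	AclMode		mode = convert_table_priv_string(priv_type_text);
	AclResult	aclresult;

	/* the existence check... */
	if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(tableoid)))
		PG_RETURN_NULL();

	/*
	 * ...and the lookup it is supposed to protect; if inval messages are
	 * absorbed between the two, pg_class_aclcheck() can still fail with
	 * "relation with OID ... does not exist".
	 */
	aclresult = pg_class_aclcheck(tableoid, roleid, mode);

	PG_RETURN_BOOL(aclresult == ACLCHECK_OK);
}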
Running through SearchSysCacheExistsX() calls, I've found an interesting
(and optimistic) comment in src/backend/catalog/namespace.c:
 * ... The underlying IsVisible functions
 * always use an up-to-date snapshot and so might see the object as already
 * gone when it's still visible to the transaction snapshot. (There is no race
 * condition in the current coding because we don't accept sinval messages
 * between the SearchSysCacheExists test and the subsequent lookup.)
 */
Datum
pg_table_is_visible(PG_FUNCTION_ARGS)
{
	Oid			oid = PG_GETARG_OID(0);

	if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(oid)))
		PG_RETURN_NULL();

	PG_RETURN_BOOL(RelationIsVisible(oid));
}
...
bool
RelationIsVisible(Oid relid)
{
	HeapTuple	reltup;
	Form_pg_class relform;
	Oid			relnamespace;
	bool		visible;

	reltup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
	if (!HeapTupleIsValid(reltup))
		elog(ERROR, "cache lookup failed for relation %u", relid);
So that's exactly the same coding pattern; hence fixing only
pg_class_aclcheck is apparently not an option.
> So I think what we really ought to be doing is one
> of two things:
>
> 1. Hack SearchSysCacheExists to account for this issue, by making it
> loop if it finds a syscache entry but sees that the entry is already
> dead. (We have to loop, not just return false, in case the row was
> updated rather than deleted.) Maybe all the syscache lookup
> functions need to do likewise; it's certainly not intuitively
> reasonable for them to return already-known-stale entries.
>
> 2. Figure out how come we are executing a cache inval on the way
> out of syscache entry creation, and stop that from happening.
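Regarding #1, a rough sketch of the retry loop (just an illustration; it
assumes we may peek at the containing CatCTup's "dead" flag the same way
ReleaseCatCache() does):

bool
SearchSysCacheExists(int cacheId,
					 Datum key1, Datum key2, Datum key3, Datum key4)
{
	for (;;)
	{
		HeapTuple	tuple;
		CatCTup    *ct;
		bool		stale;

		tuple = SearchSysCache(cacheId, key1, key2, key3, key4);
		if (!HeapTupleIsValid(tuple))
			return false;

		/* peek at the containing catcache entry before releasing it */
		ct = (CatCTup *) (((char *) tuple) - offsetof(CatCTup, tuple));
		stale = ct->dead;
		ReleaseSysCache(tuple);

		if (!stale)
			return true;

		/*
		 * The entry was invalidated under us; the row may have been
		 * updated rather than deleted, so loop and retry instead of
		 * returning false.
		 */
	}
}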
As for #2, I wrote about LockRelationOid() earlier, and I still think that
the invalidation messages are processed in that call (reached via
SearchCatCacheMiss() -> systable_beginscan() -> index_open() ->
relation_open(), not via relation_close()). So it seems that a
SearchSysCacheX call finds an entry that is not dead at that moment, but
the same entry can already be dead (not found) by the next call, once the
invalidation messages have been processed.
> I like #2 better if it's not hard to do cleanly. However, I'm not
> quite sure how we are getting to an inval during relation close;
> maybe that's not something we want to prevent.
Yes, there is a detailed comment in LockRelationOid() that explains why
AcceptInvalidationMessages() is called. (I tried removing that call, just
for testing, and got 6 test failures during `make check`.)
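For reference, the relevant part of LockRelationOid() (quoted from memory
from src/backend/storage/lmgr/lmgr.c, with the long comment abbreviated)
is roughly:

void
LockRelationOid(Oid relid, LOCKMODE lockmode)
{
	LOCKTAG		tag;
	LOCALLOCK  *locallock;
	LockAcquireResult res;

	SetLocktagRelationOid(&tag, relid);

	res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);

	/*
	 * Now that we have the lock, check for invalidation messages, so that
	 * we will update or flush any stale relcache entry before we try to
	 * use it. ...
	 */
	if (res != LOCKACQUIRE_ALREADY_CLEAR)
	{
		AcceptInvalidationMessages();
		MarkLockClear(locallock);
	}
}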
Best regards,
Alexander