Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18014: Releasing catcache entries makes schema_to_xmlschema() fail when parallel workers are used
Date: 2023-10-14 07:00:00
Message-ID: defada9d-8bb1-860f-2682-eee03fdc0ab4@gmail.com
Lists: pgsql-bugs

13.10.2023 18:00, Alexander Lakhin wrote:

>
>> I spent some time looking through existing SearchSysCacheExists calls,
>> and I could only find two sets of routines where we seem to be
>> depending on SearchSysCacheExists to protect a subsequent lookup
>> somewhere else, and there isn't any lock on the object in question.
>> Those are the has_foo_privilege functions discussed here, and the
>> foo_is_visible functions near the bottom of namespace.c.  I'm not
>> sure why we've not heard complaints traceable to the foo_is_visible
>> family.  Maybe nobody has tried hard to break them, or maybe they
>> are just less likely to be used in ways that are at risk.
>
> I'll try to research/break xxx_is_visible and share my findings tomorrow.
>

I tried a script based on the initial reproducer [1]:
for ((n=1;n<=30;n++)); do
echo "ITERATION $n"

numclients=30
for ((c=1;c<=$numclients;c++)); do
cat << EOF | psql >psql_$c.log &
CREATE SCHEMA testxmlschema_$c;

SELECT format('CREATE TABLE testxmlschema_$c.test_%s (a int);', g) FROM
generate_series(1, 30) g
\\gexec

SET parallel_setup_cost = 1;
SET min_parallel_table_scan_size = '1kB';

SELECT oid FROM pg_catalog.pg_class WHERE relnamespace = 1 AND
 relkind IN ('r', 'm', 'v') AND pg_catalog.pg_table_is_visible(oid);

SELECT format('DROP TABLE testxmlschema_$c.test_%s', g) FROM
generate_series(1, 30) g
\\gexec

DROP SCHEMA testxmlschema_$c;
EOF
done
wait
grep 'ERROR:' server.log && break;
done

I couldn't get the error over multiple runs. (Here SELECT oid ... is
based on the query executed by schema_to_xmlschema().)
But I could reliably get the error with
s/pg_table_is_visible(oid)/has_table_privilege (oid, 'SELECT')/.
So there is a difference between these two functions. And the difference is
in their costs.
If I do "ALTER FUNCTION pg_table_is_visible COST 1" before the script,
I get the error as expected.
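
To switch between the two variants without editing the script by hand, the query can be generated with the predicate as a parameter. This is only a sketch: visibility_query is a hypothetical helper name, not part of the reproducer, and its default predicate is the one used in the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original script): prints the filter
# query with a replaceable predicate. The default is the predicate used in
# the script above.
visibility_query() {
  local predicate=${1:-"pg_catalog.pg_table_is_visible(oid)"}
  cat <<EOF
SELECT oid FROM pg_catalog.pg_class WHERE relnamespace = 1 AND
 relkind IN ('r', 'm', 'v') AND $predicate;
EOF
}

# The variant that produced the error reliably:
visibility_query "has_table_privilege (oid, 'SELECT')"
```

Inside the per-client heredoc, the output would be piped to psql in place of the hard-coded SELECT.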
With cost 10 I see the following plan:
 Index Scan using pg_class_relname_nsp_index on pg_class (cost=0.42..2922.38 rows=1 width=4)
   Index Cond: (relnamespace = '1'::oid)
   Filter: ((relkind = ANY ('{r,m,v}'::"char"[])) AND pg_table_is_visible(oid))

But with cost 1:
 Gather  (cost=1.00..257.10 rows=1 width=4)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on pg_class  (cost=0.00..256.00 rows=1 width=4)
         Filter: (pg_table_is_visible(oid) AND (relnamespace = '1'::oid) AND (relkind = ANY ('{r,m,v}'::"char"[])))
         Rows Removed by Filter: 405

The cost of the pg_foo_is_visible functions was increased in a80889a73.
But all the has_xxx_privilege functions have cost 1, except for
has_any_column_privilege, whose cost was also increased in 7449427a1.
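
The costs mentioned above can be checked in pg_proc (columns proname and procost). A minimal sketch follows; procost_query is a hypothetical helper name, and the LIKE patterns are just one way to select the two function families.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a query that lists the planner cost (procost)
# of all functions whose name matches the given LIKE pattern.
procost_query() {
  local pattern=$1
  cat <<EOF
SELECT proname, procost FROM pg_catalog.pg_proc
 WHERE proname LIKE '$pattern' ORDER BY proname;
EOF
}

procost_query 'pg_%_is_visible'
procost_query 'has_%_privilege'
```

Piping either output to psql shows the current costs on the server under test.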

So to see the issue we need several ingredients:
1) The mode CATCACHE_FORCE_RELEASE enabled (maybe some other way is
 possible, but I don't know of one);
    - Thanks to prion for that.
2) A function with the coding pattern
 "SearchSysCacheExistsX();  SearchSysCacheX();" called in a parallel worker;
    - Thanks to "debug_parallel_query = regress" and low cost of
      has_table_privilege() called by schema_to_xmlschema().
3) The catalog cache invalidated by some concurrent activity.
    - Thanks to running the test xmlmap in parallel with 16 other tests.
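
Ingredients 2 and 3 together amount to a check-then-use race. The following is only a file-based analogy in shell, not PostgreSQL code: entry_exists and entry_lookup stand in for SearchSysCacheExistsX() and SearchSysCacheX(), and the hook stands in for the concurrent activity. All names here are hypothetical.

```shell
#!/usr/bin/env bash
# File-based analogy only; none of these names exist in PostgreSQL.
cache_dir=$(mktemp -d)

# Stands in for SearchSysCacheExistsX(): only answers "is it there now?".
entry_exists() { [ -e "$cache_dir/$1" ]; }

# Stands in for SearchSysCacheX(): a second, independent lookup.
entry_lookup() { cat "$cache_dir/$1" 2>/dev/null; }

# check_then_use ENTRY [HOOK]
# HOOK runs between the check and the lookup, standing in for concurrent
# activity that makes the entry disappear.
check_then_use() {
  local name=$1 hook=${2:-true}
  entry_exists "$name" || { echo "not found"; return 0; }
  $hook
  entry_lookup "$name" || { echo "lookup failed after successful check"; return 1; }
}

drop_rel() { rm -f "$cache_dir/rel"; }

echo "row" > "$cache_dir/rel"
check_then_use rel                    # no interference: lookup succeeds
check_then_use rel drop_rel || true   # entry vanishes between check and lookup
```

The second call fails even though the existence check passed, which is the shape of failure described in the list above.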

[1] https://www.postgresql.org/message-id/18014-28c81cb79d44295d%40postgresql.org

Best regards,
Alexander
