Re: Proper object locking for GRANT/REVOKE

From: Noah Misch <noah(at)leadboat(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proper object locking for GRANT/REVOKE
Date: 2024-12-09 01:25:41
Message-ID: 20241209012541.11.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Dec 02, 2024 at 12:13:56PM +0100, Peter Eisentraut wrote:
> On 25.11.24 02:24, Noah Misch wrote:
> > commit d31bbfb wrote:
> > > --- a/src/backend/catalog/aclchk.c
> > > +++ b/src/backend/catalog/aclchk.c
> > > @@ -659,147 +659,77 @@ ExecGrantStmt_oids(InternalGrant *istmt)
> > > * objectNamesToOids
> > > *
> > > * Turn a list of object names of a given type into an Oid list.
> > > - *
> > > - * XXX: This function doesn't take any sort of locks on the objects whose
> > > - * names it looks up. In the face of concurrent DDL, we might easily latch
> > > - * onto an old version of an object, causing the GRANT or REVOKE statement
> > > - * to fail.
> >
> > To prevent "latch onto an old version" and remove the last sentence of the
> > comment, we'd need two more things:
> >
> > - Use a self-exclusive lock here, not AccessShareLock. With two sessions
> > doing GRANT under AccessShareLock, one will "latch onto an old version".
> >
> > - Use a self-exclusive lock before *every* CatalogTupleUpdate() of a row that
> > GRANT/REVOKE can affect. For example, today's locking in ALTER FUNCTION is
> > the xmax stamped on the old tuple. If GRANT switched to
> > ShareUpdateExclusiveLock, concurrent ALTER FUNCTION could still cause GRANT
> > to "latch onto an old version".
>
> Ok, we should probably put that comment back in slightly altered form, like
>
> "XXX This function intentionally takes only an AccessShareLock ... $REASON.
> In the face of concurrent DDL, we might easily latch
> onto an old version of an object, causing the GRANT or REVOKE statement
> to fail."

Yep.

> > I wouldn't do those, however. It would make GRANT ALL TABLES IN SCHEMA
> > terminate every autovacuum running in the schema and consume a shared lock
> > table entry per table in the schema. I think the user-visible benefit of
> > commit d31bbfb plus this additional work is just changing "ERROR: tuple
> > concurrently updated" to blocking. That's not nothing, but I don't see it
> > outweighing autovacuum termination and lock table consumption spikes. What
> > other benefits and drawbacks should we weigh?
>
> I think what are describing is a reasonable tradeoff. The user experience
> is tolerable: "tuple concurrently updated" is a mildly useful error message,
> and it's probably the table owner executing both commands.
>
> The change to AccessShareLock at least prevents confusing "cache lookup
> failed" messages, and might alleviate some security concerns about swapping
> in a completely different object concurrently (even if we currently think
> this is not an actual problem).

Perhaps. To me, the v17 behavior smells mildly superior to the v18 behavior.

> > > --- a/src/test/isolation/expected/intra-grant-inplace.out
> > > +++ b/src/test/isolation/expected/intra-grant-inplace.out
> > > @@ -248,6 +248,6 @@ relhasindex
> > > -----------
> > > (0 rows)
> > > -s4: WARNING: got: cache lookup failed for relation REDACTED
> > > +s4: WARNING: got: relation "intra_grant_inplace" does not exist
> >
> > The affected permutation existed to cover the first LockRelease() in
> > SearchSysCacheLocked1(). Since this commit, that line no longer has coverage.
>
> Do you have an idea how such a test case could be constructed now?

A rough idea. The test worked because REVOKE used only LOCKTAG_TUPLE, which
didn't mind the LOCKTAG_RELATION from DROP TABLE.

One route might be to find another SearchSysCacheLocked1() caller that takes
no locks beyond the LOCKTAG_TUPLE taken right in SearchSysCacheLocked1(). I'd
add a temporary elog to report if that's happening.
check_lock_if_inplace_updateable_rel() is an example of reporting the absence
of a lock. If check-world w/ that elog finds some operation reaching that
circumstance, this test could replace REVOKE with that operation.

Another route would be to remove the catalog row without locking the
underlying object, e.g. by replacing DROP TABLE with DELETE FROM pg_class.
That's more artificial.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2024-12-09 01:28:26 Re: [Bug] Heap Use After Free in Window Aggregate Execution
Previous Message Tomas Vondra 2024-12-08 23:12:32 Re: Refactoring postmaster's code to cleanup after child exit