From: Andres Freund <andres(at)anarazel(dot)de>
To: will(at)extrahop(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17973: Reinit of pgstats entry for dropped DB can break autovacuum daemon
Date: 2023-06-14 21:57:44
Message-ID: 20230614215744.ddeuii7jcojbp3yv@awork3.anarazel.de
Lists: pgsql-bugs
Hi,
On 2023-06-14 14:37:24 -0700, Andres Freund wrote:
> On 2023-06-13 23:49:27 +0000, PG Bug reporting form wrote:
> > In Jacob's repro, the problem begins when AutoVacWorkerMain() is in
> > InitPostgres() and some other backend drops the database. Specifically, the
> > autovacuum worker's InitPostgres() is just about to obtain the lock at
> > https://github.com/postgres/postgres/blob/bd590d1fea1ba9245c791d589eea94d2dbad5a2b/src/backend/utils/init/postinit.c#L1012
> > . The backend dropping the DB marks the DB's pgstats entry as dropped but
> > can’t free it because its refcount is nonzero, so
> > AtEOXact_PgStat_DroppedStats() calls pgstat_request_entry_refs_gc(). The
> > autovacuum worker's InitPostgres() proceeds to call GetDatabaseTuple() and
> > notices the database has been dropped, at some point calling
> > pgstat_gc_entry_refs() to release its reference to the DB's pgstats entry,
> > and the worker decides to exit with a fatal error. But while exiting, it
> > calls pgstat_report_stat(), which calls pgstat_update_dbstats() for the
> > dropped DB, which calls pgstat_prep_database_pending(), which calls
> > pgstat_prep_pending_entry(), which calls pgstat_get_entry_ref() with create
> > == true, which calls pgstat_reinit_entry() against the DB's pgstats entry.
> > This sets dropped to false on that entry. Finally, the autovacuum worker
> > exits.
>
> Hm. It's pretty nasty that we end up in the proc_exit() hooks with
> MyDatabaseId set, even though we aren't actually validly connected to that
> database.
>
> ISTM that at the very least we should make "Recheck pg_database to make sure
> the target database hasn't gone away." in InitPostgres() unset
> MyDatabaseId, MyProc->databaseId.
>
> We unfortunately can't just defer setting up MyProc->databaseId, I
> think. Otherwise a concurrent vacuum could decide to remove database contents
> we think we still can see.
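For reference, the exit-time path from the report, roughly (call sites
simplified, arguments elided):

    proc_exit()
      -> pgstat_report_stat(...)             /* MyDatabaseId still set */
        -> pgstat_update_dbstats(...)        /* for the already-dropped DB */
          -> pgstat_prep_database_pending(MyDatabaseId)
            -> pgstat_prep_pending_entry(...)
              -> pgstat_get_entry_ref(..., create = true)
                -> pgstat_reinit_entry(...)  /* clears the entry's "dropped" flag */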
Hm. On second thought, I don't see any reason to do
    /*
     * Now we can mark our PGPROC entry with the database ID.
     *
     * We assume this is an atomic store so no lock is needed; though actually
     * things would work fine even if it weren't atomic. Anyone searching the
     * ProcArray for this database's ID should hold the database lock, so they
     * would not be executing concurrently with this store. A process looking
     * for another database's ID could in theory see a chance match if it read
     * a partially-updated databaseId value; but as long as all such searches
     * wait and retry, as in CountOtherDBBackends(), they will certainly see
     * the correct value on their next try.
     */
    MyProc->databaseId = MyDatabaseId;

    /*
     * We established a catalog snapshot while reading pg_authid and/or
     * pg_database; but until we have set up MyDatabaseId, we won't react to
     * incoming sinval messages for unshared catalogs, so we won't realize it
     * if the snapshot has been invalidated. Assume it's no good anymore.
     */
    InvalidateCatalogSnapshot();
before
    /*
     * Recheck pg_database to make sure the target database hasn't gone away.
     * If there was a concurrent DROP DATABASE, this ensures we will die
     * cleanly without creating a mess.
     */
    if (!bootstrap)
    {
as pg_database is a shared catalog, so we can safely access it at that
point. Or am I missing something?
If I am not, then we should defer setting MyDatabaseId until after "Recheck
pg_database", e.g. by setting 'dboid' for the in_dbname path.
I think that would likely end up with *less* code than before, because we'd
get rid of some redundant code.
It's a bit odd that we do the second lookup with dbname, rather than the oid
we previously looked up.
Afaics there's no reason to look up MyDatabaseTableSpace before the recheck.
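Something like this is what I have in mind, as a rough untested sketch
(presumably reusing GetDatabaseTupleByOid(); lock acquisition, error details
and the tablespace handling elided):

    /*
     * Recheck pg_database using the oid from the first lookup; pg_database
     * is a shared catalog, so MyDatabaseId doesn't need to be set for this.
     */
    if (!bootstrap)
    {
        HeapTuple   tuple;

        tuple = GetDatabaseTupleByOid(dboid);
        if (!HeapTupleIsValid(tuple))
            ereport(FATAL, ...);
        ...
    }

    /* only now mark our PGPROC entry and set MyDatabaseId */
    MyDatabaseId = dboid;
    MyProc->databaseId = MyDatabaseId;

    InvalidateCatalogSnapshot();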
Greetings,
Andres Freund