Re: BUG #17973: Reinit of pgstats entry for dropped DB can break autovacuum daemon

From: Andres Freund <andres(at)anarazel(dot)de>
To: will(at)extrahop(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17973: Reinit of pgstats entry for dropped DB can break autovacuum daemon
Date: 2023-06-14 21:37:24
Message-ID: 20230614213724.gweo666oxqrugm44@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

Good catch!

On 2023-06-13 23:49:27 +0000, PG Bug reporting form wrote:
> We've reproduced this on both 15.2 and REL_15_STABLE (git commit
> bd590d1fea1ba9245c791d589eea94d2dbad5a2b). We've never seen a similar issue
> on 14 despite very similar conditions occurring frequently. We haven't yet
> tried on 16.

The relevant code is new for 15, so that makes sense.

> In Jacob's repro, the problem begins when AutoVacWorkerMain() is in
> InitPostgres() and some other backend drops the database. Specifically, the
> autovacuum worker's InitPostgres() is just about to obtain the lock at
> https://github.com/postgres/postgres/blob/bd590d1fea1ba9245c791d589eea94d2dbad5a2b/src/backend/utils/init/postinit.c#L1012
> . The backend dropping the DB marks the DB's pgstats entry as dropped but
> can’t free it because its refcount is nonzero, so
> AtEOXact_PgStat_DroppedStats() calls pgstat_request_entry_refs_gc(). The
> autovacuum worker's InitPostgres() proceeds to call GetDatabaseTuple() and
> notices the database has been dropped, at some point calling
> pgstat_gc_entry_refs() to release its reference to the DB's pgstats entry,
> and the worker decides to exit with a fatal error. But while exiting, it
> calls pgstat_report_stat(), which calls pgstat_update_dbstats() for the
> dropped DB, which calls pgstat_prep_database_pending(), which calls
> pgstat_prep_pending_entry(), which calls pgstat_get_entry_ref() with create
> == true, which calls pgstat_reinit_entry() against the DB's pgstats entry.
> This sets dropped to false on that entry. Finally, the autovacuum worker
> exits.

Hm. It's pretty nasty that we end up in the proc_exit() hooks with
MyDatabaseId set, even though we aren't actually validly connected to that
database.

ISTM that at the very least we should make "Recheck pg_database to make sure
the target database hasn't gone away." in InitPostgres() unset
MyDatabaseId, MyProc->databaseId.

We unfortunately can't just defer setting up MyProc->databaseId, I
think. Otherwise a concurrent vacuum could decide to remove database contents
we think we still can see.

If we had something like that it should be fairly easy to skip relevant parts
of pgstat_report_stat(), based on MyDatabaseId being invalid.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Peter Geoghegan 2023-06-14 21:38:30 Re: BUG #17975: Nested Loop Index Scan returning wrong result
Previous Message Andres Freund 2023-06-14 21:21:27 Re: BUG #17975: Nested Loop Index Scan returning wrong result