RE: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

From: Floris Van Nee <florisvannee(at)Optiver(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "exclusion(at)gmail(dot)com" <exclusion(at)gmail(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure
Date: 2024-06-02 21:31:56
Message-ID: dd4c68beba5c40c5a7946569d450ec59@Optiver.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

>
> I think the problem might actually be much simpler - when
> pgstat_drop_entry() can't free the entry, we don't call
> pgstat_request_entry_refs_gc(), in contrast to
> pgstat_drop_database_and_contents(), pgstat_drop_all_entries().
> Because of that, entry refs to dropped stats entries can survive an effectively
> unbound amount of time. If pgstat_request_entry_refs_gc() had been
> called, the local entry would have been released, before we reused the slot's
> stats entry. There's no race around the entry being dropped concurrently
> with acquiring an entry references, as external locking needs to prevent an
> entry from being dropped while new references are being established.

This sounds like a reasonable root cause for this thread.

> I can't explain why pgstat_drop_entry() doesn't call
> pgstat_request_entry_refs_gc(). It seems fairly obvious that it ought to?

It looks like pgstat_drop_entry expects callers to take care of this. All callers in pgstat_xact.c (AtEOXact_PgStat_DroppedStats for example) are dropping stats for several relations at once, and then at the end they call pgstat_request_entry_refs_gc once if needed. I'm guessing this was an optimization.

However, pgstat_drop_replslot didn't get that memo. Attached patch adds a call to gc in pgstat_drop_replslot to bring it in line with usage in pgstat_xact.c. On my setup this fixes the issue that occurs on this thread.

>
> I think there would be a leftover edge-case, namely that
> pgstat_gc_entry_refs() doesn't gc entries with pending data. But we could
> just do that - and it should be correct, afaict. I wonder if the reason we didn't
> is that we introduced PgStat_KindInfo.delete_pending_cb later?

Can you clarify why this would be safe to do? Looking at commit history, it looks like they were added in the same commit 5891c7a8e which introduced the shared memory stats. I can reason why it *should* be safe, but I can't find a discussion around why every single function except pgstat_drop_entry uses discard_pending=false if it would be safe to pass true everywhere. It seems strange that it'd have been chosen without reason.

This does mean it's unfortunately unrelated to the issue I reported here https://www.postgresql.org/message-id/flat/20240505223546.6yvjzgqifuoiii3e%40awork3.anarazel.de#1dfe3ecf6a75ace833444bdc0d268f4a , because the issue in the current thread is fixed by changing a replication-slot call (which is not used in the database for which I reported it).

-Floris

Attachment Content-Type Size
0001-Notify-stats-gc-when-dropping-logical-replication-sl.patch application/octet-stream 2.0 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Sjors Gielen 2024-06-02 21:36:06 Re: BUG #18365: Inconsistent cost function between materialized and non-materialized CTE
Previous Message Mor Lehr 2024-06-02 15:42:38 Re: Missing semicolumn in anonymous plpgsql block does not raise syntax error