From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Floris Van Nee <florisvannee(at)optiver(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: error "can only drop stats once" brings down database |
Date: | 2024-05-05 16:09:15 |
Message-ID: | 20240505160915.6boysum4f34siqct@awork3.anarazel.de |
Lists: | pgsql-bugs |
Hi,
On 2024-05-03 18:10:05 +0000, Floris Van Nee wrote:
> > Floris Van Nee <florisvannee(at)Optiver(dot)com> writes:
> > > Hi,
> > > On a database we have we've recently seen a fatal error occur twice. The
> > error happened on two different physical replicas (of the same cluster)
> > during a WAL redo action in the recovery process. They're running Postgres
> > 15.5.
> >
> > > Occurrence 1:
> > > 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04
> > CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at
> > A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01;
> > ...
> >
> > Hmm. This must be coming from pgstat_drop_entry_internal.
> > I suspect the correct fix is in pgstat_drop_entry, along the lines of
> >
> > - if (shent)
> > + if (shent && !shent->dropped)
> >
> > but it's not clear to me how the already-dropped case ought to affect the
> > function's bool result.
I don't think that'd be quite right - just ignoring that we're confused about
tracking "stats object" liveness seems likely to hide bugs.

Elsewhere in this thread you suggested adding more details about the error -
let's do that. Something like the attached might already be an improvement?
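To illustrate the idea of a more detailed error (this is not the attached diff, just a self-contained toy sketch), the double-drop check could report which entry was hit. The struct, enum, and function names below are hypothetical stand-ins and are not the real pgstat types or functions; the choice of fields (kind, dboid, objoid, refcount) is an assumption about what would be useful to see.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for a shared stats entry. */
typedef struct ToyStatsEntry
{
    unsigned kind;      /* which kind of stats object this is */
    unsigned dboid;     /* database the object belongs to */
    unsigned objoid;    /* the object itself */
    unsigned refcount;  /* other users still holding a reference */
    bool     dropped;   /* set once the entry has been dropped */
} ToyStatsEntry;

typedef enum
{
    TOY_DROP_FREED,     /* nobody else referenced it, can be freed now */
    TOY_DROP_DEFERRED,  /* marked dropped, freeing waits for other users */
    TOY_DROP_TWICE      /* bug: entry had already been dropped */
} ToyDropResult;

/*
 * Mark an entry as dropped. On a second drop, do not silently ignore it;
 * write a message that identifies the entry so the bug can be tracked down.
 */
ToyDropResult
toy_drop_entry(ToyStatsEntry *ent, char *errbuf, size_t errlen)
{
    if (ent->dropped)
    {
        snprintf(errbuf, errlen,
                 "trying to drop stats entry already dropped: "
                 "kind=%u dboid=%u objoid=%u refcount=%u",
                 ent->kind, ent->dboid, ent->objoid, ent->refcount);
        return TOY_DROP_TWICE;
    }

    ent->dropped = true;
    return (ent->refcount == 0) ? TOY_DROP_FREED : TOY_DROP_DEFERRED;
}
```

Calling toy_drop_entry() twice on the same entry returns TOY_DROP_TWICE the second time and leaves the identifying details in errbuf, which is the kind of information that would help narrow down which object is being dropped twice.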
> Also, how are we getting into a concurrent-drop situation in recovery?
I'd like to know how we get into the situation too. It's perhaps worth noting
that stats can be generated on a standby, albeit not by the replay
process. But locking should prevent active use of the stats entry when it's
being dropped...
> Anyone has further thoughts on this? This still happens occasionally.
Do you have any more details about the workload leading to this issue? Is the
standby used for queries? Given the high values of your oids/relfilenodes,
I assume a lot of relations are being created, dropped, or truncated?
Greetings,
Andres Freund
Attachment | Content-Type | Size |
---|---|---|
pgstat_already_dropped_verbose.diff | text/x-diff | 717 bytes |