Re: Re: How to solve the problem of one backend process crashing and causing other processes to restart?

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: yuansong <yyuansong(at)126(dot)com>
Cc: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Re: How to solve the problem of one backend process crashing and causing other processes to restart?
Date: 2023-11-14 02:03:23
Message-ID: CAHyXU0w-7r6e1rL4wqz5=z2Jg9=YDLBZLA=iHKqdv1jVuKFP4g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 13, 2023 at 3:14 AM yuansong <yyuansong(at)126(dot)com> wrote:

> Enhancing the overall fault tolerance of the entire system for this
> feature is quite important. No one can avoid bugs, and I don't believe that
> this approach is a more advanced one. It might be worth considering adding
> it to the roadmap so that interested parties can conduct relevant research.
>
> The current major issue is that when one process crashes, resetting all
> connections has a significant impact on other connections. Is it possible
> to only disconnect the crashed connection and have the other connections
> simply roll back the current transaction without reconnecting? Perhaps this
> problem can be mitigated through the use of a connection pool.
>
> If we want to implement this feature, would it be sufficient to clean up
> or restore the shared memory and disk changes caused by the crashed
> backend? Besides clearing various known locks, what else needs to be
> changed? Does anyone have any insights? Please help me. Thank you.
>

One thing that's really key to understand about postgres is that there is
a different set of rules regarding what is the database's job to solve vs
what is left to supporting libraries and frameworks. It isn't that hard to
wait and retry a query in most applications, and it is up to you to do
that. There are also various connection poolers that might implement retry
logic, and not having to work through those concerns in the server keeps
the code lean and has other benefits. While postgres might implement things
like a built-in connection pooler, 'o_direct' type memory management, and
things like that, there are long-term costs to doing them.
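
To make the "wait and retry" point concrete, here is a minimal sketch of
what that can look like on the application side. Everything named here is
an assumption for illustration: ConnectionLost is a hypothetical stand-in
for whatever error your driver raises when the server drops the connection,
and connect/work are callables you supply. It also assumes the work is safe
to run again from the start, i.e. the transaction did not commit before the
connection went away.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


def run_with_retry(connect, work, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run work(conn) on a fresh connection, retrying if the connection is lost.

    connect: callable returning a new connection (assumed, driver-specific).
    work:    callable taking that connection and returning a result.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # Reconnect on every attempt; the old session is gone after a reset.
            conn = connect()
            return work(conn)
        except ConnectionLost as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off exponentially so the server has time to finish
                # crash recovery before we try again.
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed; surface the last error to the caller.
    raise last_error
```

A pooler that implements retry logic is doing roughly the same thing one
layer down, which is why it doesn't have to live in the server.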

There's another side to this. Suppose I had to choose between a
hypothetical postgres that had some kind of process-local crash recovery
and the current implementation. I might still choose the current
implementation because, in general, crashes are good, and the full reset
has a much better chance of clearing the underlying issue that caused the
problem than managing its symptoms does.

merlin
