Re: Crash during backend start when low on memory

From: Mats Kindahl <mats(at)timescale(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: daniel(at)yesql(dot)se, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Crash during backend start when low on memory
Date: 2023-01-16 10:09:47
Message-ID: CA+144260BBLj2JQ93CxEbDvULkW9M2GUWvKcVOujEwByALo49g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Fri, Jan 13, 2023 at 12:58 PM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
wrote:

> On 2023-Jan-13, Mats Kindahl wrote:
>
> > I have an improved patch based on raising an error when malloc fails. I
> > used the version of mallocfail that is available at
> > https://github.com/ralight/mallocfail which allows you to run the
> program
> > with incrementally failing memory allocations but since the count is
> around
> > 6886 or so when the postmaster starts to allow backend starts and other
> > processes are starting in between, this count is not always accurate and
> > something "smarter" is needed to get a reliable failure test.
>
> Hmm, but I'm not sure I like what you did to BackendStartup:
>
> @@ -4054,14 +4073,7 @@ BackendStartup(Port *port)
> * Create backend data structure. Better before the fork() so we
> can
> * handle failure cleanly.
> */
> - bn = (Backend *) malloc(sizeof(Backend));
> - if (!bn)
> - {
> - ereport(LOG,
> - (errcode(ERRCODE_OUT_OF_MEMORY),
> - errmsg("out of memory")));
> - return STATUS_ERROR;
> - }
> + CHECKED_CALL(bn = malloc(sizeof(Backend)));
>
> With this code, now we will have postmaster raise ERROR if this malloc
> fails, rather than returning STATUS_ERROR as previously. How does
> postmaster react to this? I think it won't go very well.
>

Well.. this error is caught in the caller using this code:

PG_TRY();
{
port = ConnCreate(ListenSocket[i]);
BackendStartup(port);
}
PG_FINALLY();
{
/*
* We no longer need the open socket or port structure
* in this process, and definitely not if we failed
* spawning a backend.
*/
if (port)
{
if (port->sock != PGINVALID_SOCKET)
StreamClose(port->sock);
ConnFree(port);
}
}
PG_END_TRY();

Unless I am mistaken, this should mean that the error message is printed,
the streams closed and memory free'ed, and PostmasterMain continues as
before. It could just be logged using LOG level, but an ERROR stands out
better and is searchable so it is easy to surface in systems monitoring the
server. (Digging for specific log messages is cumbersome, just printing out
all ERROR messages in the log is easy.)

> (Stylistically, I'd rather have the test and corresponding reaction at
> each call site rather than using a macro to hide it. I especially don't
> like the fact that the errdetail you propose would report a line of C
> code to the user, rather than an intelligible explanation of what that
> code is trying to do.)
>

I agree this is ugly, but since this was rare and the error message is
mostly to understand what went wrong to be able to debug it, I thought this
would be sufficient for that. It is also easier to maintain. Trivial to
replace.

> > It is quite straightforward to test it by attaching a debugger and
> setting
> > the return value of malloc, but to be able to have some "smarter"
> failures
> > only when, for example, starting a backend it would be necessary to have
> > something better integrated with the postgres code. Being able to attach
> a
> > uprobe using bpftrace and override the return value would be the most
> > elegant solution, but that is currently not possible, so considering
> > whether to add something like the mallocfail above to the regular
> testing.
>
> Hmm, I'm not sure that this type of testing is something to do on an
> ongoing basis.
>

Me neither, which is why I didn't spend time on it right now. The argument
for doing it is that it would be good to have some test that can at least
be easily used to check that it behaves correctly on startup even if memory
failures occur.

>
> > Note: there are a bunch of other core dumps that occur during startup
> > before the postmaster is ready to accept connections, but I think that is
> > less of a problem since it fails during startup before the system is
> ready
> > to accept connections.
>
> That sounds reasonable. The service watcher should restart postmaster
> appropriately in that case (be it old SysV init, systemd, your container
> management layer, or whatever.)
>

Yes, that was my thinking as well. Core dumps are ugly, but provide useful
information to figure out where things went wrong.

>
> --
> Álvaro Herrera PostgreSQL Developer —
> https://www.EnterpriseDB.com/
> "Ellos andaban todos desnudos como su madre los parió, y también las
> mujeres,
> aunque no vi más que una, harto moza, y todos los que yo vi eran todos
> mancebos, que ninguno vi de edad de más de XXX años" (Cristóbal Colón)
>

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Alvaro Herrera 2023-01-16 15:14:59 Re: BUG #17698: On SIGTERM, psql terminates, but leaves the statement running
Previous Message Amit Kapila 2023-01-16 09:32:16 Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1