Re: Crash during backend start when low on memory

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Mats Kindahl <mats(at)timescale(dot)com>
Cc: daniel(at)yesql(dot)se, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Crash during backend start when low on memory
Date: 2023-01-13 11:53:52
Message-ID: 20230113115352.sbuv5egpjyoxn52i@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 2023-Jan-13, Mats Kindahl wrote:

> I have an improved patch based on raising an error when malloc fails. I
> used the version of mallocfail that is available at
> https://github.com/ralight/mallocfail which allows you to run the program
> with incrementally failing memory allocations but since the count is around
> 6886 or so when the postmaster starts to allow backend starts and other
> processes are starting in between, this count is not always accurate and
> something "smarter" is needed to get a reliable failure test.

Hmm, but I'm not sure I like what you did to BackendStartup:

@@ -4054,14 +4073,7 @@ BackendStartup(Port *port)
* Create backend data structure. Better before the fork() so we can
* handle failure cleanly.
*/
- bn = (Backend *) malloc(sizeof(Backend));
- if (!bn)
- {
- ereport(LOG,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
- return STATUS_ERROR;
- }
+ CHECKED_CALL(bn = malloc(sizeof(Backend)));

With this code, now we will have postmaster raise ERROR if this malloc
fails, rather than returning STATUS_ERROR as previously. How does
postmaster react to this? I think it won't go very well.

(Stylistically, I'd rather have the test and corresponding reaction at
each call site rather than using a macro to hide it. I especially don't
like the fact that the errdetail you propose would report a line of C
code to the user, rather than an intelligible explanation of what that
code is trying to do.)

> It is quite straightforward to test it by attaching a debugger and setting
> the return value of malloc, but to be able to have some "smarter" failures
> only when, for example, starting a backend it would be necessary to have
> something better integrated with the postgres code. Being able to attach a
> uprobe using bpftrace and override the return value would be the most
> elegant solution, but that is currently not possible, so considering
> whether to add something like the mallocfail above to the regular testing.

Hmm, I'm not sure that this type of testing is something to do on an
ongoing basis.

> Note: there are a bunch of other core dumps that occur during startup
> before the postmaster is ready to accept connections, but I think that is
> less of a problem since it fails during startup before the system is ready
> to accept connections.

That sounds reasonable. The service watcher should restart postmaster
appropriately in that case (be it old SysV init, systemd, your container
management layer, or whatever.)

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Ellos andaban todos desnudos como su madre los parió, y también las mujeres,
aunque no vi más que una, harto moza, y todos los que yo vi eran todos
mancebos, que ninguno vi de edad de más de XXX años" (Cristóbal Colón)

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Alex Richman 2023-01-13 12:15:00 Re: Logical Replica ReorderBuffer Size Accounting Issues
Previous Message wangw.fnst@fujitsu.com 2023-01-13 11:17:42 RE: Logical Replica ReorderBuffer Size Accounting Issues