Re: Properly handle OOM death?

From: "Peter J(dot) Holzer" <hjp-pgsql(at)hjp(dot)at>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Properly handle OOM death?
Date: 2023-03-13 20:16:29
Message-ID: 20230313201629.5d6nkptfxy3qs5fr@hjp.at
Lists: pgsql-general

On 2023-03-13 09:55:50 -0800, Israel Brewster wrote:
> On Mar 13, 2023, at 9:43 AM, Peter J. Holzer <hjp-pgsql(at)hjp(dot)at> wrote:
> > On 2023-03-13 09:21:18 -0800, Israel Brewster wrote:
> >> I’m running a postgresql 13 database on an Ubuntu 20.04 VM that is a bit more
> >> memory constrained than I would like, such that every week or so the various
> >> processes running on the machine will align badly and the OOM killer will kick
> >> in, killing off postgresql, as per the following journalctl output:
> >>
> >> Mar 12 04:04:23 novarupta systemd[1]: postgresql(at)13-main(dot)service: A process of
> >> this unit has been killed by the OOM killer.
> >> Mar 12 04:04:32 novarupta systemd[1]: postgresql(at)13-main(dot)service: Failed with
> >> result 'oom-kill'.
> >> Mar 12 04:04:32 novarupta systemd[1]: postgresql(at)13-main(dot)service: Consumed 5d
> >> 17h 48min 24.509s CPU time.
> >>
> >> And the service is no longer running.
> >
> > I might be misreading this, but it looks to me that systemd detects that
> > *some* process in the group was killed by the oom killer and stops the
> > service.
> >
> > Can you check which process was actually killed? If it's not the
> > postmaster, setting OOMScoreAdjust is probably useless.
> >
> > (I tried searching the web for the error messages and didn't find
> > anything useful)
>
> Your guess is as good as (if not better than) mine. I can find the PID
> of the killed process in the system log, but without knowing what the
> PID of postmaster and the child processes were prior to the kill, I’m
> not sure that helps much.

The syslog should contain a list of all tasks prior to the kill. For
example, I just provoked an OOM kill on my laptop and the syslog
contains (among lots of others) these lines:

Mar 13 21:00:36 trintignant kernel: [112024.084117] [ 2721] 126 2721 54563 2042 163840 555 -900 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084123] [ 2873] 126 2873 18211 85 114688 594 0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084128] [ 2941] 126 2941 54592 1231 147456 565 0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084134] [ 2942] 126 2942 54563 535 143360 550 0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084139] [ 2943] 126 2943 54563 1243 139264 548 0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084145] [ 2944] 126 2944 54798 561 147456 545 0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084150] [ 2945] 126 2945 54563 215 131072 551 0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084156] [ 2956] 126 2956 18718 506 122880 553 0 postgres
Mar 13 21:00:36 trintignant kernel: [112024.084161] [ 2957] 126 2957 54672 269 139264 546 0 postgres

That's less helpful than it could be since all the postgres processes
are just listed as "postgres" without arguments. However, it is very
likely that the first one is actually the postmaster, because it has the
lowest pid (and the other pids follow closely) and it has an OOM score
adjustment (oom_score_adj, the last number before the name) of -900, as
set in the systemd service file.
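
For anyone who wants to automate that guess, here is a rough Python
sketch. It assumes the kernel's task dump uses the column order
pid, uid, tgid, total_vm, rss, pgtables_bytes, swapents, oom_score_adj,
name (which matches the lines above, but may differ on other kernel
versions). The function names are mine, and the heuristic (lowest
oom_score_adj, then lowest pid) is only the reasoning above written as
code, not anything PostgreSQL or systemd guarantees.

```python
import re

# Assumed column layout after the bracketed pid:
# uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
TASK_RE = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+(?P<total_vm>\d+)\s+"
    r"(?P<rss>\d+)\s+(?P<pgtables_bytes>\d+)\s+(?P<swapents>\d+)\s+"
    r"(?P<oom_score_adj>-?\d+)\s+(?P<name>\S+)\s*$"
)


def parse_tasks(lines):
    """Return one dict per task-dump line; other lines are skipped."""
    tasks = []
    for line in lines:
        m = TASK_RE.search(line)
        if m:
            tasks.append(
                {k: (v if k == "name" else int(v)) for k, v in m.groupdict().items()}
            )
    return tasks


def guess_postmaster(tasks, name="postgres"):
    """Heuristic: the task with the lowest oom_score_adj, then the lowest pid."""
    candidates = [t for t in tasks if t["name"] == name]
    if not candidates:
        return None
    return min(candidates, key=lambda t: (t["oom_score_adj"], t["pid"]))


# Three of the lines quoted above, copied verbatim.
SAMPLE = [
    "Mar 13 21:00:36 trintignant kernel: [112024.084117] [ 2721] 126 2721 54563 2042 163840 555 -900 postgres",
    "Mar 13 21:00:36 trintignant kernel: [112024.084123] [ 2873] 126 2873 18211 85 114688 594 0 postgres",
    "Mar 13 21:00:36 trintignant kernel: [112024.084128] [ 2941] 126 2941 54592 1231 147456 565 0 postgres",
]

if __name__ == "__main__":
    print(guess_postmaster(parse_tasks(SAMPLE)))
```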

So I could compare the PID of the killed process with this list (in my
case the killed process wasn't one of them but a test program which just
allocates lots of memory).
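
That comparison can be scripted roughly as follows. This is only a
sketch: it assumes the kernel reports the victim in a line containing
"Killed process <pid> (<name>)", which is what recent kernels print but
should be checked against your own log. The kill line in the sample is
invented for illustration (made-up pid and name); only the two task
lines are taken from my log above. The function name is hypothetical.

```python
import re

# A task-dump line: bracketed pid, six numeric columns, oom_score_adj, name.
TASK_RE = re.compile(
    r"\[\s*(\d+)\]\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+(-?\d+)\s+(\S+)\s*$"
)
# Assumed format of the kernel's kill message.
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]*)\)")


def killed_among(lines, name="postgres"):
    """Return (killed_pid, killed_name, was_listed_as_name), or None if no kill line."""
    listed = set()
    killed = None
    for line in lines:
        t = TASK_RE.search(line)
        if t:
            if t.group(3) == name:
                listed.add(int(t.group(1)))
            continue
        k = KILLED_RE.search(line)
        if k:
            killed = (int(k.group(1)), k.group(2))
    if killed is None:
        return None
    return killed[0], killed[1], killed[0] in listed


SAMPLE = [
    "Mar 13 21:00:36 trintignant kernel: [112024.084117] [ 2721] 126 2721 54563 2042 163840 555 -900 postgres",
    "Mar 13 21:00:36 trintignant kernel: [112024.084123] [ 2873] 126 2873 18211 85 114688 594 0 postgres",
    # Hypothetical kill line, NOT from a real log:
    "kernel: Out of memory: Killed process 99999 (hypothetical-test)",
]

if __name__ == "__main__":
    print(killed_among(SAMPLE))
```

If the third element of the result is False, the victim was not one of
the listed postgres processes, which was the situation in my test.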

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp(at)hjp(dot)at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
