Re: [GENERAL] postmaster deadlock while logging after syslogger exited

From: David Pacheco <dap(at)joyent(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] postmaster deadlock while logging after syslogger exited
Date: 2017-11-20 18:13:57
Message-ID: CACukRjPp+JicsFFKD-P6ORQhjs_sEw_OFOXXkefWbk0ToNywAg@mail.gmail.com
Lists: pgsql-general

Responding to several points below.

Tom Lane wrote:
> David's report isn't too clear: did the syslogger process actually run
> out of memory and exit of its own volition after an ENOMEM, or did it get
> killed by the dreaded OOM killer? In either case, it's unclear whether
> it was really using an excessive amount of memory. We have not heard
> reports suggesting a memory leak in the syslogger, but maybe there is
> one under unusual circumstances?

I'm running on illumos, where there is no OOM killer.

I expect what happened is that the syslogger process attempted to allocate
memory, failed because the system was low, and explicitly exited. That's
consistent with an exited process, no core file generated, and the "FATAL"
"out of memory" entries in the log I posted.

Of course, that doesn't mean the syslogger was using much memory. It means
only that it attempted to allocate any memory in a context where it could
not handle failing.

Andres Freund wrote:
> I'm somewhat inclined to say that one has to live with this if the
> system is so resource constrainted that processes barely using memory
> get killed.

Again, there's no OOM killer here, so the process has the opportunity to
respond differently. I can understand the database not functioning well when
the system is out of memory, but this half-functioning, half-deadlocked state
is quite disruptive. By comparison, while not ideal, if the syslogger had
raised SIGABRT instead, that would have been significantly better for my use
case. Our process manager would have terminated all of the processes
associated with the database and restarted it.
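
To be concrete about what I mean, here's a sketch of that behavior (my own
illustration, not PostgreSQL source; the xmalloc wrapper is hypothetical):
an allocation failure that can't be handled turns into an abnormal exit that
a service manager can act on, rather than a half-working process.

    #include <stdlib.h>

    /*
     * Hypothetical allocation wrapper -- not PostgreSQL source.  When an
     * allocation fails in a context where the caller cannot handle it,
     * abort() raises SIGABRT, so the process dies with an abnormal wait
     * status (and a core file where core dumps are enabled).  A process
     * manager can treat that as "restart the whole database" instead of
     * being left with a half-functioning instance.
     */
    static void *
    xmalloc(size_t size)
    {
        void *p = malloc(size);

        if (p == NULL)
            abort();
        return p;
    }

    int
    main(void)
    {
        char *p = xmalloc(1024 * 1024);

        p[0] = 'x';
        return 0;
    }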

Tom Lane wrote:
> Hm, so that's another angle David didn't report on: is it possible that
> his workload could have resulted in a very large volume of incomplete
> in-progress log messages?

Yes. I mentioned in my Nov 6 mail that large log messages over a short
period appear to have been a major contributing factor:

> Here are all of the entries in the PostgreSQL log from 23:19:12 until the
> top of the next hour:
>
> https://gist.githubusercontent.com/davepacheco/c5541bb464532075f2da761dd990a457/raw/2ba242055aca2fb374e9118045a830d08c590e0a/gistfile1.txt
...
> There are also many log entries for some very long SQL queries. I'm sure
> that contributed to this problem by filling up the pipe. I was able to
> extract the contents of the pipe while the system was hung, and it was
> more of these giant query strings.

Andres Freund wrote:
> Note that there's plenty of cases where you could run into this even
> without being unable to fork new processes. You'd e.g. could also run
> into this while logging the exit of some other subprocess or such,
> there's enough ereports in postmaster.

This is a critical point. As far as I can tell, all that's necessary for
this deadlock to occur is:

- the syslogger is unable to make forward progress (e.g., because it
  segfaulted)
- enough other processes write to the pipe that it fills up before the
  postmaster can restart the syslogger
- the postmaster attempts to write to the log as part of any of its signal
  handling operations

It seems to me that the same thing could happen if a bug in the syslogger
caused it to segfault around the same time that the postmaster attempted to
start an autovacuum worker, for example (if there was also reasonably high
log traffic).
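
For what it's worth, the underlying primitive is easy to demonstrate outside
of PostgreSQL. This standalone toy program (mine, not PostgreSQL source)
fills a pipe that nothing reads from; once the kernel's pipe buffer is full,
the next write() blocks forever, which is essentially the postmaster's
position once the syslogger has exited and the log pipe is full:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fds[2];
        char buf[1024];
        long total = 0;

        memset(buf, 'x', sizeof (buf));
        if (pipe(fds) != 0) {
            perror("pipe");
            return (1);
        }

        /*
         * Nothing ever reads from fds[0] -- analogous to the syslogger
         * having exited while some process still holds the read end open.
         * After roughly the kernel's pipe buffer size has been written
         * (commonly 64K), write() blocks indefinitely.
         */
        for (;;) {
            ssize_t nwritten = write(fds[1], buf, sizeof (buf));

            if (nwritten < 0) {
                perror("write");
                return (1);
            }
            total += nwritten;
            printf("wrote %ld bytes so far\n", total);
        }
    }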

Thanks,
Dave
