Re: What to do when dynamic shared memory control segment is corrupt

From: Sherrylyn Branchaw <sbranchaw(at)gmail(dot)com>
To: pg(at)bowt(dot)ie
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, andres(at)anarazel(dot)de, pgsql-general(at)postgresql(dot)org
Subject: Re: What to do when dynamic shared memory control segment is corrupt
Date: 2018-06-18 23:50:04
Message-ID: CAB_myF5EaCVsBQ24rb4gLeLSau+Gv0otY9Y6nk5xnpw5LvYv7Q@mail.gmail.com
Lists: pgsql-general

> Hm ... were these installations built with --enable-cassert? If not,
> an abort trap seems pretty odd.

The packages are installed directly from the yum repos for RHEL. I'm not
aware that --enable-cassert is being used, and we're certainly not
installing from source.
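
For what it's worth, this is roughly how I'd double-check on one of these
boxes (the pg_config path follows the PGDG package layout and the version
number is just an example):

    # Look for --enable-cassert among the configure options the binaries
    # were built with, and ask the server itself whether assertions are on.
    /usr/pgsql-9.6/bin/pg_config --configure | grep cassert
    psql -At -c "SHOW debug_assertions;"   # prints "on" only for cassert builds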

> Those "incomplete data" messages are quite unexpected and disturbing.
> I don't know of any mechanism within Postgres proper that would result
> in corruption of the postmaster.pid file that way. (I wondered briefly
> if trying to start a conflicting postmaster would result in such a
> situation, but experimentation here says not.) I'm suspicious that
> this may indicate a bug or unwarranted assumption in whatever scripts
> you use to start/stop the postmaster. Whether that is at all related
> to your crash issue is hard to say, but it bears looking into.

We're using the stock init.d script from the yum repo, but I dug into this
issue, and it looks like we're passing the path to postmaster.pid as the
$pidfile variable in our sysconfig file, meaning the init.d script is
managing the postmaster.pid file, and specifically is overwriting it with a
single line containing just the pid. I'm not sure why it's set up like
this, and I'm thinking we should change it, but it seems harmless and
unrelated to the crash. In particular, manual init.d actions such as stop,
start, restart, and status all work fine.
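
To illustrate (the values below are made up, not our real settings), the
relevant bit of our sysconfig override looks roughly like this, with
$pidfile pointed at postmaster.pid itself instead of a separate file:

    # /etc/sysconfig/pgsql/postgresql-9.6 (illustrative values only)
    PGDATA=/var/lib/pgsql/9.6/data
    pidfile=${PGDATA}/postmaster.pid       # current setup: the init.d script rewrites this file
    # pidfile=/var/run/postgresql-9.6.pid  # probably what we want: a file the script can own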

> No, that looks like fairly typical crash recovery to me: corrupt shared
> memory contents are expected and recovered from after a crash.

That's reassuring. But if it's safe for us to immediately start the server
back up, why didn't Postgres restart it automatically like it did the first
time? I was assuming it was due to the presence of the corrupt memory
segment, since that was the only difference in the logs, although I could
be wrong. Automatic restart would have saved us a great deal of downtime:
in the first case we had total recovery within 30 seconds, while in the
second case we had many minutes of downtime while someone got paged,
troubleshot the issue, and eventually decided to try starting the database
back up.

At any rate, if it's safe, we can write a script to detect this failure
mode and restart automatically, although it would be less error-prone if
Postgres handled the restart itself.
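
Something along these lines is what I have in mind (paths, the service
name, and the log location are placeholders, and it's only a sketch):

    #!/bin/bash
    # Watchdog sketch: if the postmaster is down and the log shows the
    # corrupt-DSM message, attempt one automatic start.
    PGDATA=/var/lib/pgsql/9.6/data
    PGLOG=/var/lib/pgsql/9.6/pgstartup.log
    SERVICE=postgresql-9.6

    if ! /usr/pgsql-9.6/bin/pg_ctl status -D "$PGDATA" > /dev/null 2>&1; then
        if tail -n 200 "$PGLOG" | grep -q "dynamic shared memory control segment is corrupt"; then
            service "$SERVICE" start
        fi
    fi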

> Hm, I supposed that Sherrylyn would've noticed any PANIC entries in
> the log.

No PANICs. The log lines I pasted were the only ones that looked relevant
in the Postgres logs. I can try to dig through the application logs, but I
was planning to wait until the next time this happens, since by then we
should have core dumps working, and that might make things easier.
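
(By "core dumps working" I just mean something like the following; the
exact file names and where the limits need to be set depend on how the init
script starts the postmaster, so treat it as a sketch:)

    # Allow core files in the environment that starts the postmaster, and
    # point the kernel at a writable location. Paths are examples only.
    ulimit -c unlimited
    sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p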

> What extensions are installed, if any?

In the first database, the one without the corrupt memory segment and that
restarted automatically: plpgsql and postgres_fdw.

In the second database, the one where the memory segment got corrupted and
that didn't restart automatically: dblink, hstore, pg_trgm, pgstattuple,
plpgsql, and tablefunc.
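
(Those lists came from the catalog in each database, e.g.:)

    # Database name is a placeholder; pg_extension is per-database.
    psql -d mydb -At -c "SELECT extname, extversion FROM pg_extension ORDER BY extname;"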

I forgot to mention that the queries that got killed were innocuous-looking
SELECTs that completed successfully for me in less than a second when I ran
them manually. In other words, the problem was not reproducible.

Sherrylyn
