Re: emergency outage requiring database restart

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2016-10-20 19:05:12
Message-ID: CAHyXU0zcG4pAPpRf+2UA7dkV8Tp_VSzhRyFTmq7=M019N0=t2A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 20, 2016 at 1:52 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Wed, Oct 19, 2016 at 2:39 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Wed, Oct 19, 2016 at 9:56 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>> On Wed, Oct 19, 2016 at 08:54:48AM -0500, Merlin Moncure wrote:
>>>> > Yeah. Believe me -- I know the drill. Most or all the damage seemed
>>>> > to be to the system catalogs with at least two critical tables dropped
>>>> > or inaccessible in some fashion. A lot of the OIDs seemed to be
>>>> > pointing at the wrong thing. Couple more datapoints here.
>>>> >
>>>> > *) This database is OLTP, doing ~ 20 tps avg (but very bursty)
>>>> > *) Another database on the same cluster was not impacted. However
>>>> > it's more olap style and may not have been written to during the
>>>> > outage
>>>> >
>>>> > Now, this infrastructure running this system is running maybe 100ish
>>>> > postgres clusters and maybe 1000ish sql server instances with
>>>> > approximately zero unexplained data corruption issues in the 5 years
>>>> > I've been here. Having said that, this definitely smells and feels
>>>> > like something on the infrastructure side. I'll follow up if I have
>>>> > any useful info.
>>>>
>>>> After a thorough investigation I now have credible evidence the source
>>>> of the damage did not originate from the database itself.
>>>> Specifically, this database is mounted on the same volume as the
>>>> operating system (I know, I know) and something non database driven
>>>> sucked up disk space very rapidly and exhausted the volume -- fast
>>>> enough that sar didn't pick it up. Oh well :-) -- thanks for the help
>>>
>>> However, disk space exhaustion should not lead to corruption unless the
>>> underlying layers lied in some way.
>>
>> I agree -- however I'm sufficiently separated from the things doing
>> the things that I can't verify that in any real way. In the meantime
>> I'm going to take standard precautions (enable checksums/dedicated
>> volume/replication). Low disk space also does not explain the bizarre
>> outage I had last Friday.
>
> ok, data corruption struck again. This time disk space is ruled out,
> and access to the database is completely denied:
> postgres=# \c castaging
> WARNING: leaking still-referenced relcache entry for
> "pg_index_indexrelid_index"

Single-user mode dumps core :(

bash-4.1$ postgres --single -D /var/lib/pgsql/9.5/data castaging
LOG: 00000: could not change directory to "/root": Permission denied
LOCATION: resolve_symlinks, exec.c:293
Segmentation fault (core dumped)

Core was generated by `postgres --single -D /var/lib/pgsql/9.5/data castaging'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000797d6f in ?? ()
Missing separate debuginfos, use: debuginfo-install postgresql95-server-9.5.2-1PGDG.rhel6.x86_64
(gdb) bt
#0 0x0000000000797d6f in ?? ()
#1 0x000000000079acf1 in RelationCacheInitializePhase3 ()
#2 0x00000000007b35c5 in InitPostgres ()
#3 0x00000000006b9b53 in PostgresMain ()
#4 0x00000000005f30fb in main ()
(gdb)

merlin
