Re: emergency outage requiring database restart

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2016-10-18 13:45:55
Message-ID: CAHyXU0wLgMvD_KVJyfZhACBpkfDbPEawkqbx2EObYxMt2O=kMA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 17, 2016 at 2:04 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Merlin Moncure wrote:
>
>> castaging=# CREATE OR REPLACE VIEW vw_ApartmentSample AS
>> castaging-# SELECT ...
>> ERROR: 42809: "pg_cast_oid_index" is an index
>> LINE 11: FROM ApartmentSample s
>> ^
>> LOCATION: heap_openrv_extended, heapam.c:1304
>>
>> should I be restoring from backups?
>
> It's pretty clear to me that you've got catalog corruption here. You
> can try to fix things manually as they emerge, but that sounds like a
> fool's errand.

Yeah. Believe me -- I know the drill. Most or all of the damage seemed
to be to the system catalogs, with at least two critical tables dropped
or inaccessible in some fashion. A lot of the OIDs seemed to be
pointing at the wrong thing. A couple more data points:

*) This database is OLTP, doing ~20 tps on average (but very bursty)
*) Another database on the same cluster was not impacted. However,
it's more OLAP style and may not have been written to during the
outage

Now, the infrastructure hosting this system runs maybe 100ish
postgres clusters and maybe 1000ish sql server instances, with
approximately zero unexplained data corruption issues in the 5 years
I've been here. Having said that, this definitely smells and feels
like something on the infrastructure side. I'll follow up if I have
any useful info.

merlin
