Re: emergency outage requiring database restart

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Kevin Grittner <kgrittn(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: emergency outage requiring database restart
Date: 2016-10-21 18:58:10
Message-ID: CAHyXU0zzgA6x0jOceWOFxoUqN=riEd6nijetH-ktzzwti5Rz5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 21, 2016 at 1:37 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Fri, Oct 21, 2016 at 8:03 AM, Kevin Grittner <kgrittn(at)gmail(dot)com> wrote:
>> On Tue, Oct 18, 2016 at 8:45 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>
>>> Most or all the damage seemed to be to the system catalogs with
>>> at least two critical tables dropped or inaccessible in some
>>> fashion. A lot of the OIDs seemed to be pointing at the wrong
>>> thing.
>>
>> While the oid in pg_class often matches the filename, that is not
>> true after some operations (like CLUSTER or VACUUM FULL). It is
>> the relfilenode column that is the definitive link to the file.
>
> no such operations happened. In the first instance at least one table
> dropped from the system catalogs. I have a hunch that the heap is
> fine (supported by the size of the database on disk). At this
> precise moment I'm restoring the database to another fileserver in
> order to do some forensic analysis, also in the hopes of getting the
> second database online in order to expedite recovery.
>
> ah -- done. :-) deleting the init file didn't help, but starting up
> single user allowed the start up to gracefully fail with a FATAL cache
> lookup.

OK, I have some good (very- in the specific case of yours truly) news
to report. Doing a filesystem level copy to a test server I was able
to relfilenode swap one of the critical tables over the place of the
refilenode of the stored backup. Not being able know the file to copy
from, I figured out the source node by judging the size and using
'strings' utility. Data recovery for that table at least appears to
be 100%.

For those following along, this simple process is only likely to work
easily if the table contains only system types; no user types, enums,
composites, etc, since those have a unique ID for each data restore.

merlin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2016-10-21 19:02:50 Re: emergency outage requiring database restart
Previous Message Merlin Moncure 2016-10-21 18:37:22 Re: emergency outage requiring database restart