From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | "Pete St(dot) Onge" <pete(at)seul(dot)org> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Hosed PostGreSQL Installation |
Date: | 2002-09-21 15:13:44 |
Message-ID: | 8533.1032621224@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
"Pete St. Onge" <pete(at)seul(dot)org> writes:
> As a result of some disk errors on another drive, an admin in our group
> brought down the server hosting our pgsql databases with a kill -KILL
> after having gone to runlevel 1 and finding the postmaster process still
> running. No surprise, our installation was hosed in the process.
That should not have been a catastrophic mistake in any version >= 7.1.
I suspect you had disk problems or other problems.
> Klamath suggested that I run pg_controldata:
> ...
> Latest checkpoint's StartUpID: 21
> Latest checkpoint's NextXID: 615
> Latest checkpoint's NextOID: 18720
These numbers are suspiciously small for an installation that's been
in production awhile. I suspect you have not told us the whole story;
in particular I suspect you already tried "pg_resetxlog -f", which was
probably not a good idea.
> If I look into the pg_xlog directory, I see this:
> -rw------- 1 postgres postgres 16777216 Sep 20 23:13 0000000000000002
> -rw------- 1 postgres postgres 16777216 Sep 19 22:09 000000020000007E
Yeah, your xlog positions should be a great deal higher than they are,
if segment 2/7E was previously in use.
It is likely that you can recover (with some uncertainty about integrity
of recent transactions) if you proceed as follows:
1. Get contrib/pg_resetxlog/pg_resetxlog.c from the 7.2.2 release (you
can't use 7.1's pg_resetxlog because it doesn't offer the switches
you'll need). Compile it *against your 7.1 headers*. It should compile
except you'll have to remove this change:
***************
*** 853,858 ****
--- 394,403 ----
page->xlp_magic = XLOG_PAGE_MAGIC;
page->xlp_info = 0;
page->xlp_sui = ControlFile.checkPointCopy.ThisStartUpID;
+ page->xlp_pageaddr.xlogid =
+ ControlFile.checkPointCopy.redo.xlogid;
+ page->xlp_pageaddr.xrecoff =
+ ControlFile.checkPointCopy.redo.xrecoff - SizeOfXLogPHD;
record = (XLogRecord *) ((char *) page + SizeOfXLogPHD);
record->xl_prev.xlogid = 0;
record->xl_prev.xrecoff = 0;
Test it using its -n switch to make sure it reports sane values.
2. Run the hacked-up pg_resetxlog like this:
pg_resetxlog -l 2 127 -x 1000000000 $PGDATA
(the -l position is next beyond what we see in pg_xlog, the 1-billion
XID is just a guess at something past where you were. Actually, can
you give us the size of pg_log, ie, $PGDATA/global/1269? That would
allow computing a correct next-XID to use. Figure 4 XIDs per byte,
thus if pg_log is 1 million bytes you need -x at least 4 million.)
3. The postmaster should start now.
4. *Immediately* attempt to do a pg_dumpall. Do not pass GO, do not
collect $200, do not let in any interactive clients until you've done
it. (I'd suggest tweaking pg_hba.conf to disable all logins but your
own.)
5. If pg_dumpall succeeds and produces sane-looking output, then you've
survived. initdb, reload the dump file, re-open for business, go have
a beer. (Recommended: install 7.2.2 and reload into that, not 7.1.*.)
You will probably still need to check for partially-applied recent
transactions, but for the most part you should be OK.
6. If pg_dumpall fails then let us know what the symptoms are, and we'll
see if we can figure out a workaround for whatever the corruption is.
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2002-09-21 16:01:28 | NUMERIC's transcendental functions |
Previous Message | Shridhar Daithankar | 2002-09-21 08:44:26 | Re: Improving speed of copy |