Re: Theory about XLogFlush startup failures

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp>
Cc: Vadim Mikheev <vmikheev(at)sectorbase(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Theory about XLogFlush startup failures
Date: 2002-01-15 02:49:49
Message-ID: 21581.1011062989@sss.pgh.pa.us
Lists: pgsql-hackers

Hiroshi Inoue <Inoue(at)tpf(dot)co(dot)jp> writes:
> BTW doesn't the LSN corruption imply the possibility
> of the corruption of other parts (of e.g. pg_log) ?

Indeed. Not sure what we can do about that.

In the case I examined (Jeff Lu's recent problem), pg_log appeared
perfectly valid up through the end of the page containing the last
transaction ID recorded in pg_control. However, this ID was close to
the end of the page, and WAL entries contained XIDs reaching into the
next page of pg_log. That page contained complete garbage. Even
more interesting, there was about 400K of complete garbage beyond that
page, in pages that Postgres should never have touched at all. (This
seemed like a lot, considering the valid part of pg_log was less than
200K.)

My bet is that the garbaged pages were there before Postgres got to
them. Both normal operation and WAL recovery would've died at the first
attempt to write out the first garbage page, because of its bad LSN.
Also, AFAICT 7.1 and before contained no explicit code to zero a newly
used pg_log page (it relied on the smgr to fill in zeroes when reading
beyond EOF); nor did the pg_log updating code stop to notice whether the
transaction status bits it was about to overwrite looked sane. So there
would've been no notice before trying to write the garbage page back out.
(These last two holes, at least, are plugged in 7.2. But if the OS
gives us back a page of garbage instead of the data we wrote out, how
well can we be expected to survive that?)
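
For illustration only, the three checks discussed above can be sketched in C roughly as follows. This is not the 7.1 or 7.2 source: the page layout (an LSN in the first 8 bytes), the 2-bit status encoding, and every name here (Lsn, can_write_page, init_new_page, set_xact_status) are assumptions made up for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

/* Hypothetical simplified LSN: a byte position in the WAL stream. */
typedef uint64_t Lsn;

/* Assumed layout: the page's LSN lives in its first 8 bytes. */
Lsn page_lsn(const unsigned char *page)
{
    Lsn lsn;
    assert(page != NULL);
    memcpy(&lsn, page, sizeof lsn);
    return lsn;
}

void page_set_lsn(unsigned char *page, Lsn lsn)
{
    assert(page != NULL);
    memcpy(page, &lsn, sizeof lsn);
}

/*
 * WAL-before-data rule: a page may be written only once WAL has been
 * flushed through the page's LSN.  A garbage LSN beyond the end of WAL
 * can never be satisfied, so the write attempt fails.
 */
bool can_write_page(const unsigned char *page, Lsn wal_end)
{
    return page_lsn(page) <= wal_end;
}

/* Explicitly zero a newly used page instead of trusting reads past EOF. */
void init_new_page(unsigned char *page)
{
    memset(page, 0, PAGE_SIZE);
}

bool page_is_all_zero(const unsigned char *page)
{
    for (size_t i = 0; i < PAGE_SIZE; i++)
        if (page[i] != 0)
            return false;
    return true;
}

/* Hypothetical 2-bit status codes, four transactions per byte. */
#define XACT_IN_PROGRESS 0u
#define XACT_COMMITTED   1u

/*
 * Refuse to overwrite status bits that do not look sane: a slot must
 * still read "in progress" (all zero) before a final status is stored.
 */
bool set_xact_status(unsigned char *bits, unsigned slot, unsigned status)
{
    unsigned shift = (slot % 4) * 2;
    unsigned char *byte = &bits[slot / 4];

    if (((*byte >> shift) & 3u) != XACT_IN_PROGRESS)
        return false;           /* slot already holds something else */
    *byte = (unsigned char) (*byte | ((status & 3u) << shift));
    return true;
}
```

In this sketch, a page filled with arbitrary nonzero bytes fails can_write_page() for any realistic end-of-WAL value, while a page that has just gone through init_new_page() passes it and also accepts status updates.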

Since Jeff was running on a Cygwin/Win2K setup, I'm quite happy to write
this off as an OS hiccup, unless someone can think of a mechanism inside
Postgres that could have provoked it.

regards, tom lane
