From: | Arjen van der Meijden <acm(at)tweakers(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: Race-condition with failed block-write? |
Date: | 2005-09-13 17:43:06 |
Message-ID: | 43270FAA.20301@tweakers.net |
Lists: | pgsql-bugs |
On 13-9-2005 16:25, Tom Lane wrote:
> Arjen van der Meijden <acm(at)tweakers(dot)net> writes:
>
> It's highly unlikely that that query has anything to do with it, since
> it's not touching anything but system catalogs and not trying to write
> them either.
Indeed, other things trigger it as well.
> The first thing you ought to find out is which table
> 1663/2013826/9975789 is, and look to see if the corrupted LSN value is
> already present on disk in that block.
Well, it's an index, not a table. It was the index:
"pg_class_relname_nsp_index" on pg_class(relname, relnamespace).
Using pg_filedump I extracted the LSN for block 21 and indeed, that was
already 67713428 instead of something below 2E73E53C. It wasn't that
block alone though, here are a few LSN-lines from it:
LSN: logid 41 recoff 0x676f5174 Special 8176 (0x1ff0)
LSN: logid 25 recoff 0x3c6c5504 Special 8176 (0x1ff0)
LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
LSN: logid 41 recoff 0x2ea88190 Special 8176 (0x1ff0)
LSN: logid 1 recoff 0x68e2f660 Special 8176 (0x1ff0)
LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)
LSN: logid 1 recoff 0x68e2f6a4 Special 8176 (0x1ff0)
I tried other files and each one I tried only had LSNs of 0.
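
A minimal sketch of the check done by hand above: parse the LSN lines from the pg_filedump output and list those that lie beyond the flushed WAL position. The helper names (parse_lsn, lsns_beyond) are made up for illustration, and the flushed position's logid of 41 is an assumption, since only the recoff 2E73E53C is mentioned here. An LSN is compared as the pair (logid, recoff), logid first.

```python
import re

# Matches LSN header lines in the form shown above.
LSN_LINE = re.compile(r"LSN:\s+logid\s+(\d+)\s+recoff\s+0x([0-9a-fA-F]+)")


def parse_lsn(line):
    """Return (logid, recoff) as integers, or None if the line has no LSN."""
    match = LSN_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2), 16)


def lsns_beyond(lines, flushed):
    """Return the parsed LSNs that are strictly greater than `flushed`.

    Tuples compare element by element, so logid is compared first and
    recoff only breaks ties.
    """
    parsed = (parse_lsn(line) for line in lines)
    return [lsn for lsn in parsed if lsn is not None and lsn > flushed]


sample = [
    "LSN: logid 41 recoff 0x676f5174 Special 8176 (0x1ff0)",
    "LSN: logid 25 recoff 0x3c6c5504 Special 8176 (0x1ff0)",
    "LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)",
    "LSN: logid 41 recoff 0x2ea88190 Special 8176 (0x1ff0)",
    "LSN: logid 1 recoff 0x68e2f660 Special 8176 (0x1ff0)",
    "LSN: logid 41 recoff 0x2ea8a270 Special 8176 (0x1ff0)",
    "LSN: logid 1 recoff 0x68e2f6a4 Special 8176 (0x1ff0)",
]

# ASSUMPTION: the flushed position is in logid 41; only recoff is known here.
flushed = (41, 0x2E73E53C)

suspicious = lsns_beyond(sample, flushed)
for logid, recoff in suspicious:
    print(f"beyond flushed position: logid {logid} recoff 0x{recoff:08x}")
```

Under that assumption, the blocks with logid 25 and logid 1 are older than the flushed position, and only the logid 41 entries are flagged.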
When trying (\d indexname in psql) to determine to which table that
index belonged I noticed it got the errors again, but for another file
(pg_index this time). And another try (oid2name ...) after that, yet
another file (the pg_class-table). All those files were last changed
somewhere around August 25, so there are no new changes.
On that day I did some active query-tuning, and a few times I issued
immediate shutdowns when the selects took too long.
There were no warnings about broken records afterwards in the log
though, so I don't believe anything got damaged afterwards.
After that I loaded some fresh data from a production-database using
either pg_restore or psql < some-file-from-pg_dump.sql (I don't know
which one anymore). A few days later I shut down that postgres,
installed 8.1-beta and used that (in another directory of course). This
8.0.3 only came back up because of a reboot and hasn't been used since
that reboot.
I guess, during that reloading those system tables got mixed up?
> If it is, then we've probably
> not got much chance of finding out how it got there. If it is *not* on
> disk, but you have a repeatable way of causing this to happen starting
> from a clean postmaster start, then that's pretty interesting --- but
> I don't know any way of figuring it out short of groveling through the
> code with a debugger. If you're not already pretty familiar with the PG
> code, coaching you remotely isn't going to work very well :-(. I'd be
> glad to look into it if you can get me access to the machine though.
Well, I can very probably give you that access. But as you say, finding
out what went wrong is very hard to do.
Best regards,
Arjen van der Meijden