From: Greg Sabino Mullane <greg(at)endpoint(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Backends stalled in 'startup' state: index corruption
Date: 2012-05-24 21:14:53
Message-ID: 20120524211453.GC5533@tinybird.home
Lists: pgsql-hackers
On Thu, May 24, 2012 at 03:54:54PM -0400, Tom Lane wrote:
> Did you check I/O activity? I looked again at Jeff Frost's report and
> now think that what he saw was probably a lot of seqscans on bloated
> system catalogs, cf
> http://archives.postgresql.org/message-id/28484.1337887297@sss.pgh.pa.us
Thank you for the response. Yes, we did look at I/O, but nothing unusual
was seen that would explain the load. If it happens again I'll see if the
system catalogs are getting a lot of seqscans.
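If it does come back, something along these lines should show whether the
catalogs are being hammered by seqscans (just a sketch; the pg_stat counters
are cumulative since the last stats reset, so I'd compare two samples taken
a little while apart):

-- Sketch: cumulative seqscan activity on the system catalogs
SELECT relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_all_tables
WHERE schemaname = 'pg_catalog'
ORDER BY seq_scan DESC
LIMIT 10;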
> That's fairly interesting, but if it was a bloat situation then it
> would've been the VAC FULL that fixed it rather than the REINDEX.
> Did you happen to save the VERBOSE output? It'd be really useful to
> know whether there was any major shrinkage of the core catalogs
> (esp. pg_class, pg_attribute).
I did have them in screen, but the box at home running that screen session
crashed a few hours ago (after weeks of uptime: Murphy's law). It certainly
could have been the VAC FULL, as we have processes that do both VAC FULL and
REINDEX periodically; I simply assumed, based on the HINT, that it was the
REINDEX that cleared it up. The only snippet I have is:
# vacuum full verbose pg_class;
INFO: vacuuming "pg_catalog.pg_class"
INFO: "pg_class": found 43 removable, 10376 nonremovable row versions in 518 pages
DETAIL: 6078 dead row versions cannot be removed yet.
Nonremovable row versions range from 160 to 628 bytes long.
There were 7367 unused item pointers.
Total free space (including removable row versions) is 157000 bytes.
0 pages are or will become empty, including 0 at the end of the table.
0 pages containing 0 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "pg_class_oid_index" now contains 7712 row versions in 32 pages
DETAIL: 6 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING: index "pg_class_oid_index" contains 7712 row versions, but table contains 9471 row versions
HINT: Rebuild the index with REINDEX.
INFO: index "pg_class_relname_nsp_index" now contains 7712 row versions in 113 pages
DETAIL: 6 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
WARNING: index "pg_class_relname_nsp_index" contains 7712 row versions, but table contains 9471 row versions
HINT: Rebuild the index with REINDEX.
VACUUM
Time: 65.635 ms
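Next time I'll also capture the catalog sizes before and after, which would
answer the shrinkage question directly. Something like this sketch (joining
to pg_namespace just to make sure we hit the pg_catalog copies):

-- Sketch: heap and total size of the core catalogs, run before and after
SELECT c.relname,
       pg_size_pretty(pg_relation_size(c.oid)) AS heap_size,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'pg_catalog'
  AND c.relname IN ('pg_class', 'pg_attribute');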
Actually, since we have full logging, those warnings should still be
there... (checks) ... yep. Hints elided, the most recent ones were:
WARNING: index "pg_class_oid_index" contains 14877 row versions,
but table contains 66648 row versions
WARNING: index "pg_class_relname_nsp_index" contains 14877 row versions,
but table contains 66648 row versions
WARNING: index "pg_class_oid_index" contains 60943 row versions,
but table contains 112714 row versions
WARNING: index "pg_class_relname_nsp_index" contains 60943 row versions,
but table contains 112714 row versions
Looks like there are some more going back a ways; I'll gather those up
and send them offlist after this.
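In the meantime, a crude way to spot the mismatch without waiting for vacuum
to complain might be to compare an index-driven count against a seqscan count
(this assumes the planner really does take pg_class_oid_index once seqscans
are discouraged; the two numbers ought to agree):

-- Sketch: rough index-vs-heap consistency spot-check
SET enable_seqscan = off;
SELECT count(*) FROM pg_class WHERE oid >= 0;   -- should use pg_class_oid_index
RESET enable_seqscan;
SELECT count(*) FROM pg_class WHERE oid >= 0;   -- plain seqscan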
> I think there are probably two independent issues here. The missing
> index entries are clearly bad but it's not clear that they had anything
> to do with the startup stall. There are a couple of fixes in recent
> 8.3.x releases that might possibly explain the index corruption,
> especially if you're in the habit of reindexing the system catalogs
> frequently.
Yes, we are in that habit. I hope they are not independent: I'd rather
have one problem to worry about than two. :) I'll certainly report if I
see either problem pop up again.
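For reference, the periodic catalog maintenance amounts to roughly the
following (a simplified sketch, not the actual script, which covers more
catalogs than just these two):

-- Simplified sketch of the periodic catalog maintenance
VACUUM FULL VERBOSE pg_class;
VACUUM FULL VERBOSE pg_attribute;
REINDEX TABLE pg_catalog.pg_class;
REINDEX TABLE pg_catalog.pg_attribute;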
Oh, almost forgot: reading your reply to the old thread reminded me of
something I saw in one of the straces right as it "woke up" and left
the startup state to do some work. Here's a summary:
12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
12:18:39 semop(4390981, 0x7fff66c4ec10, 1) = 0
(x a gazillion)
...
12:18:40 brk(0x1c0af000) = 0x1c0af000
...(some more semops)...
12:18:40 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ac062c98000
...(handful of semops)...
12:18:40 unlink("base/1554846571/pg_internal.init.11803") = -1 ENOENT (No such file or directory)
12:18:40 open("base/1554846571/pg_internal.init.11803", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 13
12:18:40 fstat(13, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
12:18:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ac062cd9000
12:18:40 write(13, ...
...(normalish looking strace output after this)...
--
Greg Sabino Mullane greg(at)endpoint(dot)com
End Point Corporation
PGP Key: 0x14964AC8