From: | daveg <daveg(at)sonic(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: error: could not find pg_class tuple for index 2662 |
Date: | 2011-08-03 11:57:31 |
Message-ID: | 20110803115731.GA14353@sonic.net |
Lists: | pgsql-hackers |
On Mon, Aug 01, 2011 at 01:23:49PM -0400, Tom Lane wrote:
> daveg <daveg(at)sonic(dot)net> writes:
> > On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote:
> >> I think we need to start adding some instrumentation so we can get a
> >> better handle on what's going on in your database. If I were to send
> >> you a source-code patch for the server that adds some more logging
> >> printout when this happens, would you be willing/able to run a patched
> >> build on your machine?
>
> > Yes we can run an instrumented server so long as the instrumentation does
> > not interfere with normal operation. However, scheduling downtime to switch
> > binaries is difficult, and generally needs to happen on a weekend, but
> > sometimes can be expedited. I'll look into that.
>
> OK, attached is a patch against 9.0 branch that will re-scan pg_class
> after a failure of this sort occurs, and log what it sees in the tuple
> header fields for each tuple for the target index. This should give us
> some useful information. It might be worthwhile for you to also log the
> results of
>
> select relname,pg_relation_filenode(oid) from pg_class
> where relname like 'pg_class%';
>
> in your script that does VACUUM FULL, just before and after each time it
> vacuums pg_class. That will help in interpreting the relfilenodes in
> the log output.
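
A minimal sketch of the logging suggested above, assuming the maintenance script can shell out to psql and that it vacuums pg_class with a plain VACUUM FULL statement. The function names (vacuum_steps, run_psql, vacuum_with_filenode_log) and the database name in the usage note are hypothetical; only the SELECT itself comes from the message.

```python
import subprocess
import sys

# The query from the message, unchanged.
FILENODE_QUERY = (
    "select relname,pg_relation_filenode(oid) from pg_class "
    "where relname like 'pg_class%';"
)


def vacuum_steps(table="pg_class"):
    """Return the statements to run, in order, around one VACUUM FULL."""
    return [FILENODE_QUERY, f"VACUUM FULL {table};", FILENODE_QUERY]


def run_psql(dbname, sql):
    """Run one statement through psql and return its stdout.

    Assumes psql is on PATH and can connect without a password prompt.
    """
    result = subprocess.run(
        ["psql", "-X", "-At", "-d", dbname, "-c", sql],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def vacuum_with_filenode_log(dbname, run=run_psql, log=sys.stderr):
    """Log the pg_class relfilenodes just before and after vacuuming it."""
    for sql in vacuum_steps():
        output = run(dbname, sql)
        print(f"[{dbname}] {sql}\n{output}", file=log)
```

Calling vacuum_with_filenode_log("mydb") would write the relfilenodes of pg_class and its indexes to stderr on either side of the vacuum, which can then be lined up against the relfilenodes in the server log.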
We have installed the patch and have encountered the error as usual.
However, there is no additional output from the patch. I'm speculating
that the pg_class scan in ScanPgRelationDetailed() fails to return
tuples somehow.
I have also been trying to trace it further by reading the code, but have not
got any solid hypothesis yet. In the absence of any debugging output I've
been trying to deduce the call tree leading to the original failure. So far
it looks like this:
RelationReloadIndexInfo(Relation)
    // Relation is 2662 and !rd_isvalid
    pg_class_tuple = ScanPgRelation(2662, indexOK=false) // returns NULL
        pg_class_desc = heap_open(1259, ACC_SHARE)
            r = relation_open(1259, ACC_SHARE) // locks oid, ensures RelationIsValid(r)
                r = RelationIdGetRelation(1259)
                    r = RelationIdCacheLookup(1259) // assume success
                    if !rd_isvalid:
                        RelationClearRelation(r, true)
                            RelationInitPhysicalAddr(r) // r is pg_class relcache
-dg
--
David Gould daveg(at)sonic(dot)net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.