Re: BUG #10432: failed to re-find parent key in index

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Greg Stark <stark(at)mit(dot)edu>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject:	Re: BUG #10432: failed to re-find parent key in index
Date:	2014-06-04 11:35:19
Message-ID:	20140604113519.GG1220@awork2.anarazel.de
Lists:	pgsql-bugs

Hi,

On 2014-06-04 12:14:27 +0100, Greg Stark wrote:
> Ok, I made some progress. It turns out this was a pre-existing problem
> in the master. They've been getting "failed to re-find parent" errors
> for weeks. Far longer than I have any WAL or backups for.

Ok.

> 1) Failed to re-find parent should perhaps not be FATAL to recovery.
> In fact any index replay error would really be nice not to have to
> crash on.

I think that's not really realistic. We'd need to add a significant
amount of machinery for this to be workable. Suddenly a crash restart
doesn't guarantee that your indexes are there anymore? Not nice.

> All crashing does is prevent the user from being able to
> bring up their database and REINDEX the btree. This may be another use
> case for the machinery that would protect against corrupt hash indexes
> or user-defined indexes -- if we could mark the index invalid and
> proceed (perhaps ignoring subsequent records for it) that would be
> great.
>
> 2) When we see an abort record we could check for any cleanup actions
> triggered by that transaction and run them right away. I think the
> checkpoints (and maybe hot standby snapshots or vacuum cleanup
> records?) also include information about the oldest xid running, they
> would also let us prune the cleanup actions sooner. That would at
> least find the error sooner. In conjunction with (1) it would also
> mean subsequent restartpoints would be effective instead of
> suppressing restartpoints right to the end of recovery.

Heikki removed restartpoints from 9.4 altogether so most of these are
gone. As all of these, even if they were doable, sound far too large for
backpatching, I think it's luckily mostly done.

> 3) The lack of logs around an error during recovery makes it hard to
> decipher what's going on. It would be nice to see "Beginning Xlog
> cleanup (1 incomplete splits to replay)" and when it crashed "Last
> safe point to restart recovery is 324/ABCDEF". As it was it was a
> pretty big mystery why the database crashed, the logs made it appear
> as if it had started up fine. And it was unclear why restarting it
> caused it to replay from the beginning, I thought maybe something was
> wrong with our scripts.

I think this should be fixed by setting up error context stack support
in two places: a) in StartupXLOG() before the rm_cleanup() calls, and
b) in versions before 9.4, inside the individual cleanup routines.
We do all that around redo routines, but, as evidenced here, that's not
always enough.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
