| From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
|---|---|
| To: | Greg Stark <stark(at)mit(dot)edu> |
| Cc: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Maciek Sakrejda <m(dot)sakrejda(at)gmail(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org> |
| Subject: | Re: BUG #10432: failed to re-find parent key in index |
| Date: | 2014-06-04 11:35:19 |
| Message-ID: | 20140604113519.GG1220@awork2.anarazel.de |
| Lists: | pgsql-bugs |
Hi,
On 2014-06-04 12:14:27 +0100, Greg Stark wrote:
> Ok, I made some progress. It turns out this was a pre-existing problem
> in the master. They've been getting "failed to re-find parent" errors
> for weeks. Far longer than I have any WAL or backups for.
Ok.
> 1) Failed to re-find parent should perhaps not be FATAL to recovery.
> In fact any index replay error would really be nice not to have to
> crash on.
I think that's not really realistic. We'd need to put in a significant
amount of machinery for this to be workable. Suddenly a crash restart
doesn't guarantee that your indexes are there anymore? Not nice.
> All crashing does is prevent the user from being able to
> bring up their database and REINDEX the btree. This may be another use
> case for the machinery that would protect against corrupt hash indexes
> or user-defined indexes -- if we could mark the index invalid and
> proceed (perhaps ignoring subsequent records for it) that would be
> great.
>
> 2) When we see an abort record we could check for any cleanup actions
> triggered by that transaction and run them right away. I think the
> checkpoints (and maybe hot standby snapshots or vacuum cleanup
> records?) also include information about the oldest xid running, they
> would also let us prune the cleanup actions sooner. That would at
> least find the error sooner. In conjunction with (1) it would also
> mean subsequent restartpoints would be effective instead of
> suppressing restartpoints right to the end of recovery.
Heikki removed restartpoints from 9.4 altogether so most of these are
gone. As all these, even if they were doable, sound far too large for
backpatching, I think it's luckily mostly done.
> 3) The lack of logs around an error during recovery makes it hard to
> decipher what's going on. It would be nice to see "Beginning Xlog
> cleanup (1 incomplete splits to replay)" and when it crashed "Last
> safe point to restart recovery is 324/ABCDEF". As it was it was a
> pretty big mystery why the database crashed, the logs made it appear
> as if it had started up fine. And it was unclear why restarting it
> caused it to replay from the beginning, I thought maybe something was
> wrong with our scripts.
I think this should be fixed by setting up error context stack support
in two places: a) in StartupXLOG() before the rm_cleanup() calls, and
b) in versions before 9.4, inside the individual cleanup routines.
We do all that around redo routines, but, as evidenced here, that's not
always enough.
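To illustrate the error context idea, here is a minimal standalone sketch in C of pushing a context callback around each cleanup call. It is a simplified model, not PostgreSQL source: the struct only mirrors the shape of ErrorContextCallback, and report_error(), run_cleanups(), demo_cleanup_failure(), the int return codes and the message wording are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for an error context callback entry (assumption:
 * here the callback writes into a caller-supplied buffer). */
typedef struct ErrorContextCallback
{
    struct ErrorContextCallback *previous;
    void (*callback)(void *arg, char *buf, size_t buflen);
    void *arg;
} ErrorContextCallback;

static ErrorContextCallback *error_context_stack = NULL;

/* Hypothetical reporter: writes the message, then one CONTEXT line per
 * active callback, innermost first. */
static void report_error(const char *msg, char *buf, size_t buflen)
{
    ErrorContextCallback *cb;

    assert(buf != NULL && buflen > 0);
    snprintf(buf, buflen, "FATAL:  %s\n", msg);
    for (cb = error_context_stack; cb != NULL; cb = cb->previous)
    {
        size_t used = strlen(buf);
        cb->callback(cb->arg, buf + used, buflen - used);
    }
}

/* Context callback naming the resource manager whose cleanup is running. */
static void rm_cleanup_error_callback(void *arg, char *buf, size_t buflen)
{
    snprintf(buf, buflen,
             "CONTEXT:  xlog cleanup for resource manager \"%s\"\n",
             (const char *) arg);
}

/* Hypothetical resource manager entry; the cleanup routine returns 0 on
 * success and -1 on failure instead of raising an error. */
typedef struct RmgrData
{
    const char *rm_name;
    int (*rm_cleanup)(void);
} RmgrData;

static int heap_cleanup_ok(void) { return 0; }
static int btree_cleanup_fails(void) { return -1; }

/* Runs every cleanup routine with a context callback pushed around it. */
int run_cleanups(const RmgrData *rmgrs, int n, char *errbuf, size_t buflen)
{
    int i;

    for (i = 0; i < n; i++)
    {
        ErrorContextCallback errcallback;
        int rc;

        if (rmgrs[i].rm_cleanup == NULL)
            continue;

        /* Push the context before calling into the cleanup routine. */
        errcallback.callback = rm_cleanup_error_callback;
        errcallback.arg = (void *) rmgrs[i].rm_name;
        errcallback.previous = error_context_stack;
        error_context_stack = &errcallback;

        rc = rmgrs[i].rm_cleanup();
        if (rc != 0)
            report_error("failed to re-find parent key in index",
                         errbuf, buflen);

        /* Pop the context again, whether or not the cleanup succeeded. */
        error_context_stack = errcallback.previous;
        if (rc != 0)
            return -1;
    }
    return 0;
}

/* Demo: the second (hypothetical) resource manager fails its cleanup. */
int demo_cleanup_failure(char *errbuf, size_t buflen)
{
    const RmgrData rmgrs[] = {
        {"Heap", heap_cleanup_ok},
        {"Btree", btree_cleanup_fails},
    };

    return run_cleanups(rmgrs, 2, errbuf, buflen);
}
```

Calling demo_cleanup_failure() with a buffer leaves the error message followed by a CONTEXT line naming the failing resource manager, which is the kind of hint that was missing from the logs here. In the real server the callback takes only a void * argument and adds its line via errcontext(), and the error is raised with ereport() rather than returned as a status code.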
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services