From: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
---|---|
To: | Peter Geoghegan <pg(at)heroku(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Failure while inserting parent tuple to B-tree is not fun |
Date: | 2014-01-27 18:27:38 |
Message-ID: | 52E6A51A.7030006@vmware.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 01/23/2014 11:36 PM, Peter Geoghegan wrote:
> The first thing I noticed about this patchset is that it completely
> expunges btree_xlog_startup(), btree_xlog_cleanup() and
> btree_safe_restartpoint(). The post-recovery cleanup that previously
> occurred to address both sets of problems (the problem addressed by
> this patch, incomplete page splits, and the problem addressed by the
> dependency patch, incomplete page deletions) now both occur
> opportunistically/lazily. Notably, this means that there are now
> exactly zero entries in the resource manager list with a
> 'restartpoint' callback. I guess we should consider removing it
> entirely for that reason. As you said in the commit message where
> gin_safe_restartpoint was similarly retired (commit 631118fe):
>
> """
> The post-recovery cleanup mechanism was never totally reliable, as insertion
> to the parent could fail e.g because of running out of memory or disk space,
> leaving the tree in an inconsistent state.
> """
>
> I'm of the general impression that post-recovery cleanup is
> questionable. I'm surprised that you didn't mention this commit
> previously. You just mentioned the original analogous work on GiST,
> but this certainly seems to be something you've been thinking about
> *systematically* eliminating for a while.
Yes, that's correct, these b-tree patches are part of a grand plan to
eliminate post-recovery cleanup actions altogether.
> So while post-recovery callbacks no longer exist for any
> rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
> concern the simple management of memory of AM-specific recovery
> contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
> a better abstraction than that, such as a generic recovery memory
> context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
> just calls those callbacks for each resource at exactly the same time
> anyway, just as it initializes resource managers in precisely the same
> manner earlier on. Plus if you look at what those AM-local memory
> management routines do, it all seems very simple.
>
> I think you should bump XLOG_PAGE_MAGIC.
>
> Perhaps you should increase the elevel here:
>
> + elog(DEBUG1, "finishing incomplete split of %u/%u",
> + BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));
>
> Since this is a very rare occurrence that involves some subtle
> interactions, I can't think why you wouldn't want to LOG this for
> forensic purposes.
Hmm. I'm not sure I agree with that line of thinking in general - what
is an admin supposed to do about the message? But there's a precedent in
vacuumlazy.c, which logs a message when it finds all-zero pages in the
heap. So I guess that should be a LOG.
> Why did you remove the local linkage of _bt_walk_left(), given that it
> is called in exactly the same way as before? I guess that that is just
> a vestige of some earlier revision.
Yeah, reverted.
> I think I see some bugs in _bt_moveright(). If you examine
> _bt_finish_split() in detail, you'll see that it doesn't just drop the
> write buffer lock that the caller will have provided (per its
> comments) - it also drops the buffer pin. It calls _bt_insert_parent()
> last, which was previously only called last thing by _bt_insertonpg()
> (some of the time), and _bt_insertonpg() is indeed documented to drop
> both the lock and the pin. And if you look at _bt_moveright() again,
> you'll see that there is a tacit assumption that the buffer pin isn't
> dropped, or at least that "opaque" (the BTPageOpaque BT special page
> area pointer) continues to point to the same page held in the same
> buffer, even though the code doesn't set the "opaque' pointer again
> and it may not point to a pinned buffer or even the appropriate
> buffer. Ditto "page". So "opaque" may point to the wrong thing on
> subsequent iterations - you don't do anything with the return value of
> _bt_getbuf(), unlike all of the existing call sites. I believe you
> should store its return value, and get the held page and the special
> pointer on the page, and assign that to the "opaque" pointer before
> the next iteration (an iteration in which you very probably really do
> make progress not limited to completing a page split, the occurrence
> of which the _bt_moveright() loop gets "stuck on"). You know, do what
> is done in the non-split-handling case. It may not always be the same
> buffer as before, obviously.
Yep, fixed.
> Do you suppose that there are similar problems in other direct callers
> of _bt_finish_split()? I'm thinking in particular of
> _bt_findinsertloc().
Hmm, no, the other callers look OK to me.
> I'm also not sure about the lock escalation that may occur within
> _bt_moveright() for callers with BT_READ access - there is nothing to
> prevent a caller from ending up being left with a write lock where
> before they only had a read lock if they find that
> !P_INCOMPLETE_SPLIT() with the write lock (after lock promotion) and
> conclude that there is no split finishing to be done after all. It
> looks like all callers currently won't care about that, but it seems
> like that should either be prominently documented, or just not occur.
> These interactions are complex enough that we ought to be very
> explicit about where pins are required and dropped, or locks held
> before or after calling.
I haven't looked into this in detail yet, but I admit I don't much like
the _bt_moveright() function signature. It's strange to pass 'access'
and 'forupdate' as separate arguments, and it's not totally clear what
combinations make sense and what they mean. Some kind of refactoring is
probably in order.
Here's a new version, rebased over the latest page-deletion patch,
fix-btree-page-deletion-3.patch
(http://www.postgresql.org/message-id/52E66E84.2050109@vmware.com) I
haven't tested this extensively, but passes "make check"...
- Heikki
Attachment | Content-Type | Size |
---|---|---|
btree-incomplete-split-3.patch | text/x-diff | 45.9 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Fujii Masao | 2014-01-27 18:42:05 | Re: [PATCH] Support for pg_stat_archiver view |
Previous Message | Simon Riggs | 2014-01-27 18:00:54 | Re: A better way than tweaking NTUP_PER_BUCKET |