Quick Links

Re: Race condition in b-tree page deletion

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Kevin Grittner <kgrittn(at)ymail(dot)com>
Cc:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Race condition in b-tree page deletion
Date:	2014-01-28 01:55:46
Message-ID:	CAM3SWZRH2hXWBC+Q5DaYvSaRHV+u3pPfY=kNGpBBkPY+xwqiPA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sat, Jan 18, 2014 at 11:45 AM, Kevin Grittner <kgrittn(at)ymail(dot)com> wrote:
> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
>
>> Here's a patch implementing this scheme.

I thought I'd weigh in here too, since this is closely tied to the
page split patch, which is dependent on this patch.

> The basic approach is sound. The papers on which we based our
> btree implementation did not discuss entry deletion, and our
> current approach introduces new possible states into the on-disk
> index structure which are not covered by the papers and are not
> always handled correctly by the current code.

Lehman and Yao don't discuss deletion. But V. Lanin and D. Shasha. do,
in the paper "A Symmetric Concurrent B-Tree Algorithm"[1], as the
README notes - we currently follow a "simplified" version of that.
Apparently a number of people proposed solutions to address the
omission of L&Y, as one survey of those approaches notes [2]. One of
the approaches appearing in that survey is L&S.

The classic L&Y algorithm only has right-links, and not left-links.
The same applies to L&S. But I'd describe this patch as bringing what
we do closer to L&S, which is more or less a refinement of B-Link
trees (that is, L&Y B-Trees). L&S does have a concept of an outlink,
crucial to their "symmetric deletion approach", which is something
that we don't need, because we already have left-links (we have them
primarily to support backwards index scans). L&S says "In general, the
link technique calls for processes that change sections of a data
structure in a way that would cause other processes to fail to provide
a link to another section of the data structure where recovery is
possible". So loosely speaking we're doing a "conceptual
_bt_moveleft()" for deletion as the inverse of a literal _bt_moveright
for insertion.

L&S says "a deletion needs no more than two write locks at a time
during its ascent" (the descent is just a garden variety _bt_search()
insertion scankey descent). However, we currently may obtain as many
as 4 write locks for "page deletion". As the README notes:

"""
To delete an empty page, we acquire write lock on its left sibling (if
any), the target page itself, the right sibling (there must be one), and
the parent page, in that order.
"""

We obtain 2 write locks in the first phase (on target and parent). In
the second phase, we do what is described above: lock 4 pages in that
order. However, the reason for this difference from the L&S paper is
made clear in "2.5 Freeing Empty nodes", which kind of cops out of
addressing how to *unlink* empty/half-dead pages, with a couple of
brief descriptions of approaches that others have come up with that
are not at all applicable to us.

For that reason, might I suggest that you change this in the README:

+ Page Deletion
+ -------------

I suggest that it should read "Page Deletion and Unlinking" to
emphasize the distinction.

> The general feeling is that this patch is not suitable for
> back-patching. I agree with this, but this is a bug which could
> lead to index corruption which exists in all supported branches.
> If nobody can suggest an alternative which we can back-patch, we
> might want to reconsider this after this has been in production
> releases for several months.

That was what I thought. Let's revisit the question of back-patching
this when 9.4 has been released for 6 months or so.

> Other than that, I have not found any problems.

Incidentally, during my research of these techniques, I discovered
that Johnson and Shasha argue that "for most concurrent B-tree
applications, merge-at-empty produces significantly lower
restructuring rates, and only a slightly lower space utilization, than
merge-at-half". This is covered in the paper "B-trees with inserts,
deletes, and modifies: Why Free-at-empty is Better than Merge-at-half"
[3]. I was pleased to learn that there was a sound, well regarded
theoretical basis for that behavior, even if the worst-case bloat can
be unpleasant. Though having said that, that paper doesn't weigh what
we do in the event of non-HOT updates, and that could change the
equation. That isn't an argument against merge-at-empty; it's an
argument against the current handling of non-HOT updates.

Currently, the comments above _bt_doinsert() say at one point:

* This routine is called by the public interface routines, btbuild
* and btinsert.

This is obsolete; since commit 89bda95d, only the insert public
interface routine calls _bt_doinsert(). Perhaps fix this in passing.

[1] L&S: https://archive.org/stream/symmetricconcurr00lani/symmetricconcurr00lani_djvu.txt

[2] http://www.dtic.mil/get-tr-doc/pdf?AD=ADA232287&Location=U2&doc=GetTRDoc.pdf
(Chapter 3, "The Coherent Shared Memory Algorithm")

[3] http://cs.nyu.edu/shasha/papers/bjournal.ps
--
Peter Geoghegan

In response to

Re: Race condition in b-tree page deletion at 2014-01-18 19:45:23 from Kevin Grittner

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	KONDO Mitsumasa	2014-01-28 02:04:36	Re: pgsql: Keep pg_stat_statements' query texts in a file, not in shared me
Previous Message	Peter Geoghegan	2014-01-28 01:35:51	Re: pgsql: Keep pg_stat_statements' query texts in a file, not in shared me