Re: hung backends stuck in spinlock heavy endless loop

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hung backends stuck in spinlock heavy endless loop
Date: 2015-01-15 08:40:27
Message-ID: CAM3SWZR98iYf7jpOwoKRba0sf-cB5JOCNT3pg5YT5bjgBF3Sqg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 14, 2015 at 8:50 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> I am mistaken on one detail here - blocks 2 and 9 are actually fully
> identical. I still have no idea why, though.

So, I've looked at it in more detail and it appears that the page of
block 2 split at some point, thereby creating a new page (the block 9
page). There is a sane downlink in the root page for the new rightlink
page. The root page looks totally sane, as does every other page - as
I said, the problem is only that block 9 is spuriously identical to
block 2. So the (correct) downlink in the root page, to block 9, is
the same as the (incorrect) high key value in block 9 - Oid value
69924. To be clear: AFAICT everything is perfect except block 9, which
is bizarrely identical to block 2.

Now, since the sane page downlink located in the root (like every
downlink, a lower bound on items in its child) is actually a copy of
the high key on the page that is the child's left link (that is to
say, it comes from the original target of a page split - it shares the
target's high key value, Oid value 69924), there may have never been
sane data in block 9, even though its downlink is sane (so maybe the
page split patch is implicated). But it's hard to see how that could
be true. The relevant code wasn't really what was changed about page
splits in 9.4 anyway (plus this wasn't a non-leaf split, since there
aren't enough pages for those to be a factor). There just isn't that
many items on page 2 (or its bizarre identical twin, page 9), so a
recent split seems unlikely. And, the target and new right page are
locked together throughout both the split and down link insertion
(even though there are two atomic operations/WAL inserts). So to
reiterate, a close by page split that explains the problem seems
unlikely.

I'm going to focus on the page deletion patch for the time being.
Merlin - it would be great if you could revert all the page split
commits (which came after the page deletion fix). All the follow-up
page split commits [1] were fairly straightforward bugs with recovery,
so it should be easy enough to totally remove the page split stuff
from 9.4 for the purposes of isolating the bug.

[1] http://www.postgresql.org/message-id/CAM3SWZSpJ6M9HfHksjUiUof30aUWXyyb56FJbW1_doGQKbEO+g@mail.gmail.com
--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Sameer Kumar 2015-01-15 08:43:10 Re: Check that streaming replica received all data after master shutdown
Previous Message Michael Paquier 2015-01-15 08:08:43 Re: advance local xmin more aggressively