Re: Connections hang indefinitely while taking a LWLockTranche buffer_content lock.

From: Chris Hanks <christopher(dot)m(dot)hanks(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Connections hang indefinitely while taking a LWLockTranche buffer_content lock.
Date: 2017-10-26 23:42:23
Message-ID: CAK7KUdA96Oyrkf66Hh97-Y37j0y9d-0ZVGZ5-rSvNiMOWkntNQ@mail.gmail.com
Lists: pgsql-bugs

I wound up working around the issue by forking the database and
removing that GIN index, and things have been fine with the new
instance for the past two days.

I previously had two Postgres instances with hung processes, one 9.6.1
and one 9.6.5. For work reasons I destroyed the 9.6.5 (Heroku support
was having issues with its container anyway) but I've kept the 9.6.1.
I'm happy to give you permission to access it; just let me know who to
talk to, or have them reach out to me. I have an open issue in their
tracker under my work email, chris(at)pathgather(dot)com(dot)

Thanks again!

On Thu, Oct 26, 2017 at 7:14 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> I managed to get a couple of hours to look at this this afternoon.
>
> On Mon, Oct 23, 2017 at 10:39 AM, Chris Hanks
> <christopher(dot)m(dot)hanks(at)gmail(dot)com> wrote:
>> I'm not sure if you guys had noticed this already, but just in case,
>> those two hung connections weren't making the inserts at exactly the
>> same time - if you look at pg_stat_activity they executed about a day
>> apart (since Heroku cycles the clients every 24 hours or so). And
>> before I restarted the first DB that experienced the problem, there
>> were ~30 backends built up over the course of a month. It seems like
>> when one INSERT sticks, every following INSERT just stacks up on top
>> of it, trying to take out the same lock.
>
> Right. In both backtraces, we see that we're an inserter stuck on
> getting an exclusive buffer lock on the buffer containing block 0, the
> meta page block (GIN_METAPAGE_BLKNO). There is probably some
> session/backend that has acquired two buffer locks in an order that is
> inconsistent with these inserters, meaning that you get an
> undetectable deadlock. (The only alternative to that theory is that
> some backend sits on a meta page buffer lock for some other reason,
> but that seems much less likely.)
>
> The interesting question to my mind is: what is the other backend that
> acquires buffer locks in an incompatible order, resulting
> in this undetectable deadlock? What's it doing? (I worked for Heroku
> until quite recently; I may be able to open a back channel, with
> Chris' permission.)
>
> I remember expressing befuddlement about commit e95680832854c over a
> year ago, and never satisfying myself that it was correct [1]. I'm no
> expert on GIN, so I dropped it. It feels like that might be relevant
> here, since you seem to be using this GIN index with a queue table.
> That usage pattern is one where entire leaf pages in indexes tend to
> be routinely deleted and later recycled by VACUUM, at least with
> B-Trees [2]. Whereas, in general I think B-Tree (and presumably GIN)
> page deletion is fairly rare, since the entire page must be empty for
> it to happen.
>
> The follow-up bugfix commit, e2c79e14, added a ConditionalLockPage()
> to the insertion path in ginInsertCleanup(), while also adding a LockPage()
> to the VACUUM path. In case you missed it, those are *heavyweight*
> page lock acquisitions, not buffer lock acquisitions, which is pretty
> unconventional (I thought only hash still did that). Frankly, the
> e2c79e14 fix seems kind of bolted on (though I don't want to focus on
> that aspect right now).
>
> [1] https://postgr.es/m/CAM3SWZSDxqDBvUGOoNm0veVOwgJV3GDvoncYr6f5L16qo8MYRg@mail.gmail.com
> [2] https://brandur.org/postgres-queues
> --
> Peter Geoghegan
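
For reference, backends stuck this way show up in pg_stat_activity as
active sessions waiting on the buffer_content wait event. A minimal
diagnostic query along these lines (illustrative only; it assumes the
9.6 pg_stat_activity columns) lists them, oldest first:

    -- backends waiting on the buffer_content LWLock tranche (9.6)
    SELECT pid,
           wait_event_type,
           wait_event,
           now() - query_start AS running_for,
           left(query, 60)     AS query
    FROM pg_stat_activity
    WHERE wait_event_type = 'LWLockTranche'
      AND wait_event = 'buffer_content'
    ORDER BY query_start;

In the hung state described above, each stacked-up INSERT should appear
as its own row, all reporting the same wait event.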
