Re: Lots of stuck queries after upgrade to 9.4

From: Spiros Ioannou <sivann(at)inaccess(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Lots of stuck queries after upgrade to 9.4
Date: 2015-07-30 13:23:51
Message-ID: CACKh8C8B+SY61gZipO4rBR1jbZLo1DO=MvQtoDGD5tziBdfzYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I'm very sorry but we don't have a synthetic load generator for our testing
setup, only production and that is on SLA. I would be happy to test the
next release though.

*Spiros Ioannou IT Manager, inAccesswww.inaccess.com
<http://www.inaccess.com>M: +30 6973-903808T: +30 210-6802-358*

On 29 July 2015 at 13:42, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 07/28/2015 11:36 PM, Heikki Linnakangas wrote:
>
>> A-ha, I succeeded to reproduce this now on my laptop, with pgbench! It
>> seems to be important to have a very large number of connections:
>>
>> pgbench -n -c400 -j4 -T600 -P5
>>
>> That got stuck after a few minutes. I'm using commit_delay=100.
>>
>> Now that I have something to work with, I'll investigate this more
>> tomorrow.
>>
>
> Ok, it seems that this is caused by the same issue that I found with my
> synthetic test case, after all. It is possible to get a lockup because of
> it.
>
> For the archives, here's a hopefully easier-to-understand explanation of
> how the lockup happens. It involves three backends. A and C are insertion
> WAL records, while B is flushing the WAL with commit_delay. The byte
> positions 2000, 2100, 2200, and 2300 are offsets within a WAL page. 2000
> points to the beginning of the page, while the others are later positions
> on the same page. WaitToFinish() is an abbreviation for
> WaitXLogInsertionsToFinish(). "Update pos X" means a call to
> WALInsertLockUpdateInsertingAt(X). "Reserve A-B" means a call to
> ReserveXLogInsertLocation, which returned StartPos A and EndPos B.
>
> Backend A Backend B Backend C
> --------- --------- ---------
> Acquire InsertLock 2
> Reserve 2100-2200
> Calls WaitToFinish()
> reservedUpto is 2200
> sees that Lock 1 is
> free
> Acquire InsertLock 1
> Reserve 2200-2300
> GetXLogBuffer(2200)
> page not in cache
> Update pos 2000
> AdvanceXLInsertBuffer()
> run until about to
> acquire WALWriteLock
> GetXLogBuffer(2100)
> page not in cache
> Update pos 2000
> AdvanceXLInsertBuffer()
> Acquire WALWriteLock
> write out old page
> initialize new page
> Release WALWriteLock
> finishes insertion
> release InsertLock 2
> WaitToFinish() continues
> sees that lock 2 is
> free. Returns 2200.
>
> Acquire WALWriteLock
> Call WaitToFinish(2200)
> blocks on Lock 1,
> whose initializedUpto
> is 2000.
>
> At this point, there is a deadlock between B and C. B is waiting for C to
> release the lock or update its insertingAt value past 2200, while C is
> waiting for WALInsertLock, held by B.
>
> To fix that, let's fix GetXLogBuffer() to always advertise the exact
> position, not the beginning of the page (except when inserting the first
> record on the page, just after the page header, see comments).
>
> This fixes the problem for me. I've been running pgbench for about 30
> minutes without lockups now, while without the patch it locked up within a
> couple of minutes. Spiros, can you easily test this patch in your
> environment? Would be nice to get a confirmation that this fixes the
> problem for you too.
>
> - Heikki
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Melvin Davidson 2015-07-30 13:42:01 user connection not recorded?
Previous Message Curt Micol 2015-07-30 13:22:03 Re: Logical decoding off of a replica?