From: | Spiros Ioannou <sivann(at)inaccess(dot)com> |
---|---|
To: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Subject: | Re: Lots of stuck queries after upgrade to 9.4 |
Date: | 2015-07-30 13:23:51 |
Message-ID: | CACKh8C8B+SY61gZipO4rBR1jbZLo1DO=MvQtoDGD5tziBdfzYQ@mail.gmail.com |
Lists: | pgsql-general |
I'm very sorry, but we don't have a synthetic load generator for our testing
setup. We only have production, and that is under an SLA. I would be happy to
test the next release, though.
*Spiros Ioannou
IT Manager, inAccess
www.inaccess.com <http://www.inaccess.com>
M: +30 6973-903808
T: +30 210-6802-358*
On 29 July 2015 at 13:42, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> On 07/28/2015 11:36 PM, Heikki Linnakangas wrote:
>
>> A-ha, I have now managed to reproduce this on my laptop, with pgbench! It
>> seems to be important to have a very large number of connections:
>>
>> pgbench -n -c400 -j4 -T600 -P5
>>
>> That got stuck after a few minutes. I'm using commit_delay=100.
>>
>> Now that I have something to work with, I'll investigate this more
>> tomorrow.
>>
>
> Ok, it seems that this is caused by the same issue that I found with my
> synthetic test case, after all. It is possible to get a lockup because of
> it.
>
> For the archives, here's a hopefully easier-to-understand explanation of
> how the lockup happens. It involves three backends. A and C are inserting
> WAL records, while B is flushing the WAL with commit_delay. The byte
> positions 2000, 2100, 2200, and 2300 are offsets within a WAL page. 2000
> points to the beginning of the page, while the others are later positions
> on the same page. WaitToFinish() is an abbreviation for
> WaitXLogInsertionsToFinish(). "Update pos X" means a call to
> WALInsertLockUpdateInsertingAt(X). "Reserve A-B" means a call to
> ReserveXLogInsertLocation, which returned StartPos A and EndPos B.
>
> Backend A                 Backend B                 Backend C
> ---------                 ---------                 ---------
> Acquire InsertLock 2
> Reserve 2100-2200
>                           Calls WaitToFinish()
>                             reservedUpto is 2200
>                             sees that Lock 1 is
>                             free
>                                                     Acquire InsertLock 1
>                                                     Reserve 2200-2300
>                                                     GetXLogBuffer(2200)
>                                                      page not in cache
>                                                      Update pos 2000
>                                                      AdvanceXLInsertBuffer()
>                                                       run until about to
>                                                       acquire WALWriteLock
> GetXLogBuffer(2100)
>  page not in cache
>  Update pos 2000
>  AdvanceXLInsertBuffer()
>   Acquire WALWriteLock
>   write out old page
>   initialize new page
>   Release WALWriteLock
> finishes insertion
> release InsertLock 2
>                           WaitToFinish() continues
>                             sees that lock 2 is
>                             free. Returns 2200.
>
>                           Acquire WALWriteLock
>                           Call WaitToFinish(2200)
>                             blocks on Lock 1,
>                             whose insertingAt
>                             is 2000.
>
> At this point, there is a deadlock between B and C. B is waiting for C to
> release the lock or update its insertingAt value past 2200, while C is
> waiting for WALWriteLock, held by B.
>
> To fix that, let's fix GetXLogBuffer() to always advertise the exact
> position, not the beginning of the page (except when inserting the first
> record on the page, just after the page header, see comments).
>
> This fixes the problem for me. I've been running pgbench for about 30
> minutes without lockups now, while without the patch it locked up within a
> couple of minutes. Spiros, can you easily test this patch in your
> environment? Would be nice to get a confirmation that this fixes the
> problem for you too.
>
> - Heikki
>
>
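For readers of the archive, the last step of the walkthrough quoted above can
be sketched as a small toy model in Python. This is only an illustration under
simplifying assumptions, not PostgreSQL code: InsertLock, flusher_blocks and
final_state are hypothetical names, and the model captures just one rule, that
a WaitToFinish()-style wait blocks on any held insertion lock whose advertised
position is still below the requested position.

```python
from dataclasses import dataclass

PAGE_START = 2000  # start of the WAL page in the walkthrough


@dataclass
class InsertLock:
    """Toy stand-in for one WAL insertion lock (not PostgreSQL's real struct)."""
    held: bool
    inserting_at: int  # position the lock holder has advertised


def flusher_blocks(locks, upto):
    """Return True if a WaitToFinish(upto)-style wait would block.

    The waiter blocks on every held lock whose advertised position is
    still below `upto`.
    """
    return any(lock.held and lock.inserting_at < upto for lock in locks)


def final_state(advertise_exact):
    """Replay the last step of the walkthrough and report the outcome.

    Backend C holds InsertLock 1 for the range 2200-2300 and is about to
    wait for WALWriteLock. Backend B already holds WALWriteLock and waits
    for insertions up to 2200.
    """
    c_start = 2200
    # Assumption: the only difference between the two cases is what C advertises.
    advertised = c_start if advertise_exact else PAGE_START
    lock1 = InsertLock(held=True, inserting_at=advertised)
    lock2 = InsertLock(held=False, inserting_at=0)  # Backend A has finished

    b_waits_for_c = flusher_blocks([lock1, lock2], upto=2200)
    c_waits_for_b = True  # C needs WALWriteLock, which B holds
    return "deadlock" if (b_waits_for_c and c_waits_for_b) else "B proceeds"


print(final_state(advertise_exact=False))  # page start advertised: deadlock
print(final_state(advertise_exact=True))   # exact position advertised: B proceeds
```

With the page start (2000) advertised, B's wait never ends, because C cannot
move forward without WALWriteLock. With the exact position (2200) advertised,
B's wait condition is already satisfied, so in this model B can finish its
flush and release WALWriteLock for C.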