From: | Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> |
---|---|
To: | Spiros Ioannou <sivann(at)inaccess(dot)com> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Subject: | Re: Lots of stuck queries after upgrade to 9.4 |
Date: | 2015-07-30 18:47:34 |
Message-ID: | CAOR=d=3m3pr4VTZxV4j+NL-tCOL=qHX3qtPFV2+O8T7G0zb3cA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
You might want to try pg replay: http://laurenz.github.io/pgreplay/
On Thu, Jul 30, 2015 at 7:23 AM, Spiros Ioannou <sivann(at)inaccess(dot)com> wrote:
> I'm very sorry but we don't have a synthetic load generator for our
> testing setup, only production and that is on SLA. I would be happy to test
> the next release though.
>
>
>
>
>
>
>
>
> *Spiros Ioannou IT Manager, inAccesswww.inaccess.com
> <http://www.inaccess.com>M: +30 6973-903808T: +30 210-6802-358*
>
> On 29 July 2015 at 13:42, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
>> On 07/28/2015 11:36 PM, Heikki Linnakangas wrote:
>>
>>> A-ha, I succeeded to reproduce this now on my laptop, with pgbench! It
>>> seems to be important to have a very large number of connections:
>>>
>>> pgbench -n -c400 -j4 -T600 -P5
>>>
>>> That got stuck after a few minutes. I'm using commit_delay=100.
>>>
>>> Now that I have something to work with, I'll investigate this more
>>> tomorrow.
>>>
>>
>> Ok, it seems that this is caused by the same issue that I found with my
>> synthetic test case, after all. It is possible to get a lockup because of
>> it.
>>
>> For the archives, here's a hopefully easier-to-understand explanation of
>> how the lockup happens. It involves three backends. A and C are insertion
>> WAL records, while B is flushing the WAL with commit_delay. The byte
>> positions 2000, 2100, 2200, and 2300 are offsets within a WAL page. 2000
>> points to the beginning of the page, while the others are later positions
>> on the same page. WaitToFinish() is an abbreviation for
>> WaitXLogInsertionsToFinish(). "Update pos X" means a call to
>> WALInsertLockUpdateInsertingAt(X). "Reserve A-B" means a call to
>> ReserveXLogInsertLocation, which returned StartPos A and EndPos B.
>>
>> Backend A Backend B Backend C
>> --------- --------- ---------
>> Acquire InsertLock 2
>> Reserve 2100-2200
>> Calls WaitToFinish()
>> reservedUpto is 2200
>> sees that Lock 1 is
>> free
>> Acquire InsertLock 1
>> Reserve 2200-2300
>> GetXLogBuffer(2200)
>> page not in cache
>> Update pos 2000
>> AdvanceXLInsertBuffer()
>> run until about to
>> acquire WALWriteLock
>> GetXLogBuffer(2100)
>> page not in cache
>> Update pos 2000
>> AdvanceXLInsertBuffer()
>> Acquire WALWriteLock
>> write out old page
>> initialize new page
>> Release WALWriteLock
>> finishes insertion
>> release InsertLock 2
>> WaitToFinish() continues
>> sees that lock 2 is
>> free. Returns 2200.
>>
>> Acquire WALWriteLock
>> Call WaitToFinish(2200)
>> blocks on Lock 1,
>> whose initializedUpto
>> is 2000.
>>
>> At this point, there is a deadlock between B and C. B is waiting for C to
>> release the lock or update its insertingAt value past 2200, while C is
>> waiting for WALInsertLock, held by B.
>>
>> To fix that, let's fix GetXLogBuffer() to always advertise the exact
>> position, not the beginning of the page (except when inserting the first
>> record on the page, just after the page header, see comments).
>>
>> This fixes the problem for me. I've been running pgbench for about 30
>> minutes without lockups now, while without the patch it locked up within a
>> couple of minutes. Spiros, can you easily test this patch in your
>> environment? Would be nice to get a confirmation that this fixes the
>> problem for you too.
>>
>> - Heikki
>>
>>
>
--
To understand recursion, one must first understand recursion.
From | Date | Subject | |
---|---|---|---|
Next Message | Spiros Ioannou | 2015-07-30 19:02:47 | Re: Lots of stuck queries after upgrade to 9.4 |
Previous Message | Adrian Klaver | 2015-07-30 17:22:10 | Re: Question about copy from with timestamp format |