Re: Lots of stuck queries after upgrade to 9.4

From: Spiros Ioannou <sivann(at)inaccess(dot)com>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Lots of stuck queries after upgrade to 9.4
Date: 2015-07-30 19:02:47
Message-ID: CACKh8C94k_kttBHDyr87-CFuu_G85FHvvk1puoc3ezGvjVbLVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

That would need a replica of the data probably which is not possible
(tablespace is 4TB).

*Spiros Ioannou IT Manager, inAccesswww.inaccess.com
<http://www.inaccess.com>M: +30 6973-903808T: +30 210-6802-358*

On 30 July 2015 at 21:47, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:

> You might want to try pg replay: http://laurenz.github.io/pgreplay/
>
> On Thu, Jul 30, 2015 at 7:23 AM, Spiros Ioannou <sivann(at)inaccess(dot)com>
> wrote:
>
>> I'm very sorry but we don't have a synthetic load generator for our
>> testing setup, only production and that is on SLA. I would be happy to test
>> the next release though.
>>
>>
>>
>>
>>
>>
>>
>>
>> *Spiros Ioannou IT Manager, inAccesswww.inaccess.com
>> <http://www.inaccess.com>M: +30 6973-903808T: +30 210-6802-358*
>>
>> On 29 July 2015 at 13:42, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>
>>> On 07/28/2015 11:36 PM, Heikki Linnakangas wrote:
>>>
>>>> A-ha, I succeeded to reproduce this now on my laptop, with pgbench! It
>>>> seems to be important to have a very large number of connections:
>>>>
>>>> pgbench -n -c400 -j4 -T600 -P5
>>>>
>>>> That got stuck after a few minutes. I'm using commit_delay=100.
>>>>
>>>> Now that I have something to work with, I'll investigate this more
>>>> tomorrow.
>>>>
>>>
>>> Ok, it seems that this is caused by the same issue that I found with my
>>> synthetic test case, after all. It is possible to get a lockup because of
>>> it.
>>>
>>> For the archives, here's a hopefully easier-to-understand explanation of
>>> how the lockup happens. It involves three backends. A and C are insertion
>>> WAL records, while B is flushing the WAL with commit_delay. The byte
>>> positions 2000, 2100, 2200, and 2300 are offsets within a WAL page. 2000
>>> points to the beginning of the page, while the others are later positions
>>> on the same page. WaitToFinish() is an abbreviation for
>>> WaitXLogInsertionsToFinish(). "Update pos X" means a call to
>>> WALInsertLockUpdateInsertingAt(X). "Reserve A-B" means a call to
>>> ReserveXLogInsertLocation, which returned StartPos A and EndPos B.
>>>
>>> Backend A Backend B Backend C
>>> --------- --------- ---------
>>> Acquire InsertLock 2
>>> Reserve 2100-2200
>>> Calls WaitToFinish()
>>> reservedUpto is 2200
>>> sees that Lock 1 is
>>> free
>>> Acquire InsertLock 1
>>> Reserve 2200-2300
>>> GetXLogBuffer(2200)
>>> page not in cache
>>> Update pos 2000
>>> AdvanceXLInsertBuffer()
>>> run until about to
>>> acquire WALWriteLock
>>> GetXLogBuffer(2100)
>>> page not in cache
>>> Update pos 2000
>>> AdvanceXLInsertBuffer()
>>> Acquire WALWriteLock
>>> write out old page
>>> initialize new page
>>> Release WALWriteLock
>>> finishes insertion
>>> release InsertLock 2
>>> WaitToFinish() continues
>>> sees that lock 2 is
>>> free. Returns 2200.
>>>
>>> Acquire WALWriteLock
>>> Call WaitToFinish(2200)
>>> blocks on Lock 1,
>>> whose initializedUpto
>>> is 2000.
>>>
>>> At this point, there is a deadlock between B and C. B is waiting for C
>>> to release the lock or update its insertingAt value past 2200, while C is
>>> waiting for WALInsertLock, held by B.
>>>
>>> To fix that, let's fix GetXLogBuffer() to always advertise the exact
>>> position, not the beginning of the page (except when inserting the first
>>> record on the page, just after the page header, see comments).
>>>
>>> This fixes the problem for me. I've been running pgbench for about 30
>>> minutes without lockups now, while without the patch it locked up within a
>>> couple of minutes. Spiros, can you easily test this patch in your
>>> environment? Would be nice to get a confirmation that this fixes the
>>> problem for you too.
>>>
>>> - Heikki
>>>
>>>
>>
>
>
> --
> To understand recursion, one must first understand recursion.
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Merlin Moncure 2015-07-30 19:05:38 Re: [GENERAL] How Many PG_Locks are considered too many
Previous Message Scott Marlowe 2015-07-30 18:47:34 Re: Lots of stuck queries after upgrade to 9.4