Re: Lots of stuck queries after upgrade to 9.4

From: Spiros Ioannou <sivann(at)inaccess(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Subject: Re: Lots of stuck queries after upgrade to 9.4
Date: 2015-07-28 07:22:37
Message-ID: CACKh8C9kz=iLncE4R_zR5qcXrREOSKeFWkO4i=d=J6yHd5ufMw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

>
> Hmm. With that CTE query, were there other queries running at the same
> time?

yes, a multitude of INSERT and one COMMIT as I remember. I've noticed that
in every stuck situation there was always 1 stuck COMMIT running. We do
mainly INSERT and SELECT, very rarely UPDATE or DELETE.
but the stuck queries happen also without the CTE. The trigger on insert
runs non-stop though.

I think I've noticed a different behaviour with 9.4 and the CTE but I'm
just 50% sure about that, but may be relevant or not:
in 9.3 while the CTE was running, I think the trigger-filled table was
locked and no inserts could be made to the primary table (which copies all
inserts to the trigger table). In 9.4 inserts can be made at a lower rate
even when CTE runs. As I said not 100% sure about the behaviour was like
that in 9.3.

I was able to debug the synthetic test case I created, but unfortunately I
> don't think it explains the lock up you're seeing after all.
>
> It's possible for WaitXLogInsertionsToFinish() to move backwards, in this
> scenario:
>
> 1. Backend A acquires WALInsertLock 2, and reserves xlog between byte
> positions 2100 - 2200
> 2. Backend B calls WaitXLogInsertionsToFinish(), which blocks on backend
> A, which hasn't advertised any location yet.
> 3. Backend C acquires WALInsertLock 1, and reserves xlog between byte
> positions 2200 - 2300
> 4. Backend C calls GetXLogBuffer(), and sees that the page is not in cache
> yet. (It does not call WALInsertLockUpdateInsertingAt() yet, because it's a
> bit slow or context-switched out)
> 5. Backend A initializes the page, completes inserting its WAL record, and
> releases its WALInsertLock.
> 6. Backend B gets unblocked, seeing that the lock held by B is now free.
> It calculated 2200 as the return value, which was the latest reserved WAL
> position. (Backend C started after it began, so it didn't have to wait for
> it)
> 7. Backend C calls WALInsertLockUpdateInsertingAt(), with a WAL position
> pointing to the beginning of the page, 2000.
>
> If you now call WALInsertLockUpdateInsertingAt() again, it will return
> 2000, because backend C is the only backend holding a lock, and its
> advertised position is 2000. But the previous call calculated 2200.
> GetXLogBuffer() always advertises a WAL position at the beginning of the
> requested page, but that's a bit bogus.
>
> However, AFAICS that is actually harmless. Backend C is not blocked. The
> page it's looking for is certainly in cache at this point, so it can
> continue without blocking. So I don't think this explains your lockup.

Thanks for the investigation. The only think I can offer is that it's still
running fine with commit_delay = 0. Perhaps a warning should be put on the
documentation until the cause is found.
I've found a similar issue online:
http://dba.stackexchange.com/questions/96957/postgresql-9-4-1-stuck-all-queries-when-making-multi-updates
thanks
-Spiros

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Herouth Maoz 2015-07-28 07:27:11 Re: *SOLVED* Connections closing due to "terminating connection due to administrator command"
Previous Message 林士博 2015-07-28 05:30:25 Re: Using the database to validate data