From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Spiros Ioannou <sivann(at)inaccess(dot)com>, Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> |
Subject: | Re: Lots of stuck queries after upgrade to 9.4 |
Date: | 2015-07-27 20:52:18 |
Message-ID: | 55B69A02.1050009@iki.fi |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On 07/23/2015 02:36 PM, Heikki Linnakangas wrote:
> On 07/23/2015 11:31 AM, Spiros Ioannou wrote:
>> Well, so far with commit_delay=0 no problems. I will report back of couse
>> if something happens, but I believe that the problem may indeed be
>> solved/masked with that setting.
>>
>> Rough description of our setup, or how to reproduce:
>> * Timeseries data in table , say, "measurements", size: 3-4TB, about 1000
>> inserts/second
>> * table measurements also has a trigger on insert to also insert on
>> measurements_a (for daily export purposes)
>> Just the above would cause a stuck query after a few days.
>>
>> Now for exporting we run the following CTE query (measurements_b is an
>> empty table, measurements_a has about 5GB)
>> * WITH d_rows AS (DELETE FROM measurement_events_a RETURNING * ) INSERT
>> INTO measurement_events_b SELECT * FROM d_rows;
>> The above caused the problem to appear every time, after a 10-20 minutes.
Hmm. With that CTE query, were there other queries running at the same time?
> I was able to reproduce something like this with pgbench, by running a
> custom little module that calls the WaitXLogInsertionsToFinish() in a
> tight loop, and checks that the value it returns moves monotonically
> forward. With commit_delay on, once every minute or so, it moves backwards.
>
> I'll investigate why that happens...
I was able to debug the synthetic test case I created, but unfortunately
I don't think it explains the lock up you're seeing after all.
It's possible for WaitXLogInsertionsToFinish() to move backwards, in
this scenario:
1. Backend A acquires WALInsertLock 2, and reserves xlog between byte
positions 2100 - 2200
2. Backend B calls WaitXLogInsertionsToFinish(), which blocks on backend
A, which hasn't advertised any location yet.
3. Backend C acquires WALInsertLock 1, and reserves xlog between byte
positions 2200 - 2300
4. Backend C calls GetXLogBuffer(), and sees that the page is not in
cache yet. (It does not call WALInsertLockUpdateInsertingAt() yet,
because it's a bit slow or context-switched out)
5. Backend A initializes the page, completes inserting its WAL record,
and releases its WALInsertLock.
6. Backend B gets unblocked, seeing that the lock held by B is now free.
It calculated 2200 as the return value, which was the latest reserved
WAL position. (Backend C started after it began, so it didn't have to
wait for it)
7. Backend C calls WALInsertLockUpdateInsertingAt(), with a WAL position
pointing to the beginning of the page, 2000.
If you now call WALInsertLockUpdateInsertingAt() again, it will return
2000, because backend C is the only backend holding a lock, and its
advertised position is 2000. But the previous call calculated 2200.
GetXLogBuffer() always advertises a WAL position at the beginning of the
requested page, but that's a bit bogus.
However, AFAICS that is actually harmless. Backend C is not blocked. The
page it's looking for is certainly in cache at this point, so it can
continue without blocking. So I don't think this explains your lockup.
- Heikki
From | Date | Subject | |
---|---|---|---|
Next Message | Sanjaya Vithanagama | 2015-07-28 00:36:52 | Re: Splitting a string containing a numeric value in to three parts |
Previous Message | dinesh kumar | 2015-07-27 20:02:15 | Re: Sorting CSV string and removing Duplicates |