Re: Deadlock in XLogInsert at AIX

From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Deadlock in XLogInsert at AIX
Date: 2017-01-24 16:26:08
Message-ID: 100ec553-b398-7730-c676-57a5e375cb68@postgrespro.ru
Lists: pgsql-hackers

More information about the problem: the Postgres log contains several records like

2017-01-24 19:15:20.272 MSK [19270462] LOG: request to flush past end of generated WAL; request 6/AAEBE000, currpos 6/AAEBC2B0

and they correspond to the times when the deadlock happens.
There is the following comment in xlog.c concerning this message:

/*
 * No-one should request to flush a piece of WAL that hasn't even been
 * reserved yet. However, it can happen if there is a block with a bogus
 * LSN on disk, for example. XLogFlush checks for that situation and
 * complains, but only after the flush. Here we just assume that to mean
 * that all WAL that has been reserved needs to be finished. In this
 * corner-case, the return value can be smaller than 'upto' argument.
 */
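
The check that emits this message sits right below that comment in
WaitXLogInsertionsToFinish() and, as far as I can tell from the 9.6
sources, looks like this:

    if (upto > reservedUpto)
    {
        elog(LOG, "request to flush past end of generated WAL; "
             "request %X/%X, currpos %X/%X",
             (uint32) (upto >> 32), (uint32) upto,
             (uint32) (reservedUpto >> 32), (uint32) reservedUpto);
        upto = reservedUpto;
    }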

So it looks like this should not happen.
The first thing to suspect is the spinlock implementation, which is
different for GCC and XLC.
But... if I rebuild Postgres without spinlocks, the problem is still
reproduced.
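
For context, this is roughly the shape of a test-and-set spinlock
acquire/release (an illustrative sketch only, written with the GCC
builtins; PostgreSQL's real implementation lives in s_lock.h and is
compiler-specific, which is exactly why the GCC and xlc builds can
behave differently here):

    #include <sched.h>

    typedef volatile int slock_t;

    /* Illustrative sketch only -- not PostgreSQL's actual s_lock code. */
    static void
    spin_acquire(slock_t *lock)
    {
        /* __sync_lock_test_and_set() is a GCC builtin with acquire
         * semantics; an xlc build must provide its own primitive here. */
        while (__sync_lock_test_and_set(lock, 1))
            sched_yield();      /* the real code spins with backoff */
    }

    static void
    spin_release(slock_t *lock)
    {
        __sync_lock_release(lock);  /* stores 0 with release semantics */
    }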

On 24.01.2017 17:47, Konstantin Knizhnik wrote:
> Hi Hackers,
>
> We are running Postgres on AIX and encountered two strange problems:
> active zombie processes and a deadlock in the XLOG writer.
> I will explain the first problem in a separate mail; here I am mostly
> concerned with the deadlock.
> It is irregularly reproduced with standard pgbench launched with 100
> connections.
>
> It sometimes happens with the 9.6 stable version of Postgres, but only
> when it is compiled with the xlc compiler.
> We failed to reproduce the problem with GCC. So it looks like a bug in
> the compiler or in the xlc-specific atomics implementation...
> But there are a few observations which contradict this hypothesis:
>
> 1. The problem is reproduced with Postgres built without optimization.
> Usually compiler bugs affect only optimized code.
> 2. Disabling atomics doesn't help.
> 3. Without optimization and with LOCK_DEBUG defined, the time needed to
> reproduce the problem increases significantly. With optimized code it
> is almost always reproduced in a few minutes; with a debug build it
> usually takes much longer.
>
> But the most confusing thing is the stack trace:
>
> (dbx) where
> semop(??, ??, ??) at 0x9000000001f5790
> PGSemaphoreLock(sema = 0x0a00000044b95928), line 387 in "pg_sema.c"
> unnamed block in LWLockWaitForVar(lock = 0x0a0000000000d980, valptr =
> 0x0a0000000000d9a8, oldval = 102067874256, newval =
> 0x0fffffffffff9c10), line 1666 in "lwlock.c"
> LWLockWaitForVar(lock = 0x0a0000000000d980, valptr =
> 0x0a0000000000d9a8, oldval = 102067874256, newval =
> 0x0fffffffffff9c10), line 1666 in "lwlock.c"
> unnamed block in WaitXLogInsertionsToFinish(upto = 102067874328), line
> 1583 in "xlog.c"
> WaitXLogInsertionsToFinish(upto = 102067874328), line 1583 in "xlog.c"
> AdvanceXLInsertBuffer(upto = 102067874256, opportunistic = '\0'), line
> 1916 in "xlog.c"
> unnamed block in GetXLogBuffer(ptr = 102067874256), line 1697 in "xlog.c"
> GetXLogBuffer(ptr = 102067874256), line 1697 in "xlog.c"
> CopyXLogRecordToWAL(write_len = 70, isLogSwitch = '\0', rdata =
> 0x000000011007ce10, StartPos = 102067874256, EndPos = 102067874328),
> line 1279 in "xlog.c"
> XLogInsertRecord(rdata = 0x000000011007ce10, fpw_lsn = 102067718328),
> line 1011 in "xlog.c"
> unnamed block in XLogInsert(rmid = '\n', info = '@'), line 453 in
> "xloginsert.c"
> XLogInsert(rmid = '\n', info = '@'), line 453 in "xloginsert.c"
> log_heap_update(reln = 0x0000000110273540, oldbuf = 40544, newbuf =
> 40544, oldtup = 0x0fffffffffffa2a0, newtup = 0x00000001102bb958,
> old_key_tuple = (nil), all_visible_cleared = '\0',
> new_all_visible_cleared = '\0'), line 7708 in "heapam.c"
> unnamed block in heap_update(relation = 0x0000000110273540, otid =
> 0x0fffffffffffa6f8, newtup = 0x00000001102bb958, cid = 1, crosscheck =
> (nil), wait = '^A', hufd = 0x0fffffffffffa5b0, lockmode =
> 0x0fffffffffffa5c8), line 4212 in "heapam.c"
> heap_update(relation = 0x0000000110273540, otid = 0x0fffffffffffa6f8,
> newtup = 0x00000001102bb958, cid = 1, crosscheck = (nil), wait = '^A',
> hufd = 0x0fffffffffffa5b0, lockmode = 0x0fffffffffffa5c8), line 4212
> in "heapam.c"
> unnamed block in ExecUpdate(tupleid = 0x0fffffffffffa6f8, oldtuple =
> (nil), slot = 0x00000001102bb308, planSlot = 0x00000001102b4630,
> epqstate = 0x00000001102b2cd8, estate = 0x00000001102b29e0, canSetTag
> = '^A'), line 937 in "nodeModifyTable.c"
> ExecUpdate(tupleid = 0x0fffffffffffa6f8, oldtuple = (nil), slot =
> 0x00000001102bb308, planSlot = 0x00000001102b4630, epqstate =
> 0x00000001102b2cd8, estate = 0x00000001102b29e0, canSetTag = '^A'),
> line 937 in "nodeModifyTable.c"
> ExecModifyTable(node = 0x00000001102b2c30), line 1516 in
> "nodeModifyTable.c"
> ExecProcNode(node = 0x00000001102b2c30), line 396 in "execProcnode.c"
> ExecutePlan(estate = 0x00000001102b29e0, planstate =
> 0x00000001102b2c30, use_parallel_mode = '\0', operation = CMD_UPDATE,
> sendTuples = '\0', numberTuples = 0, direction = ForwardScanDirection,
> dest = 0x00000001102b7520), line 1569 in "execMain.c"
> standard_ExecutorRun(queryDesc = 0x00000001102b25c0, direction =
> ForwardScanDirection, count = 0), line 338 in "execMain.c"
> ExecutorRun(queryDesc = 0x00000001102b25c0, direction =
> ForwardScanDirection, count = 0), line 286 in "execMain.c"
> ProcessQuery(plan = 0x00000001102b1510, sourceText = "UPDATE
> pgbench_tellers SET tbalance = tbalance + 4019 WHERE tid = 6409;",
> params = (nil), dest = 0x00000001102b7520, completionTag = ""), line
> 187 in "pquery.c"
> unnamed block in PortalRunMulti(portal = 0x0000000110115e20,
> isTopLevel = '^A', setHoldSnapshot = '\0', dest = 0x00000001102b7520,
> altdest = 0x00000001102b7520, completionTag = ""), line 1303 in "pquery.c"
> unnamed block in PortalRunMulti(portal = 0x0000000110115e20,
> isTopLevel = '^A', setHoldSnapshot = '\0', dest = 0x00000001102b7520,
> altdest = 0x00000001102b7520, completionTag = ""), line 1303 in "pquery.c"
> PortalRunMulti(portal = 0x0000000110115e20, isTopLevel = '^A',
> setHoldSnapshot = '\0', dest = 0x00000001102b7520, altdest =
> 0x00000001102b7520, completionTag = ""), line 1303 in "pquery.c"
> unnamed block in PortalRun(portal = 0x0000000110115e20, count =
> 9223372036854775807, isTopLevel = '^A', dest = 0x00000001102b7520,
> altdest = 0x00000001102b7520, completionTag = ""), line 815 in "pquery.c"
> PortalRun(portal = 0x0000000110115e20, count = 9223372036854775807,
> isTopLevel = '^A', dest = 0x00000001102b7520, altdest =
> 0x00000001102b7520, completionTag = ""), line 815 in "pquery.c"
> unnamed block in exec_simple_query(query_string = "UPDATE
> pgbench_tellers SET tbalance = tbalance + 4019 WHERE tid = 6409;"),
> line 1094 in "postgres.c"
> exec_simple_query(query_string = "UPDATE pgbench_tellers SET tbalance
> = tbalance + 4019 WHERE tid = 6409;"), line 1094 in "postgres.c"
> unnamed block in PostgresMain(argc = 1, argv = 0x0000000110119f68,
> dbname = "postgres", username = "postgres"), line 4076 in "postgres.c"
> PostgresMain(argc = 1, argv = 0x0000000110119f68, dbname = "postgres",
> username = "postgres"), line 4076 in "postgres.c"
> BackendRun(port = 0x0000000110114290), line 4279 in "postmaster.c"
> BackendStartup(port = 0x0000000110114290), line 3953 in "postmaster.c"
> unnamed block in ServerLoop(), line 1701 in "postmaster.c"
> unnamed block in ServerLoop(), line 1701 in "postmaster.c"
> unnamed block in ServerLoop(), line 1701 in "postmaster.c"
> ServerLoop(), line 1701 in "postmaster.c"
> PostmasterMain(argc = 3, argv = 0x00000001100c6190), line 1309 in
> "postmaster.c"
> main(argc = 3, argv = 0x00000001100c6190), line 228 in "main.c"
>
>
> As I already mentioned, we built Postgres with LOCK_DEBUG, so we can
> inspect the lock owner. The backend is waiting for itself!
> Now please look at two frames in this stack trace: XLogInsertRecord
> and WaitXLogInsertionsToFinish.
> XLogInsertRecord acquires WAL insert locks at the beginning of the
> function:
>
> if (isLogSwitch)
>     WALInsertLockAcquireExclusive();
> else
>     WALInsertLockAcquire();
>
> WALInsertLockAcquire just selects an (effectively random) item from the
> WALInsertLocks array and locks it exclusively:
>
> if (lockToTry == -1)
>     lockToTry = MyProc->pgprocno % NUM_XLOGINSERT_LOCKS;
> MyLockNo = lockToTry;
> immed = LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);
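>
> (For reference, the insert-lock array is a small fixed-size partition;
> if I read the 9.6 sources correctly:
>
>     #define NUM_XLOGINSERT_LOCKS  8
>
> so with 100 pgbench connections every lock is shared by a dozen or
> more backends.)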
>
> Then, following the stack trace, AdvanceXLInsertBuffer calls
> WaitXLogInsertionsToFinish:
>
> /*
>  * Now that we have an up-to-date LogwrtResult value, see if we
>  * still need to write it or if someone else already did.
>  */
> if (LogwrtResult.Write < OldPageRqstPtr)
> {
>     /*
>      * Must acquire write lock. Release WALBufMappingLock first,
>      * to make sure that all insertions that we need to wait for
>      * can finish (up to this same position). Otherwise we risk
>      * deadlock.
>      */
>     LWLockRelease(WALBufMappingLock);
>
>     WaitXLogInsertionsToFinish(OldPageRqstPtr);
>
>     LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
>
>
> It releases WALBufMappingLock but not the WAL insert locks!
> Finally, WaitXLogInsertionsToFinish tries to wait for all of the
> insert locks:
>
> for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
> {
>     XLogRecPtr  insertingat = InvalidXLogRecPtr;
>
>     do
>     {
>         /*
>          * See if this insertion is in progress. LWLockWait will wait
>          * for the lock to be released, or for the 'value' to be set by
>          * a LWLockUpdateVar call. When a lock is initially acquired,
>          * its value is 0 (InvalidXLogRecPtr), which means that we
>          * don't know where it's inserting yet. We will have to wait
>          * for it. If it's a small insertion, the record will most
>          * likely fit on the same page and the inserter will release
>          * the lock without ever calling LWLockUpdateVar. But if it has
>          * to sleep, it will advertise the insertion point with
>          * LWLockUpdateVar before sleeping.
>          */
>         if (LWLockWaitForVar(&WALInsertLocks[i].l.lock,
>                              &WALInsertLocks[i].l.insertingAt,
>                              insertingat, &insertingat))
>
> And here we get stuck!
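>
> Note that in the normal case a backend should not block on its own
> lock here: GetXLogBuffer() advertises the backend's own insertion
> point just before calling AdvanceXLInsertBuffer() (paraphrased from
> the 9.6 sources, surrounding code abbreviated):
>
>     /* Let others know how far we have finished inserting, so that
>      * WaitXLogInsertionsToFinish() need not wait for us. */
>     WALInsertLockUpdateInsertingAt(initializedUpto);
>
>     AdvanceXLInsertBuffer(ptr, false);
>
> and LWLockWaitForVar() returns immediately when the advertised value
> already differs from 'oldval'. So to hang here on our own lock, the
> advertised insertion point must somehow not be visible to the waiter.
>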
> The comment to WaitXLogInsertionsToFinish says:
>
> * Note: When you are about to write out WAL, you must call this function
> * *before* acquiring WALWriteLock, to avoid deadlocks. This function might
> * need to wait for an insertion to finish (or at least advance to next
> * uninitialized page), and the inserter might need to evict an old WAL
> * buffer to make room for a new one, which in turn requires WALWriteLock.
>
> This contradicts the observed stack trace.
>
> I wonder: is this really a synchronization bug in xlog.c, or is there
> something wrong with this stack trace, so that it cannot happen during
> normal operation?
>
> Thanks in advance,
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
