
From: john gale <john(at)smadness(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: some postgres 2s SELECT queries are deadlocking forever in __sendto (libsystem_kernel.dylib)
Date: 2014-02-14 00:48:43
Message-ID: C8F8C964-F3C7-491A-9D88-5EF55BC00243@smadness.com
Lists: pgsql-bugs


We are running a Postgres 9.3.2 server on a Mac Pro running OS X 10.9.0. It's an automation results server, so it's heavily inserted into (30-80 queries per second) and periodically selected from (once every few minutes).

We created this instance with Postgres 9.3.0, and after a time ran into an issue where some SELECTs that usually completed in ~2s were deadlocking and still hanging around hours or days later. I read something under the release notes for 9.3.2 that seemed related to this issue, although I can't find it again now. After upgrading to 9.3.2, the issue went away for a month.

Now we are seeing this again, and I am wondering if this is a Postgres 9.3.* issue or possibly a Mac OS X issue.

The symptoms:

Running 'ps waux' shows a slowly increasing number of postgres processes stuck in SELECT. Retrieving the actual SELECT from pg_stat_activity and running it manually returns correct results within a few seconds, yet the query_start time for these stuck queries is hours in the past. As an example:

db=# select pid, backend_start, query_start, state_change, waiting, state, query from pg_stat_activity where query like '%SELECT%' order by query_start;
  pid  |         backend_start         |          query_start          |         state_change          | waiting | state  | query
-------+-------------------------------+-------------------------------+-------------------------------+---------+--------+-------
 22208 | 2014-02-13 13:47:11.846371+00 | 2014-02-13 13:47:11.852527+00 | 2014-02-13 13:47:11.852542+00 | f       | active | SELECT *, testruns.id FROM testruns JOIN test_types ON testruns.test_type_id=test_types.id WHERE testruns.custom_data->'SpawnID' = 'SpawnID-413873596.233222' AND testruns.started_at > '2014-02-11 04:53:16.233222'

You can see the query_start time of 5:47am local time; it's now 4:28pm, so the query has been in this state for almost 11 hours.
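
For reference, a variant of that query that computes the elapsed time directly is below (just a sketch; the 5-minute cutoff is an arbitrary threshold, tune to taste):

db=# select pid, now() - query_start as running_for, waiting, state, query
       from pg_stat_activity
      where state = 'active'
        and now() - query_start > interval '5 minutes'
      order by running_for desc;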

A sample (or rather a spindump, which is more detailed) shows the postmaster process responding to this SELECT waiting in __sendto in libsystem_kernel.dylib:

Process: postmaster [22208]
Path: /Library/PostgreSQL/9.3/bin/postmaster
Architecture: x86_64
Parent: postmaster [1051]
Responsible: postmaster [1051]
UID: 502
Task size: 7930 pages
Importance: Donating
IO Policy: Standard

Thread 0x7b3205 DispatchQueue 1 priority 31
1002 start + 52 (postmaster) [0x1000028c4]
1002 main + 667 (postmaster) [0x100285acb]
1002 PostmasterMain + 6970 (postmaster) [0x10033ca1a]
1002 ServerLoop + 707 (postmaster) [0x10033d763]
1002 BackendStartup + 461 (postmaster) [0x100341fcd]
1002 BackendRun + 680 (postmaster) [0x100342b98]
1002 PostgresMain + 2396 (postmaster) [0x1003bb41c]
1002 exec_simple_query + 1382 (postmaster) [0x1003b57e6]
1002 PortalRun + 702 (postmaster) [0x1003bd95e]
1002 PortalRunSelect + 235 (postmaster) [0x1003bdd8b]
1002 ExecutorRun + 88 (postmaster) [0x100225a58]
1002 standard_ExecutorRun + 314 (postmaster) [0x100225b9a]
1002 ExecutePlan + 214 (postmaster) [0x1002281e6]
1002 printtup + 554 (postmaster) [0x1000090da]
1002 pq_endmessage + 58 (postmaster) [0x10028493a]
1002 pq_putmessage + 187 (postmaster) [0x100283d5b]
1002 internal_putbytes + 54 (postmaster) [0x100283896]
1002 internal_flush + 98 (postmaster) [0x100283a92]
1002 __sendto + 10 (libsystem_kernel.dylib) [0x7fff8ac79a9a]
*1002 hndl_unix_scall64 + 22 (mach_kernel) [0xffffff80002f3e06]
*1002 unix_syscall64 + 499 (mach_kernel) [0xffffff800063de23]
*1002 sendto_nocancel + 177 (mach_kernel) [0xffffff800061f531]
*1002 ??? (mach_kernel + 4323232) [0xffffff800061f7a0]
*1002 sosend + 634 (mach_kernel) [0xffffff80006139aa]
*1002 sosendcheck + 583 (mach_kernel) [0xffffff8000613647]
*1002 sbwait + 297 (mach_kernel) [0xffffff800061a939]
*1002 msleep + 114 (mach_kernel) [0xffffff80005e1302]
*1002 ??? (mach_kernel + 4067541) [0xffffff80005e10d5]
*1002 lck_mtx_sleep + 78 (mach_kernel) [0xffffff800022d17e]
*1002 thread_block_reason + 204 (mach_kernel) [0xffffff8000235d4c]
*1002 ??? (mach_kernel + 223563) [0xffffff800023694b]
*1002 machine_switch_context + 354 (mach_kernel) [0xffffff80002d81d2]

Near the bottom of that stack the backend is sitting in sbwait under sosendcheck, i.e. waiting for space to open up in the socket's send buffer. Meanwhile, the client that called this SELECT is waiting in poll() for network traffic:

Process: Python [20954]
Path: /System/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
Architecture: x86_64
Parent: Python [20951]
UID: 501
Task size: 12802 pages

Thread 0x6ab05f DispatchQueue 1 priority 31
1001 start + 1 (libdyld.dylib) [0x7fff884287e1]
1001 Py_Main + 2715 (Python) [0x10ace0e27]
1001 PyRun_SimpleFileExFlags + 410 (Python) [0x10acbc726]
1001 PyRun_FileExFlags + 165 (Python) [0x10acbcd3c]
1001 ??? (Python + 339056) [0x10acbcc70]
1001 PyEval_EvalCode + 54 (Python) [0x10ac809b3]
1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
1001 ??? (Python + 121055) [0x10ac878df]
1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
1001 ??? (Python + 121055) [0x10ac878df]
1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
1001 ??? (Python + 121055) [0x10ac878df]
1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
1001 PyEval_EvalFrameEx + 3573 (Python) [0x10ac81f82]
1001 PyObject_Call + 97 (Python) [0x10ac796c6]
1001 ??? (Python + 331130) [0x10acbad7a]
1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
1001 ??? (Python + 121055) [0x10ac878df]
1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
1001 ??? (Python + 121055) [0x10ac878df]
1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
1001 PyEval_EvalFrameEx + 9389 (Python) [0x10ac8363a]
1001 ??? (Python + 121055) [0x10ac878df]
1001 PyEval_EvalCodeEx + 1934 (Python) [0x10ac81147]
1001 PyEval_EvalFrameEx + 9244 (Python) [0x10ac835a9]
1001 pg_query + 142 (_pg.so) [0x10b1e2efe]
1001 PQexecFinish + 41 (libpq.5.dylib) [0x10b1f7b02]
1001 PQgetResult + 136 (libpq.5.dylib) [0x10b1f783d]
1001 pqWaitTimed + 25 (libpq.5.dylib) [0x10b1f9e85]
1001 poll + 10 (libsystem_kernel.dylib) [0x7fff8abe1f96]
*1001 hndl_unix_scall64 + 19 (mach_kernel) [0xffffff80002cdd23]
*1001 unix_syscall64 + 522 (mach_kernel) [0xffffff80005e063a]
*1001 poll_nocancel + 798 (mach_kernel) [0xffffff80005778de]
*1001 kqueue_scan + 187 (mach_kernel) [0xffffff800054e1fb]
*1001 thread_block_reason + 300 (mach_kernel) [0xffffff800022da0c]
*1001 ??? (mach_kernel + 190273) [0xffffff800022e741]
*1001 machine_switch_context + 366 (mach_kernel) [0xffffff80002b403e]

The python pglib clients issuing these calls have been in use for years, and we did not see this issue during the years we ran Postgres 9.0.4, so I don't believe the clients are at fault. However, we were also on an earlier Mac OS version back then, so there's no single upgrade I can clearly blame.

I see no system.log messages at the time of this issue. We have bumped the shared memory and open file limits on this machine and seem to be well within them at the moment. Other postgres traffic is churning along fine in the other backends; it is only this slowly growing handful that seems stuck.
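
For what it's worth, the stuck backends can at least be cleared from SQL rather than with kill(1), along these lines (a sketch; pg_terminate_backend() sends SIGTERM, and I don't know whether a backend wedged inside sendto() will even notice it):

db=# select pid, pg_terminate_backend(pid)
       from pg_stat_activity
      where state = 'active'
        and now() - query_start > interval '1 hour';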

Does this sound related to anything that was fixed, or attempted, in 9.3.2? Does the sample backtrace look like it could be a Postgres issue, or could it be a Mac OS X / unix network issue?

Thanks for any help you can give.

~ john
