From: | Erik Hesselink <hesselink(at)gmail(dot)com> |
---|---|
To: | Merlin Moncure <mmoncure(at)gmail(dot)com> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Deadlock in libpq |
Date: | 2011-03-24 14:07:51 |
Message-ID: | AANLkTi=8rsg6ghaf+kLF1+ed7VqS9njARzJ2W442JOAP@mail.gmail.com |
Lists: | pgsql-general |
On Thu, Mar 24, 2011 at 14:23, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, Mar 24, 2011 at 4:17 AM, Erik Hesselink <hesselink(at)gmail(dot)com> wrote:
>> Hi,
>>
>> We're getting a deadlock in our application (a web application with a
>> PostgreSQL backend) which I've traced to libpq. I've started our
>> application in gdb, and when it hangs, I've inspected the backtraces.
>> I've found a couple of threads I can account for (listening for new
>> connections, background processes) and 77 threads waiting for a mutex
>> lock:
>>
>> #0 0x00007ffff523d464 in __lll_lock_wait () from /lib/libpthread.so.0
>> #1 0x00007ffff52385d9 in _L_lock_953 () from /lib/libpthread.so.0
>> #2 0x00007ffff52383fb in pthread_mutex_lock () from /lib/libpthread.so.0
>> #3 0x00007ffff6160650 in ?? () from /usr/lib/libpq.so.5
>> ==> pg_lockingcallback
>> #4 0x00007ffff440b791 in ?? () from /lib/libcrypto.so.0.9.8
>> #5 0x00007ffff440bcc9 in ?? () from /lib/libcrypto.so.0.9.8
>> #6 0x00007ffff47652fb in SSL_new () from /lib/libssl.so.0.9.8
>> #7 0x00007ffff61604dc in ?? () from /usr/lib/libpq.so.5
>> ==> pqsecure_open_client
>> #8 0x00007ffff61525ce in PQconnectPoll () from /usr/lib/libpq.so.5
>> #9 0x00007ffff6152f5e in ?? () from /usr/lib/libpq.so.5
>> ==> connectDBComplete
>> #10 0x00007ffff6153c5f in PQconnectdb () from /usr/lib/libpq.so.5
>> #11 0x0000000000f9b518 in sccR_info ()
>> #12 0x0000000000000000 in ?? ()
>>
>> So it seems everything is waiting for a lock on a mutex from
>> pq_lockarray (in fe-secure.c@846). Does anybody have any idea how this
>> can happen? Is this something we're doing wrong (I hope so) or a bug
>> in libpq?
>>
>> Some background: this happens only after a couple of thousand requests
>> (each doing about 15 database calls), with occasional other requests
>> coming in at the same time. Our server uses a Haskell binding to libpq
>> (HDBC [1] and HDBC-postgresql [2]). Both client and server run on the
>> same machine, running 64bit Ubuntu 10.04. The database version is
>> "PostgreSQL 8.4.7 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real
>> (Ubuntu 4.4.3-4ubuntu5) 4.4.3, 64-bit". I'm not sure how to determine
>> the libpq version, but it is the most recent that comes with this
>> ubuntu. The changelogs for Ubuntu suggest 8.4.7 as well. Connections
>> are via TCP/IP to 127.0.0.1 with SSL turned on. The machine is under
>> some CPU load when this happens. There is plenty of free memory.
>>
>> When I turned off SSL or connected via domain sockets, we got different
>> errors that are possibly related: occasionally, the connection between
>> client (our app) and server (database) is lost. On the client, we get:
>>
>> connectPostgreSQL: server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>>
>> and on the server:
>>
>> could not send data to client: Broken pipe
>>
>> There is no further context around these messages.
>>
>> Any help would be greatly appreciated.
>
> How did you initialize ssl? You are waiting inside a lock that is
> getting set up inside the crypto library. Unless you are having some
> type of library initialization issue, I doubt that the problem is
> really inside libpq. Is your application multithreaded, and if so are
> you properly synchronizing access to the connection object, etc?

What exactly do you mean by "How did you initialize ssl"? I found
[1], which I did not know about. This seems to be a very non-local
problem: if one of our dependencies initializes ssl, and I use libpq
as well, this will go wrong. I've taken a quick look through all our
dependencies, and none seem to use libcrypto or libssl.

Our application is definitely multithreaded, as it is a web
application. But every database transaction creates its own connection
object, and connections are never shared between threads.

The problem is very hard to reproduce. I've taken all queries that
were performed when I last reproduced it, and have only those queries
(and inserts/updates) running in two concurrent loops, but so far
that hasn't reproduced the problem. Running our application for a
couple of hours with a script performing requests against it does
reproduce it, though.
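
For illustration, the two-loop reproduction attempt described above could be sketched roughly as follows. This is a minimal Python sketch, not the actual Haskell/HDBC setup: `FakeConnection`, `worker` and `run_stress` are hypothetical names, and the fake connection is only a stand-in so the harness runs without a database. Each loop opens a fresh connection per transaction and never shares it, and a timeout is used to notice threads that are stuck.

```python
import threading
import time


class FakeConnection:
    """Stand-in for a real database connection (hypothetical, illustration only)."""

    def __init__(self):
        self.closed = False

    def execute(self, statement):
        # A real driver would send the statement to the server here.
        return statement

    def close(self):
        self.closed = True


def worker(connect, statements, iterations, progress, index):
    """Replay the recorded statements, one fresh connection per transaction."""
    for _ in range(iterations):
        conn = connect()  # new connection every time, never shared between threads
        try:
            for statement in statements:
                conn.execute(statement)
        finally:
            conn.close()
        progress[index] += 1  # each thread writes only its own slot


def run_stress(connect, statements, threads=2, iterations=1000, timeout=60.0):
    """Run the loops concurrently.

    Returns (finished, progress): finished is False if any thread was still
    running when the timeout expired, which is the moment to attach a debugger.
    """
    progress = [0] * threads
    pool = [
        threading.Thread(
            target=worker,
            args=(connect, statements, iterations, progress, i),
            daemon=True,  # do not keep the process alive if a thread is stuck
        )
        for i in range(threads)
    ]
    for thread in pool:
        thread.start()
    deadline = time.monotonic() + timeout
    for thread in pool:
        thread.join(max(0.0, deadline - time.monotonic()))
    finished = not any(thread.is_alive() for thread in pool)
    return finished, progress
```

To exercise a real client library, `connect` would be replaced by a function that opens a connection to 127.0.0.1 with SSL turned on; which driver to use, and whether it reaches the same libpq code path, is left as an assumption.
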

If this seems to be a problem inside libpq, should I create a bug
report? I'm hesitant, as I don't have any steps to reproduce.
--
Erik Hesselink
http://silkapp.com
[1] http://www.postgresql.org/docs/8.4/static/libpq-ssl.html#LIBPQ-SSL-INITIALIZE