Re: Possible fix for occasional failures on castoroides etc

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Dave Page <dpage(at)pgadmin(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Possible fix for occasional failures on castoroides etc
Date: 2014-05-03 18:59:28
Message-ID: 31133.1399143568@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Unfortunately, it seems the Solaris implementors didn't read Stevens,
> because it looks to me like they *do* return ECONNREFUSED on accept queue
> overflow. Still, it's hard to see how that would be the issue if we're
> still seeing this failure with only five clients.

Also, after further inspection of the source code, it appears to me that
the kernel's limit on accept queue length is hard-wired at 4096 in
Solaris. So there's basically no way that we're hitting that limit in the
regression tests, and the MAX_CONNECTIONS configuration is irrelevant.

We seem to be left with the race condition theory. In that connection,
this comment in /usr/src/uts/common/io/tl.c is interesting:

* The T_CONN_CON is generated when processing the T_CONN_REQ i.e. before
* a T_CONN_RES is received from the acceptor. This means that a socket
* connect will complete before the peer has called accept.

I'm not sure that explains the failure, but this behavior is probably
unlike any other implementation, which makes it perhaps relevant. It
implies that the problem is unrelated to any server-side behavior; so if
it's possible for us to work around it at all, we'd have to do so
client-side.

regards, tom lane
