Re: Issue with retry_count in socket.c and fix for the same

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: "Itnal, Prakash (NSN - IN/Bangalore)" <prakash(dot)itnal(at)nsn(dot)com>, "pgsql-odbc(at)postgresql(dot)org" <pgsql-odbc(at)postgresql(dot)org>
Subject: Re: Issue with retry_count in socket.c and fix for the same
Date: 2014-05-15 11:25:05
Message-ID: 5374A411.4090907@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-odbc

On 05/15/2014 11:10 AM, Itnal, Prakash (NSN - IN/Bangalore) wrote:
> Recently we encountered a case where clients connected to postgres server through psqlODBC driver got strucked (hanged) when server unexpectedly shutdown (hard power off or LAN cable removed). The clients were hanged forever. I analyzed the code and found a potential issue with retry count logic in function SOCK_wait_for_ready() (in socket.c).
>
> The no_timeout variable is initialized to TRUE and is not modified if retry_count is positive number. That means if retry_count is passed then it will treat it as "no timeout". If I understood it correctly it should work other way. If retry_count is positive number then socket should wait with timeout and not indefinitely.
>
> I have changed the default initialization of no_timeout to FALSE. Attached is the patch. Kindly review the same and let me know comments.
>
> Copied the function below for quick reference: socket.c: SOCK_wait_for_ready()
>
> 528 static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count)
> 529 {
> 530 int ret, gerrno;
> 531 #ifdef HAVE_POLL
> 532 struct pollfd fds;
> 533 #else
> 534 fd_set fds, except_fds;
> 535 struct timeval tm;
> 536 #endif /* HAVE_POLL */
> 537 BOOL no_timeout = TRUE;
> 538
> 539 if (0 == retry_count)
> 540 no_timeout = FALSE;
> 541 else if (0 > retry_count)
> 542 no_timeout = TRUE;
> 543 #ifdef USE_SSL
> 544 else if (sock->ssl == NULL)
> 545 no_timeout = TRUE;
> 546 #endif /* USE_SSL */
> 547 do {
> 548 #ifdef HAVE_POLL
> 549 fds.fd = sock->socket;
> 550 fds.events = output ? POLLOUT : POLLIN;
> 551 fds.revents = 0;
> 552 ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000);
> 553 mylog("!!! poll ret=%d revents=%x\n", ret, fds.revents);
> 554 #else
> 555 FD_ZERO(&fds);
> 556 FD_ZERO(&except_fds);
> 557 FD_SET(sock->socket, &fds);
> 558 FD_SET(sock->socket, &except_fds);
> 559 if (!no_timeout)
> 560 {
> 561 tm.tv_sec = retry_count;
> 562 tm.tv_usec = 0;
> 563 }
> 564 ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout ? NULL : &tm);
> 565 #endif /* HAVE_POLL */
> 566 gerrno = SOCK_ERRNO;
> 567 } while (ret < 0 && EINTR == gerrno);
> 568 if (retry_count < 0)
> 569 retry_count *= -1;
> 570 if (0 == ret && retry_count > MAX_RETRY_COUNT)
> 571 {
> 572 ret = -1;
> 573 SOCK_set_error(sock, output ? SOCKET_WRITE_TIMEOUT : SOCKET_READ_TIMEOUT, "SOCK_wait_for_ready timeout");
> 574 }
> 575 return ret;
> 576 }

Ugh. Yes, it's clearly bogus as it is. no_timeout is only ever set to
FALSE, when retry_count is 0. And when retry_count is 0, the timeout is
also set to 0. So the whole timeout and retry_count thing is a very
complicated way of saying that when retry_count==0, just check if there
is any outstanding data to read / room for writing without blocking,
otherwise wait forever.

I'm not so sure it's supposed to behave like you described either,
though. For example, if we normally a timeout by default, why would we
not want a timeout when SSL is enabled? And the "0 > retry_count" check
is never going to be true (unless retry_count wraps around)

I wonder why we need any timeout or retry counters, ever. Frankly, I'm
inclined to just rip all that out. A timeout would be a useful feature
for many applications, to detect a broken connection more quickly, but
the way it's currently implemented is pretty much useless for that. For
starters, it would have to be configurable, because many applications
would not want to have a timeout at all. A built-in timeout of 60
seconds or so, which I think you'd get with your patch, would be too
short for OLAP kind of queries that can run for hours before returning
any results.

- Heikki

In response to

Responses

Browse pgsql-odbc by date

  From Date Subject
Next Message Itnal, Prakash (NSN - IN/Bangalore) 2014-05-15 12:13:44 Re: Issue with retry_count in socket.c and fix for the same
Previous Message Dev Kumkar 2014-05-15 09:46:25 Re: Fetching Date and Timestamp values