From: | Malcolm MacLeod <malcolm(dot)macleod(at)tshwanedje(dot)com> |
---|---|
To: | "pgsql-odbc(at)postgresql(dot)org" <pgsql-odbc(at)postgresql(dot)org> |
Subject: | Re: Issue with retry_count in socket.c and fix for the same |
Date: | 2014-05-26 16:42:26 |
Message-ID: | 1401122546.14277.8.camel@watchmen.homenetwork |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-odbc |
On Thu, 2014-05-15 at 14:25 +0300, Heikki Linnakangas wrote:
> On 05/15/2014 11:10 AM, Itnal, Prakash (NSN - IN/Bangalore) wrote:
> > Recently we encountered a case where clients connected to postgres server through psqlODBC driver got strucked (hanged) when server unexpectedly shutdown (hard power off or LAN cable removed). The clients were hanged forever. I analyzed the code and found a potential issue with retry count logic in function SOCK_wait_for_ready() (in socket.c).
> >
> > The no_timeout variable is initialized to TRUE and is not modified if retry_count is positive number. That means if retry_count is passed then it will treat it as "no timeout". If I understood it correctly it should work other way. If retry_count is positive number then socket should wait with timeout and not indefinitely.
> >
> > I have changed the default initialization of no_timeout to FALSE. Attached is the patch. Kindly review the same and let me know comments.
> >
> > Copied the function below for quick reference: socket.c: SOCK_wait_for_ready()
> >
> > 528 static int SOCK_wait_for_ready(SocketClass *sock, BOOL output, int retry_count)
> > 529 {
> > 530 int ret, gerrno;
> > 531 #ifdef HAVE_POLL
> > 532 struct pollfd fds;
> > 533 #else
> > 534 fd_set fds, except_fds;
> > 535 struct timeval tm;
> > 536 #endif /* HAVE_POLL */
> > 537 BOOL no_timeout = TRUE;
> > 538
> > 539 if (0 == retry_count)
> > 540 no_timeout = FALSE;
> > 541 else if (0 > retry_count)
> > 542 no_timeout = TRUE;
> > 543 #ifdef USE_SSL
> > 544 else if (sock->ssl == NULL)
> > 545 no_timeout = TRUE;
> > 546 #endif /* USE_SSL */
> > 547 do {
> > 548 #ifdef HAVE_POLL
> > 549 fds.fd = sock->socket;
> > 550 fds.events = output ? POLLOUT : POLLIN;
> > 551 fds.revents = 0;
> > 552 ret = poll(&fds, 1, no_timeout ? -1 : retry_count * 1000);
> > 553 mylog("!!! poll ret=%d revents=%x\n", ret, fds.revents);
> > 554 #else
> > 555 FD_ZERO(&fds);
> > 556 FD_ZERO(&except_fds);
> > 557 FD_SET(sock->socket, &fds);
> > 558 FD_SET(sock->socket, &except_fds);
> > 559 if (!no_timeout)
> > 560 {
> > 561 tm.tv_sec = retry_count;
> > 562 tm.tv_usec = 0;
> > 563 }
> > 564 ret = select((int) sock->socket + 1, output ? NULL : &fds, output ? &fds : NULL, &except_fds, no_timeout ? NULL : &tm);
> > 565 #endif /* HAVE_POLL */
> > 566 gerrno = SOCK_ERRNO;
> > 567 } while (ret < 0 && EINTR == gerrno);
> > 568 if (retry_count < 0)
> > 569 retry_count *= -1;
> > 570 if (0 == ret && retry_count > MAX_RETRY_COUNT)
> > 571 {
> > 572 ret = -1;
> > 573 SOCK_set_error(sock, output ? SOCKET_WRITE_TIMEOUT : SOCKET_READ_TIMEOUT, "SOCK_wait_for_ready timeout");
> > 574 }
> > 575 return ret;
> > 576 }
>
> Ugh. Yes, it's clearly bogus as it is. no_timeout is only ever set to
> FALSE, when retry_count is 0. And when retry_count is 0, the timeout is
> also set to 0. So the whole timeout and retry_count thing is a very
> complicated way of saying that when retry_count==0, just check if there
> is any outstanding data to read / room for writing without blocking,
> otherwise wait forever.
>
> I'm not so sure it's supposed to behave like you described either,
> though. For example, if we normally a timeout by default, why would we
> not want a timeout when SSL is enabled? And the "0 > retry_count" check
> is never going to be true (unless retry_count wraps around)
>
> I wonder why we need any timeout or retry counters, ever. Frankly, I'm
> inclined to just rip all that out. A timeout would be a useful feature
> for many applications, to detect a broken connection more quickly, but
> the way it's currently implemented is pretty much useless for that. For
> starters, it would have to be configurable, because many applications
> would not want to have a timeout at all. A built-in timeout of 60
> seconds or so, which I think you'd get with your patch, would be too
> short for OLAP kind of queries that can run for hours before returning
> any results.
Just wanted to confirm that I have tracked this down as the cause for a
freeze/crash in our software as well.
I can get it to block forever on the select call just by
disconnecting/reconnecting network on a client a few times while it is
querying lots of data from the server.
I've tested the original patch and it does fix the issue, would be great
to see a final fix for this.
Thanks,
Malcolm MacLeod
From | Date | Subject | |
---|---|---|---|
Next Message | Rodrigo Fabiam | 2014-05-26 18:44:22 | Doubt about ODBC Driver |
Previous Message | Michael Paquier | 2014-05-26 03:35:00 | Re: Typo fixes in Solution.pm, part of MSVC scripts |