Re: [INTERFACES] timeouts in libpq- can libpq requests block forever/a very long time?

From: Dirk Niggemann <dirk-n(at)dircon(dot)co(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-interfaces(at)postgresql(dot)org
Subject: Re: [INTERFACES] timeouts in libpq- can libpq requests block forever/a very long time?
Date: 1999-10-03 01:51:07
Message-ID: 37F6B68B.5ABC566F@dircon.co.uk
Lists: pgsql-interfaces

Tom Lane wrote:
>
> Dirk Niggemann <dirk-n(at)dircon(dot)co(dot)uk> writes:
> > Would there be any problems with changing pqWait so it uses a timeout?
Actually, I've just gone away and implemented it - silly, silly me.
The timeout can be completely disabled and is settable from PQconnect
and from some new environment variables (PGTIMEOUT and PGCONNTIMEOUT).
It remains off by default.
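For the record, a minimal sketch of the shape of the change (not the
actual patch - the names and parameters here are illustrative): pqWait
boils down to a select() on the backend socket, and a timeout just
means passing select() a real timeval instead of NULL.

#include <sys/select.h>

/* Sketch only: wait for the backend socket to become readable, giving
 * up after timeout_secs.  timeout_secs <= 0 means "block forever",
 * which preserves the old behaviour and is the default. */
static int
wait_readable_with_timeout(int sock, int timeout_secs)
{
    fd_set rfds;
    struct timeval tv, *tvp = NULL;

    FD_ZERO(&rfds);
    FD_SET(sock, &rfds);
    if (timeout_secs > 0)
    {
        tv.tv_sec = timeout_secs;
        tv.tv_usec = 0;
        tvp = &tv;
    }
    /* select() returns 0 on timeout, >0 when data is ready */
    return select(sock + 1, &rfds, NULL, NULL, tvp);
}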
>
> I think this would be rather risky: if I issue a query that the backend
> takes a long time to compute, I don't want libpq timing out on me,
> *especially* not if its response is to close the connection rather than
> some fail-soft recovery technique.
What kind of fail-soft could be implemented here?
I suppose I could be nice and stick a PQrequestCancel in there, but
I'll have to make sure it doesn't frazzle any other bit of the system.
Hold that thought: PQrequestCancel is _designed_ to be called
asynchronously to all the other requests, so it's unlikely to cause
any major problems (have I missed something?).
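Something like this, reusing the wait helper sketched above (again a
sketch only - grace_secs and the return convention are made up for
illustration):

#include <libpq-fe.h>

/* On timeout, try a PQrequestCancel() before abandoning the
 * connection.  PQrequestCancel() opens its own short-lived connection
 * to the postmaster, so it shouldn't disturb the query in flight.
 * Returns 1 if the backend answered, 0 if we should give up. */
static int
wait_with_cancel(PGconn *conn, int timeout_secs, int grace_secs)
{
    if (wait_readable_with_timeout(PQsocket(conn), timeout_secs) > 0)
        return 1;               /* data arrived in time */

    if (PQrequestCancel(conn) &&
        wait_readable_with_timeout(PQsocket(conn), grace_secs) > 0)
        return 1;               /* backend aborted the query for us */

    return 0;                   /* network/backend really looks dead */
}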
That doesn't, however, deal with the main concern about the timeout:
there is no 100% reliable way to tell the difference between "this
query is running a long time without returning results" and "the
network is down" or "the backend is hung".
The elegant solution would be to implement some sort of
backend-ping protocol. The problem with this is that the backend
needs to be prepared to service it even while it's doing other
things - not the sort of thing the backend was designed to do.
BTW - as libpq doesn't set SO_KEEPALIVE, queries that run for more
than 15 minutes without any data returned will disconnect anyway
(provided things aren't too confused).
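An application that wants keepalive can also set it itself, since
libpq exposes the underlying socket via PQsocket(). A sketch, with
minimal error handling:

#include <stdio.h>
#include <sys/socket.h>
#include <libpq-fe.h>

static void
enable_keepalive(PGconn *conn)
{
    int on = 1;

    /* turn on TCP keepalive probes for the backend connection */
    if (setsockopt(PQsocket(conn), SOL_SOCKET, SO_KEEPALIVE,
                   (char *) &on, sizeof(on)) < 0)
        perror("setsockopt(SO_KEEPALIVE)");
}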

From my POV, the purpose behind this is not to deal with
long-computation queries. It's to deal with unattended automatic
applications that run on remote sites over flaky networks and splat
relatively small sets of inserts (<150 rows, consisting of 30-40
integers max) into a table in a central DB.
Now if I really wanted to do this right, I would build some sort of
reliable queuing middleware... except all that costs, well, real
money, and this is only an experimental project.
The applications do these uploads at regular intervals. The applications
have to recover and perform the inserts at a later stage if the network
is not working properly.
I expect the network not to be working properly.
The applications run 'unattended' and require no user interaction.
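The upload cycle looks roughly like this (a sketch - the two helpers
marked hypothetical are application code I'm assuming, not libpq
calls):

#include <libpq-fe.h>

static int try_insert_batch(PGconn *conn);  /* hypothetical */
static void clear_local_queue(void);        /* hypothetical */

static void
upload_cycle(PGconn *conn)
{
    if (PQstatus(conn) == CONNECTION_BAD)
        PQreset(conn);                  /* try to re-establish */

    if (PQstatus(conn) == CONNECTION_OK &&
        try_insert_batch(conn))         /* run the queued INSERTs */
        clear_local_queue();            /* forget rows that made it */
    /* otherwise leave the rows queued and retry at the next interval */
}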
> If the timeout is
> application-settable and defaults to infinity, it might be OK.
Yes, after having a closer look at PQrequestCancel I sort of see where
the problem lies.
libpq makes some effort at not closing the backend socket.
That's why I left the original behaviour in place.
Can you describe to me what the backend does on socket close?
I should really grovel through the code myself, but my main concerns
here are:
a) I guess the backend simply aborts the current query/transaction.
Nasty if you were running an enormous query that will take a long
time to run; otherwise not too catastrophic.
b) I'm concerned the backend might do some blocking somewhere without
a timeout - that is, can closing a socket at a bad time cause the
backend to get stuck?
c) Can a failed network connection cause the backend to get stuck in
a wait/blocking I/O?
Having hung backends lying around could be a serious problem, as that
would reduce the total number of backends available (max backends
does not change dynamically).

d) Am I misunderstanding here, or is PostgreSQL 'single-threaded'
with respect to requests over a single client connection? libpq
certainly gives that impression. Would shoving an empty query at the
backend while it is returning rows to you confuse it very badly?
e) Just like PQrequestCancel, could you (me?) build a sort of
PQrequestPing to confirm that nothing is reachable or functioning at
the remote end before libpq closes the socket?
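What I imagine (purely hypothetical - there is no such call in libpq)
is the same trick PQrequestCancel uses: open a brand-new connection
to the postmaster before declaring the old one dead. If even a fresh
connect() fails, the network or server really is gone; if it
succeeds, the backend is probably just busy:

#include <sys/socket.h>
#include <unistd.h>

/* Sketch of a hypothetical PQrequestPing: returns 1 if something is
 * still listening at the postmaster's address. */
static int
pq_request_ping(const struct sockaddr *addr, socklen_t addrlen)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    int alive = 0;

    if (sock >= 0)
    {
        if (connect(sock, addr, addrlen) == 0)
            alive = 1;          /* postmaster accepted the connection */
        close(sock);
    }
    return alive;
}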

>
> In my own applications using the asynchronous query features, I have not
> seen any actual hangups that could be improved by a timeout ... but I've
> mostly been working across LAN rather than WAN connections...
What hangs did you see?

The kind of problem I'm talking about is the sort where routers go
and disappear on you... after you've opened the connection... or
someone chops a cable on a remote site... all fun stuff...
Try this sometime:
a) open a connection to a remote LAN over a single router (with no
backup/standby)...
b) start shoving data over it.
c) shoot the router by your preferred method (axe, fire, power plug,
admin typo, dripping aircon (my favourite)).
d) see how long your socket connection takes to work out the
network's gone away...

>
> > I would also like to add a timeout for the connect operation on the
> > socket-
> > on a sufficiently frazzled WAN this could block for a long time-
>
> This might be useful, but again I wonder whether libpq ought to be
> second-guessing the transport system about how long is reasonable to
> wait for a response.
Done as well.
Second-guessing the transport system is a good idea sometimes - the
default TCP/IP timeouts are all rather long. Adjusting them generally
means fiddling with kernel parameters (which might break the kernel,
or some other apps).
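The connect timeout is the standard non-blocking-connect dance
(a sketch; a real version would also check SO_ERROR after select()
to distinguish success from failure):

#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>

static int
connect_with_timeout(int sock, const struct sockaddr *addr,
                     socklen_t addrlen, int timeout_secs)
{
    fd_set wfds;
    struct timeval tv;

    /* start the connect without blocking */
    fcntl(sock, F_SETFL, fcntl(sock, F_GETFL, 0) | O_NONBLOCK);
    if (connect(sock, addr, addrlen) == 0)
        return 0;               /* connected immediately (local case) */

    FD_ZERO(&wfds);
    FD_SET(sock, &wfds);
    tv.tv_sec = timeout_secs;
    tv.tv_usec = 0;

    /* writable within the deadline => the connect has completed */
    return (select(sock + 1, NULL, &wfds, NULL, &tv) > 0) ? 0 : -1;
}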

I certainly have seen problems on Solaris connecting to DEC (Oracle)
Rdb over a flaky WAN. The application would occasionally 'stick' for
at least 8-10 minutes when we simulated the problem by pulling some
cables; in reality it would regularly stick for half an hour.
(Routers went up and down and let packets drip through on occasion,
making both ends think the network was just extremely slow rather
than broken.)
Oracle's connection timeout solution is particularly insidious (I
have this by word of mouth only): it seems to fork children to
asynchronously handle the requests that time out. That crashed a
machine when 30 concurrent client apps all suddenly lost their
connection to the same database.

Apologies for seeming a bit argumentative on this point...
BTW - I just realized another problem: we can't know for certain if a
commit happened or not if we get a timeout... eek...
Example:
normal flow:
COMMIT->
->COMMIT
<-RESPONSE
<-RESPONSE

error flow:
COMMIT->
->COMMIT
<-ERROR
<-ERROR

timeout flow (ok):
COMMIT->
timeout
close
->close
ABORT
(transaction assumed aborted).

timeout flow (bad):
COMMIT->
->COMMIT
<-RESPONSE
timeout
close
(transaction assumed aborted but has actually committed)
->close

I can't seem to come up with the worst-case scenario
(e.g. a false positive on commit)...
maybe it's not possible...

Or do we have a three-way handshake for commits?
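One application-level way to resolve the ambiguity (a sketch - the
upload_log table and batch_id are hypothetical): write a unique batch
id in the same transaction as the data, and after a
timeout-and-reconnect, check whether it got there.

#include <stdio.h>
#include <libpq-fe.h>

/* Returns 1 if the batch identified by batch_id actually committed. */
static int
commit_really_happened(PGconn *conn, const char *batch_id)
{
    char query[256];
    PGresult *res;
    int committed;

    snprintf(query, sizeof(query),
             "SELECT 1 FROM upload_log WHERE batch_id = '%s'", batch_id);
    res = PQexec(conn, query);
    committed = (PQresultStatus(res) == PGRES_TUPLES_OK &&
                 PQntuples(res) > 0);
    PQclear(res);
    return committed;
}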

>
> regards, tom lane

Thanks,

Dirk
