Re: Escaping a blocked sendto() syscall without causing a restart

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jerry Sievers <gsievers19(at)comcast(dot)net>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Escaping a blocked sendto() syscall without causing a restart
Date: 2013-01-17 21:38:40
Message-ID: 17914.1358458720@sss.pgh.pa.us
Lists: pgsql-admin

Jerry Sievers <gsievers19(at)comcast(dot)net> writes:
> Does anyone know if one of the signals below can be sent to break out
> of this state *without* the postmaster sensing a crashed backend?

> I've seen several times in the past at other companies, backends that
> will not respond to cancel nor SIGTERM due to a syscall that's blocked
> on IO.

> Quite often though apparently the backend would notice the broken
> socket eventually and receive the signals and exit cleanly.

> I've got one that's been wedged like that for a couple days now.

> I recall trying several in a similar situation a while ago and of
> course one of them interrupted the syscall all right but it was an
> abort and we got the customary spontaneous postmaster restart.

Offhand it looks to me like most signals would kick the backend off the
send() call ... but it would loop right back and try again. See
internal_flush() in pqcomm.c. (If you're using SSL, this diagnosis
may or may not apply.)
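
To illustrate the retry behavior described above, here is a minimal sketch of a flush loop that goes straight back to send() after a signal. It is not the actual internal_flush() code; the function name flush_all and its return convention are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, not PostgreSQL source: write all of buf to fd.
 * Returns 0 once everything has been sent, -1 on a hard error.
 */
int
flush_all(int fd, const char *buf, size_t len)
{
    assert(buf != NULL);

    while (len > 0)
    {
        ssize_t n = send(fd, buf, len, 0);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry the send */
            return -1;      /* hard failure, e.g. EPIPE or ECONNRESET */
        }

        /* Partial send: advance past the bytes that were accepted. */
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}
```

With a loop of this shape, a signal that merely interrupts the syscall changes nothing: the process re-enters send() and blocks again until the peer drains data or the kernel reports the connection as broken.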

We can't do anything except repeat the send attempt if the client
connection is to be kept in a sane state. It's possible that if the
interrupt was a SIGTERM (forced exit) we could mark the connection dead
and return early, but it would probably take some thought and
experimentation to get useful behavior that way. And I'm not at all
sure if we could get it to work in SSL mode ...

So the short answer is no, you probably can't kill the session without
causing a restart. Possibly we should add a TODO to make this better.

What you might consider instead, if this is a recurring problem, is
adjusting the postmaster-side TCP keepalive parameters so that dead
connections are noticed more quickly. The default connection timeout
according to the TCP standards is on the order of hours, but you can
reduce that quite a lot if your network environment is at all reliable.
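
On the server these are normally set in postgresql.conf through tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count. As a rough sketch of what such settings amount to at the socket level, the following shows the corresponding setsockopt() calls. The function name enable_keepalive is hypothetical, the TCP_KEEP* options are not available on every platform (hence the #ifdef guards), and any values passed in are illustrative rather than recommendations.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: turn on TCP keepalive for fd and, where the
 * platform supports it, tighten the probe timing.
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
int
enable_keepalive(int fd, int idle_secs, int interval_secs, int probe_count)
{
    int on = 1;

    assert(fd >= 0);    /* caller must pass an open socket */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;

#ifdef TCP_KEEPIDLE
    /* Seconds of idleness before the first probe is sent. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof(idle_secs)) < 0)
        return -1;
#endif
#ifdef TCP_KEEPINTVL
    /* Seconds between unanswered probes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_secs, sizeof(interval_secs)) < 0)
        return -1;
#endif
#ifdef TCP_KEEPCNT
    /* Number of unanswered probes before the connection is dropped. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &probe_count, sizeof(probe_count)) < 0)
        return -1;
#endif
    return 0;
}
```

For example, enable_keepalive(fd, 60, 10, 5) would have an idle connection probed after a minute and declared dead after roughly another 50 seconds of silence, instead of the hours implied by the defaults.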

(But it's not clear to me why your stuck-for-a-couple-days case wouldn't
have timed out long since. Are you sure this isn't a client-side
problem, i.e. the client is wedged? If so, why not kill the client instead?)

regards, tom lane
