Re: [Pgbuildfarm-members] Submission failures: 500 read timeout

From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Marti Raudsepp <marti(at)juffo(dot)org>,Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PGBuildFarm <pgbuildfarm-members(at)pgfoundry(dot)org>
Subject: Re: [Pgbuildfarm-members] Submission failures: 500 read timeout
Date: 2014-09-22 09:21:41
Message-ID: 541FEA25.7080605@kaltenbrunner.cc
Lists: buildfarm-members

On 09/22/2014 11:15 AM, Marti Raudsepp wrote:
> On Mon, Sep 15, 2014 at 7:15 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>> I have turned on request timing in the web logs. It looks like these status
>> uploads are typically taking 1 to 2 seconds to process. So I suspect it's
>> client-related.
>
> Well I managed to capture only 1 packet dump of this happening, on
> 2014-09-18 11:51:36 EEST. The problem seems to have disappeared, did
> configuration change on the server side? Or maybe it's just that fewer
> commits have been pushed recently. If anyone is interested, I can send
> the dump privately.
>
> I'm no expert on TCP, but it's conceivably a bug in the TCP stack. I'd
> like to collect a few more samples before bothering any networking
> people with it. Here's my understanding of what happened:
>
> 11:51:36.433 First packet of HTTP POST request is sent
> (data being sent)
> 11:51:39.254 Last packet of POST body
> 11:51:39.494 buildfarm responds with a SACK which, I believe,
> indicates a dropped packet
> (3 minutes pass silently)
> 11:54:38.010 My end sends a FIN, probably a timeout on client side,
> closing the socket
> 11:54:38.212 buildfarm responds with another SACK, repeating the missing packet
> (3 seconds, some retransmits occur for the missing data)
> 11:54:41.215 My end sends a RST (probably timeout because remote
> didn't have time to acknowledge the FIN yet)
> 11:54:41.236 Remote responds with "HTTP 200 OK", before it could have
> received my RST, but my local end no longer sees it because the
> connection is already reset.
>
> If my reading of RFC 2018 (SACK) is right, the sender must retransmit
> data after receiving a SACK packet if the missing data isn't
> acknowledged during the retransmit timeout. But this did not happen
> for 3 minutes. I don't know whether the receiver (buildfarm) should
> retransmit its SACK or not, but that only happened after it had
> received the FIN packet.
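
For reference, the gaps in the timeline above can be worked out directly
from the quoted timestamps. This is only a minimal sketch in Python: the
event labels are paraphrased from the summary, and the names EVENTS and
gaps are illustrative, not part of any buildfarm tooling.

```python
from datetime import datetime

# Timestamps copied from the quoted packet-dump summary (client local time).
# The descriptions are paraphrased; nothing here comes from the raw dump itself.
EVENTS = [
    ("11:51:36.433", "first packet of HTTP POST sent"),
    ("11:51:39.254", "last packet of POST body"),
    ("11:51:39.494", "buildfarm sends SACK"),
    ("11:54:38.010", "client sends FIN"),
    ("11:54:38.212", "buildfarm sends another SACK"),
    ("11:54:41.215", "client sends RST"),
    ("11:54:41.236", "buildfarm sends HTTP 200 OK"),
]


def gaps(events):
    """Return (seconds since previous event, description) for every event after the first."""
    times = [datetime.strptime(ts, "%H:%M:%S.%f") for ts, _ in events]
    return [
        (round((cur - prev).total_seconds(), 3), desc)
        for prev, cur, (_, desc) in zip(times, times[1:], events[1:])
    ]


for delta, desc in gaps(EVENTS):
    print(f"+{delta:8.3f}s  {desc}")
```

The long silent stretch is the gap between the first SACK and the FIN,
just under three minutes, which is far longer than any normal retransmit
timeout.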

Hard to say, but that description sounds like a problem that was common
about 10 years ago, when stateful firewalls started doing sequence-number
inspection and randomisation but were not yet SACK-aware.

It might be a long stretch, but maybe the path between your box and the
buildfarm box is a bit lossy (as in small but regular packet loss) _AND_
there is a device on either side that has a slightly broken stateful
inspection firewall (old Cisco PIX/ASA, some SonicWALLs, Cisco FWSM, very
very old Linux kernel ipchains/iptables issues) or very aggressive
timings on TCP sessions (i.e. misguided DoS prevention).
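
To illustrate the kind of breakage I mean, here is a rough sketch in
Python of a firewall that randomises (shifts) the client's sequence
numbers but does not apply the same shift to SACK blocks coming back.
Everything in it is hypothetical: the function names, the offset and the
byte ranges are made up for illustration and are not taken from the
packet dump. How a real TCP stack or firewall reacts to such a block
varies, so treat it as a picture of the mismatch, not a diagnosis.

```python
MOD = 2**32  # TCP sequence numbers are 32-bit and wrap around


def firewall_rewrite(ack, sack_blocks, offset, sack_aware):
    """Translate a server->client ACK through a firewall that added `offset`
    to the client's sequence numbers on the way out (hypothetical model)."""
    ack = (ack - offset) % MOD  # the header ACK field is always translated back
    if sack_aware:
        # A SACK-aware device also translates the edges inside the SACK option.
        sack_blocks = [((left - offset) % MOD, (right - offset) % MOD)
                       for left, right in sack_blocks]
    return ack, sack_blocks


def sack_usable(ack, sack_blocks, snd_nxt):
    """True if every SACK edge lies between the cumulative ACK and snd_nxt,
    i.e. refers to data the sender actually has outstanding."""
    def in_window(seq):
        return (seq - ack) % MOD <= (snd_nxt - ack) % MOD
    return all(in_window(left) and in_window(right) for left, right in sack_blocks)


# Made-up scenario: client sent bytes 1000..5000, bytes 2000..3000 were lost.
OFFSET = 1_000_000                        # shift applied by the firewall
server_ack = 2000 + OFFSET                # server's cumulative ACK, in its own numbering
server_sack = [(3000 + OFFSET, 5000 + OFFSET)]

for aware in (True, False):
    ack, blocks = firewall_rewrite(server_ack, server_sack, OFFSET, aware)
    print(f"sack_aware={aware}: ack={ack} blocks={blocks} "
          f"usable={sack_usable(ack, blocks, snd_nxt=5000)}")
```

With the non-SACK-aware device the client gets a sensible cumulative ACK
but SACK edges in a sequence space it never used, so it cannot tell from
them which data is missing.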

Stefan
