From owner-freebsd-net@FreeBSD.ORG  Mon Feb  2 18:44:20 2009
Date: Mon, 2 Feb 2009 18:44:20 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Mitar <mmitar@gmail.com>
In-Reply-To: <f63c4b2d0901311625n16cd1103ve9f765673d7047cf@mail.gmail.com>
Message-ID: <alpine.BSF.2.00.0902021832350.77103@fledge.watson.org>
References: <f63c4b2d0901311625n16cd1103ve9f765673d7047cf@mail.gmail.com>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-net@freebsd.org
Subject: Re: read() returns ETIMEDOUT


On Sun, 1 Feb 2009, Mitar wrote:

> Is there any progress on this error reported:
>
> http://freebsd.monkey.org/freebsd-net/200805/msg00026.html
>
> I have the same or a very similar issue. On my server, large HTTP uploads are 
> failing: while a request is being received, data flows in one direction only, 
> and after some time the kernel drops the connection, with read() returning 
> ETIMEDOUT, even though the transfer is proceeding fine at a steady speed. I 
> have tested this at different speeds.
>
> Is there any workaround currently known?

Given that some time has passed since the previous reports, it's probably best 
to do a diagnosis from scratch rather than assume it's necessarily the same. 
Could you send us the output of the following commands:

sysctl net.inet.tcp | grep keep

There are a number of situations in which ETIMEDOUT may be set when a 
connection is dropped, so we should figure out which one(s) it may be:

(1) TCP keepalive timer fires and finds one of the following cases: the
     connection isn't yet established, or the connection has been idle past
     the keepalive limit without answering probes.  (tcp_timer_keep)

(2) TCP persist timer fires because the window is closed and the full
     exponential backoff has occurred.  (tcp_timer_persist)

(3) TCP retransmit timer reaches its full exponential backoff without being
     ACK'd.  (tcp_timer_rexmt)
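
Before patching the kernel, a cheaper first look may be the system-wide TCP 
counters, which keep a separate drop count for each of the three timers. The 
sketch below assumes the counter wording printed by FreeBSD's netstat -s 
("dropped by rexmit timeout", "dropped by persist timeout", "dropped by 
keepalive"); the function name timeout_drops is made up for illustration. The 
counters are global, so they only narrow things down if the box is otherwise 
quiet or you compare values before and after a failed upload.

```sh
#!/bin/sh
# Sketch: show the per-timer "connections dropped" counters for TCP.
# Assumes FreeBSD-style "netstat -s -p tcp" wording; adjust the pattern
# if your release words these lines differently.
timeout_drops() {
    netstat -s -p tcp |
        grep -E 'dropped by (rexmit timeout|persist timeout|keepalive)'
}
```

Run timeout_drops once before starting an upload and again after it fails; 
whichever counter moved points at the timer involved.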

There are a few ways to go about this -- probably the easiest is to drop a 
kernel printf just before each call to tcp_drop(tp, ETIMEDOUT) in tcp_timer.c.
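
To find where those printfs would go, something like the following should 
list the candidate lines. It is only a sketch: it assumes the kernel sources 
are installed under the usual /usr/src, the function name find_drop_sites is 
hypothetical, and it matches on ETIMEDOUT alone because a call site may not 
be written exactly as tcp_drop(tp, ETIMEDOUT).

```sh
#!/bin/sh
# Sketch: print line numbers of every ETIMEDOUT reference in tcp_timer.c.
# The optional first argument overrides the default source path.
find_drop_sites() {
    grep -n 'ETIMEDOUT' "${1:-/usr/src/sys/netinet/tcp_timer.c}"
}
```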

It would also be useful, if possible, to look at the tcpdump for the last 
portion of the connection, perhaps ideally from the second-to-last ACK from 
the remote host to the connection reset from the local end.  It might be worth 
running tcpdump on both sides to see if they see the same thing -- for 
example, does one side think it's sending ACKs that the other never receives?
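
A capture along these lines, run on each end, should be enough. This is a 
sketch: capture_conn is a made-up wrapper, and the interface, peer address, 
port and output file are whatever applies on your hosts. It writes full 
packets (-s 0) to a file so the two traces can be compared afterwards.

```sh
#!/bin/sh
# Sketch: record one TCP conversation to a pcap file for later comparison.
# Arguments: interface, peer address, TCP port, output file.
capture_conn() {
    # -n: no name lookups; -s 0: capture whole packets; -w: write raw pcap
    tcpdump -n -i "$1" -s 0 -w "$4" "host $2 and tcp port $3"
}
```

For example, capture_conn em0 192.0.2.10 80 /var/tmp/upload.pcap on the 
server (the interface name and address here are placeholders), with the 
server's address substituted when running it on the client.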

In the previous thread, it looked a bit like the outcome was that there was a 
memory exhaustion issue under load, and that bumping nmbclusters helped at 
least defer that problem.  So it would be useful to see the output of netstat 
-m just before and just after the connection times out, with the two samples 
as close together as you can manage.  I realize capturing this sort of data 
can be difficult on high-load boxes, but if you can, it would be quite 
helpful.  Regardless, knowing whether you're seeing allocation errors in the 
netstat -m output would be helpful.
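
Since the moment of the timeout is hard to predict, one option is to log 
timestamped netstat -m snapshots continuously and pick out the pair that 
brackets the failure afterwards. A minimal sketch, with a hypothetical 
function name and log path:

```sh
#!/bin/sh
# Sketch: append one timestamped "netstat -m" snapshot to the given log file.
snapshot_mbufs() {
    {
        date '+%Y-%m-%d %H:%M:%S'
        netstat -m
        echo
    } >> "$1"
}

# Typical use, left commented out so that sourcing this file does not loop:
#   while :; do snapshot_mbufs /var/tmp/mbuf.log; sleep 1; done
```

On the releases I'd expect here, the allocation failures show up as the 
"denied" lines of the netstat -m output, so grepping the log for "denied" 
is a reasonable first pass.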

Robert N M Watson
Computer Laboratory
University of Cambridge