From owner-freebsd-stable@FreeBSD.ORG  Fri Mar 13 09:37:22 2009
Date: Fri, 13 Mar 2009 09:37:21 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Nick Withers <nick@nickwithers.com>
In-Reply-To: <1236920519.1490.30.camel@localhost>
Message-ID: <alpine.BSF.2.00.0903130935290.61873@fledge.watson.org>
References: <1236920519.1490.30.camel@localhost>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII
Cc: freebsd-stable@freebsd.org
Subject: Re: NICs locking up, "*tcp_sc_h"


On Fri, 13 Mar 2009, Nick Withers wrote:

> I recently installed my first amd64 system (currently running RELENG_7 from 
> 2009-03-11) to replace an aged ppc box and have been having dramas with the 
> network locking up.
>
> Breaking into the debugger manually and ps-ing shows the network card (e.g., 
> "[irq20:  fxp0+]") in state "LL" in "*tcp_sc_h". It seems the process(es) 
> trying to access the card at the time is / are in state "L" in "*tcp".
>
> I thought this may have been something-or-other in the fxp driver, so 
> installed an rl card and sadly ran into the issue again.
>
> The console appears unresponsive, but I can get into the debugger (and as 
> soon as I have, input I'd sent seems to "go through", e.g., if I hit "Enter" 
> a couple o' times, nothing happens; when I <Ctrl>+<Alt>+<Esc> into the 
> debugger a few login prompts pop up before the debugger output).
>
> A "where" on the fxp / rl process (thread?) gives (transcribed from the 
> console): ____

Sounds like a lock leak -- if you're running INVARIANTS, then "show alllocks" 
and "show allchains" would be useful.  I've had a report of a TCP lock leak 
possibly in tcp_input(), but haven't managed to track it down yet -- this 
could well be the same problem.
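
For anyone unfamiliar with the term, a "lock leak" is a code path that returns 
while still holding a lock, so every later acquirer blocks forever.  The 
sketch below is a user-space illustration in Python, not kernel code: 
handle_segment(), handle_segment_fixed() and is_leaked() are hypothetical 
names invented for this example, and nothing here is taken from tcp_input().

```python
import threading


def handle_segment(lock, segment):
    """Hypothetical handler with a leak: the early return skips release()."""
    lock.acquire()
    if not segment:       # error path
        return False      # BUG: returns with the lock still held
    lock.release()
    return True


def handle_segment_fixed(lock, segment):
    """Same logic, but the 'with' block releases the lock on every exit path."""
    with lock:
        return bool(segment)


def is_leaked(lock):
    """Probe: a non-blocking acquire fails only if the lock is still held."""
    if lock.acquire(blocking=False):
        lock.release()
        return False
    return True
```

After one call to handle_segment() with an empty segment the lock stays held, 
and any thread that then tries to take it waits indefinitely, which is the 
kind of symptom the "*tcp" wait states would be consistent with.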

Robert N M Watson
Computer Laboratory
University of Cambridge


>
> Tracing PID 31 tid 100030 td 0xffffff00012016e0
> sched_switch() at sched_switch+0xf1
> mi_switch() at mi_switch+0x18f
> turnstile_wait() at turnstile_wait+0x1cf
> _mtx_lock_sleep() at _mtx_lock_sleep+0x76
> syncache_lookup() at syncache_lookup+0x176
> syncache_expand() at syncache_expand+0x38
> tcp_input() at tcp_input+0xa7d
> ip_input() at ip_input+0xa8
> ether_demux() at ether_demux+0x1b9
> ether_input() at ether_input+0x1bb
> fxp_intr() at fxp_intr+0x233
> ithread_loop() at ithread_loop+0x17f
> fork_exit() at fork_exit+0x11f
> fork_trampoline() at fork_trampoline+0xe
> ____
>
> A "where" on a process stuck in "*tcp", in this case "[swi4: clock]",
> gave the somewhat similar:
> ____
>
> sched_switch() at sched_switch+0xf1
> mi_switch() at mi_switch+0x18f
> turnstile_wait() at turnstile_wait+0x1cf
> _rw_rlock() at _rw_rlock+0x8c
> ipfw_chk() at ipfw_chk+0x3ab2
> ipfw_check_out() at ipfw_check_out+0xb1
> pfil_run_hooks() at pfil_run_hooks+0x9c
> ip_output() at ip_output+0x367
> syncache_respond() at syncache_respond+0x2fd
> syncache_timer() at syncache_timer+0x15a
> (...)
> ____
>
> In this particular case, the fxp0 card is in a lagg with rl0, but this
> problem can be triggered with either card on its own...
>
> The scheduler is SCHED_ULE.
>
> I'm not too sure how to give more useful information than this, I'm
> afraid. It's a custom kernel, too... Do I need to supply information on
> what code actually exists at the relevant addresses (I'm not at all
> clued in on how to do this... Sorry!)? Should I chuck WITNESS,
> INVARIANTS et al. in?
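
On mapping the "function+offset" frames back to source lines: one common 
approach is to open a kernel image built with debug symbols in kgdb (or gdb) 
and issue "list *(symbol+offset)" for each frame.  The sketch below is only a 
convenience helper in Python that turns transcribed DDB frames into those 
commands.  parse_frames() and gdb_list_commands() are hypothetical names, and 
it assumes the frames look exactly like the ones quoted in this message.

```python
import re

# Matches frames such as "> tcp_input() at tcp_input+0xa7d"; the leading
# "> " quote marker is optional so quoted mail can be pasted in directly.
FRAME_RE = re.compile(r"^\s*(?:>\s*)?(\w+)\(\) at (\w+)\+(0x[0-9a-fA-F]+)\s*$")


def parse_frames(text):
    """Return a list of (symbol, offset) pairs, innermost frame first."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:  # lines such as "Tracing PID ..." or "(...)" are skipped
            frames.append((match.group(2), int(match.group(3), 16)))
    return frames


def gdb_list_commands(frames):
    """Build one gdb 'list' command per frame."""
    return ["list *(%s+0x%x)" % (symbol, offset) for symbol, offset in frames]


if __name__ == "__main__":
    sample = """
    > syncache_lookup() at syncache_lookup+0x176
    > syncache_expand() at syncache_expand+0x38
    > tcp_input() at tcp_input+0xa7d
    """
    for command in gdb_list_commands(parse_frames(sample)):
        print(command)
```

The offsets are only meaningful against the exact kernel binary that produced 
the trace, so the commands need to be run against the symbols for that build.
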
>
> I *think* every time this has been triggered there's been a "python2.5"
> process in the "*tcp" state. This machine runs net-p2p/deluge and
> generally has at least 100 TCP connections on the go at any given time.
>
> Can anyone give me a clue as to what I might do to track this down?
> Appreciate any pointers.
> -- 
> Nick Withers
> email: nick@nickwithers.com
> Web: http://www.nickwithers.com
> Mobile: +61 414 397 446
>