From owner-freebsd-stable@FreeBSD.ORG Fri Mar 13 09:37:22 2009
Date: Fri, 13 Mar 2009 09:37:21 +0000 (GMT)
From: Robert Watson
To: Nick Withers
Cc: freebsd-stable@freebsd.org
Subject: Re: NICs locking up, "*tcp_sc_h"
In-Reply-To: <1236920519.1490.30.camel@localhost>
References: <1236920519.1490.30.camel@localhost>

On Fri, 13 Mar 2009, Nick Withers wrote:

> I recently installed my first amd64 system (currently running RELENG_7 from
> 2009-03-11) to replace an aged ppc box and have been having dramas with the
> network locking up.
>
> Breaking into the debugger manually and ps-ing shows the network card (e.g.,
> "[irq20: fxp0+]") in state "LL" in "*tcp_sc_h". It seems the process(es)
> trying to access the card at the time is / are in state "L" in "*tcp".
>
> I thought this may have been something-or-other in the fxp driver, so
> installed an rl card and sadly ran into the issue again.
>
> The console appears unresponsive, but I can get into the debugger (and as
> soon as I have, input I'd sent seems to "go through", e.g., if I hit "Enter"
> a couple o' times, nothing happens; when I ++ into the
> debugger a few login prompts pop up before the debugger output).
>
> A "where" on the fxp / rl process (thread?) gives (transcribed from the
> console):
> ____

Sounds like a lock leak -- if you're running INVARIANTS, then "show alllocks"
and "show allchains" would be useful.  I've had a report of a TCP lock leak,
possibly in tcp_input(), but haven't managed to track it down yet -- this
could well be the same issue.

Robert N M Watson
Computer Laboratory
University of Cambridge

>
> Tracing PID 31 tid 100030 td 0xffffff00012016e0
> sched_switch() at sched_switch+0xf1
> mi_switch() at mi_switch+0x18f
> turnstile_wait() at turnstile_wait+0x1cf
> _mtx_lock_sleep() at _mtx_lock_sleep+0x76
> syncache_lookup() at syncache_lookup+0x176
> syncache_expand() at syncache_expand+0x38
> tcp_input() at tcp_input+0xa7d
> ip_input() at ip_input+0xa8
> ether_demux() at ether_demux+0x1b9
> ether_input() at ether_input+0x1bb
> fxp_intr() at fxp_intr+0x233
> ithread_loop() at ithread_loop+0x17f
> fork_exit() at fork_exit+0x11f
> fork_trampoline() at fork_trampoline+0xe
> ____
>
> A "where" on a process stuck in "*tcp", in this case "[swi4: clock]",
> gave the somewhat similar:
> ____
>
> sched_switch() at sched_switch+0xf1
> mi_switch() at mi_switch+0x18f
> turnstile_wait() at turnstile_wait+0x1cf
> _rw_rlock() at _rw_rlock+0x8c
> ipfw_chk() at ipfw_chk+0x3ab2
> ipfw_check_out() at ipfw_check_out+0xb1
> pfil_run_hooks() at pfil_run_hooks+0x9c
> ip_output() at ip_output+0x367
> syncache_respond() at syncache_respond+0x2fd
> syncache_timer() at syncache_timer+0x15a
> (...)
> ____
>
> In this particular case, the fxp0 card is in a lagg with rl0, but this
> problem can be triggered with either card on its own...
>
> The scheduler is SCHED_ULE.
>
> I'm not too sure how to give more useful information than this, I'm
> afraid. It's a custom kernel, too... Do I need to supply information on
> what code actually exists at the relevant addresses (I'm not at all
> clued in on how to do this... Sorry!)? Should I chuck WITNESS,
> INVARIANTS et al. in?
>
> I *think* every time this has been triggered there's been a "python2.5"
> process in the "*tcp" state. This machine runs net-p2p/deluge and
> generally has at least 100 TCP connections on the go at any given time.
>
> Can anyone give me a clue as to what I might do to track this down?
> Appreciate any pointers.
> --
> Nick Withers
> email: nick@nickwithers.com
> Web: http://www.nickwithers.com
> Mobile: +61 414 397 446
>
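
The reply above uses the term "lock leak": a code path that returns while
still holding a lock, so every later thread that needs that lock blocks
indefinitely, much like the threads parked in turnstile_wait() in the traces.
What follows is a minimal user-space sketch of that failure mode in Python.
It is not FreeBSD kernel code and not a diagnosis of this particular hang;
table_lock, lookup_leaky, lookup_fixed, and lock_is_free are hypothetical
names invented for illustration.

```python
import threading

# Hypothetical stand-in for a kernel mutex protecting a lookup table.
table_lock = threading.Lock()


def lookup_leaky(table, key):
    """Return table[key], but leak the lock when the key is missing."""
    table_lock.acquire()
    if key not in table:
        return None  # BUG: early return without releasing table_lock
    value = table[key]
    table_lock.release()
    return value


def lookup_fixed(table, key):
    """Same lookup; the with-block releases the lock on every exit path."""
    with table_lock:
        return table.get(key)


def lock_is_free(timeout=0.1):
    """Probe the lock from a second thread, as a blocked waiter would."""
    result = []

    def probe():
        got = table_lock.acquire(timeout=timeout)
        if got:
            table_lock.release()
        result.append(got)

    worker = threading.Thread(target=probe)
    worker.start()
    worker.join()
    return result[0]
```

Calling lookup_leaky() with a missing key leaves table_lock held, after which
lock_is_free() reports False until something releases the lock;
lookup_fixed() never leaves it held. In a kernel the equivalent waiter simply
never wakes up, which is why a dump of all held locks is the useful next step.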