Date: Fri, 13 Mar 2009 10:10:22 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
To: Nick Withers <nick@nickwithers.com>
Cc: freebsd-stable@freebsd.org
Subject: Re: NICs locking up, "*tcp_sc_h"
Message-ID: <alpine.BSF.2.00.0903131006080.61873@fledge.watson.org>
In-Reply-To: <1236938184.1490.40.camel@localhost>
References: <1236920519.1490.30.camel@localhost> <alpine.BSF.2.00.0903130935290.61873@fledge.watson.org> <1236938184.1490.40.camel@localhost>
On Fri, 13 Mar 2009, Nick Withers wrote:

> Sorry for the original double-post, by the way, not quite sure how that
> happened...
>
> I can reproduce this problem relatively easily, by the way (every 3 days,
> on average). I meant to say this before, too, but it seems to happen a lot
> more often on the fxp than on rl.
>
> I'm sorry to ask what is probably a very simple question, but is there
> somewhere I should look to get clues on debugging from a manually
> generated dump? I tried "panic" after manually invoking the kernel
> debugger but proved highly inept at getting from the dump the same
> information "ps" / "where" gave me within the debugger live.

If this is, in fact, a TCP input lock leak of some sort, then most likely
some particular property of a host your system talks to, or a network it
runs over, triggers this (presumably) unusual edge case -- perhaps a
firewall that mucks with TCP in a funny way, etc. Of course, it might be
something completely different -- the fact that everything is blocked on
*tcp_sc_h and *tcp simply means that something holding TCP locks hasn't
released them, and this could happen for a number of reasons.

Once you've acquired a crashdump, you can run crashinfo(8), which will
produce a summary of useful debugging information. There are some things
that are a bit easier to do in the run-time debugger, such as lock
analysis, since the run-time debugger is more up-close and personal with
in-kernel data structures; other things are easier in kgdb, which has
complete source code and C type access. I find kgdb works pretty well for
everything but "show me what locks are held".

Many of our system monitoring tools, including ps and portions of netstat,
can actually be run on crashdumps to report the state of the system at the
time it crashed -- take a look at the -M and -N command line arguments,
which respectively allow you to point those tools at the crashdump and at
a kernel with debugging symbols (typically kernel.debug or kernel.symbols)
matching the kernel that was booted at the time of the crash.

Robert N M Watson
Computer Laboratory
University of Cambridge

>
> Ta for your help!
>
>> Robert N M Watson
>> Computer Laboratory
>> University of Cambridge
>>
>>
>>>
>>> Tracing PID 31 tid 100030 td 0xffffff00012016e0
>>> sched_switch() at sched_switch+0xf1
>>> mi_switch() at mi_switch+0x18f
>>> turnstile_wait() at turnstile_wait+0x1cf
>>> _mtx_lock_sleep() at _mtx_lock_sleep+0x76
>>> syncache_lookup() at syncache_lookup+0x176
>>> syncache_expand() at syncache_expand+0x38
>>> tcp_input() at tcp_input+0xa7d
>>> ip_input() at ip_input+0xa8
>>> ether_demux() at ether_demux+0x1b9
>>> ether_input() at ether_input+0x1bb
>>> fxp_intr() at fxp_intr+0x233
>>> ithread_loop() at ithread_loop+0x17f
>>> fork_exit() at fork_exit+0x11f
>>> fork_trampoline() at fork_trampoline+0xe
>>> ____
>>>
>>> A "where" on a process stuck in "*tcp", in this case "[swi4: clock]",
>>> gave the somewhat similar:
>>> ____
>>>
>>> sched_switch() at sched_switch+0xf1
>>> mi_switch() at mi_switch+0x18f
>>> turnstile_wait() at turnstile_wait+0x1cf
>>> _rw_rlock() at _rw_rlock+0x8c
>>> ipfw_chk() at ipfw_chk+0x3ab2
>>> ipfw_check_out() at ipfw_check_out+0xb1
>>> pfil_run_hooks() at pfil_run_hooks+0x9c
>>> ip_output() at ip_output+0x367
>>> syncache_respond() at syncache_respond+0x2fd
>>> syncache_timer() at syncache_timer+0x15a
>>> (...)
>>> ____
>>>
>>> In this particular case, the fxp0 card is in a lagg with rl0, but this
>>> problem can be triggered with either card on its own...
>>>
>>> The scheduler is SCHED_ULE.
>>>
>>> I'm not too sure how to give more useful information than this, I'm
>>> afraid. It's a custom kernel, too... Do I need to supply information on
>>> what code actually exists at the relevant addresses (I'm not at all
>>> clued in on how to do this... Sorry!)? Should I chuck WITNESS,
>>> INVARIANTS et al. in?
>>>
>>> I *think* every time this has been triggered there's been a "python2.5"
>>> process in the "*tcp" state. This machine runs net-p2p/deluge and
>>> generally has at least 100 TCP connections on the go at any given time.
>>>
>>> Can anyone give me a clue as to what I might do to track this down?
>>> Appreciate any pointers.
>>> --
>>> Nick Withers
>>> email: nick@nickwithers.com
>>> Web: http://www.nickwithers.com
>>> Mobile: +61 414 397 446
>>>
> --
> Nick Withers
> email: nick@nickwithers.com
> Web: http://www.nickwithers.com
> Mobile: +61 414 397 446
>
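
[A minimal sketch of the crashdump workflow Robert describes above. The
paths (/var/crash/vmcore.0, /boot/kernel/kernel.debug) are assumptions for
illustration only; substitute the dump and debug kernel for the crash in
question, and check crashinfo(8) and kgdb(1) on your release for exact
usage.]

    # Summarise the most recent dump; crashinfo(8) gathers stack traces,
    # message buffer, lock state, etc. into a core.txt.N file. With no
    # arguments it typically operates on the newest vmcore in /var/crash.
    crashinfo

    # Post-mortem debugging with full source and C type information:
    kgdb /boot/kernel/kernel.debug /var/crash/vmcore.0
    (kgdb) bt             # backtrace of the panicking thread
    (kgdb) info threads   # list other threads, then "thread N" + "bt"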
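
[Along the same lines, a sketch of pointing the monitoring tools at a
crashdump via the -M and -N arguments mentioned above; again, the vmcore
and kernel paths are placeholders, and available statistics vary by
release.]

    # Process list as it stood at the time of the crash:
    ps -axl -M /var/crash/vmcore.0 -N /boot/kernel/kernel.debug

    # Protocol statistics and mbuf usage at the time of the crash:
    netstat -s -M /var/crash/vmcore.0 -N /boot/kernel/kernel.debug
    netstat -m -M /var/crash/vmcore.0 -N /boot/kernel/kernel.debug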