FreeBSD Mail Archives

Date:      Fri, 13 Mar 2009 20:56:24 +1100
From:      Nick Withers <nick@nickwithers.com>
To:        Robert Watson <rwatson@FreeBSD.org>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: NICs locking up, "*tcp_sc_h"
Message-ID:  <1236938184.1490.40.camel@localhost>
In-Reply-To: <alpine.BSF.2.00.0903130935290.61873@fledge.watson.org>
References:  <1236920519.1490.30.camel@localhost> <alpine.BSF.2.00.0903130935290.61873@fledge.watson.org>


On Fri, 2009-03-13 at 09:37 +0000, Robert Watson wrote:
> On Fri, 13 Mar 2009, Nick Withers wrote:
>
> > I recently installed my first amd64 system (currently running RELENG_7 from
> > 2009-03-11) to replace an aged ppc box and have been having dramas with the
> > network locking up.
> >
> > Breaking into the debugger manually and ps-ing shows the network card (e.g.,
> > "[irq20:  fxp0+]") in state "LL" in "*tcp_sc_h". It seems the process(es)
> > trying to access the card at the time is / are in state "L" in "*tcp".
> >
> > I thought this may have been something-or-other in the fxp driver, so
> > installed an rl card and sadly ran into the issue again.
> >
> > The console appears unresponsive, but I can get into the debugger (and as
> > soon as I have, input I'd sent seems to "go through", e.g., if I hit "Enter"
> > a couple o' times, nothing happens; when I <Ctrl>+<Alt>+<Esc> into the
> > debugger a few login prompts pop up before the debugger output).
> >
> > A "where" on the fxp / rl process (thread?) gives (transcribed from the
> > console): ____
>
> Sounds like a lock leak -- if you're running INVARIANTS, then "show alllocks"
> and "show allchains" would be useful.  I've had a report of a TCP lock leak
> possibly in tcp_input(), but haven't managed to track it down yet -- this
> could well be it as well.

Righto, I'll recompile the kernel with INVARIANTS (hell, I'll go bananas
and include everything listed in
http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-deadlocks.html
- anything else I might include?).

Sorry for the original double-post, by the way, not quite sure how that
happened...

I can reproduce this problem relatively easily, by the way (every 3
days, on average). I meant to say this before, too, but it seems to
happen a lot more often on the fxp than on rl.

I'm sorry to ask what is probably a very simple question, but is there
somewhere I should look to get clues on debugging from a manually
generated dump? I tried "panic" after manually invoking the kernel
debugger but proved highly inept at getting from the dump the same
information "ps" / "where" gave me within the debugger live.

Ta for your help!

> Robert N M Watson
> Computer Laboratory
> University of Cambridge
>
>
> >
> > Tracing PID 31 tid 100030 td 0xffffff00012016e0
> > sched_switch() at sched_switch+0xf1
> > mi_switch() at mi_switch+0x18f
> > turnstile_wait() at turnstile_wait+0x1cf
> > _mtx_lock_sleep() at _mtx_lock_sleep+0x76
> > syncache_lookup() at syncache_lookup+0x176
> > syncache_expand() at syncache_expand+0x38
> > tcp_input() at tcp_input+0xa7d
> > ip_input() at ip_input+0xa8
> > ether_demux() at ether_demux+0x1b9
> > ether_input() at ether_input+0x1bb
> > fxp_intr() at fxp_intr+0x233
> > ithread_loop() at ithread_loop+0x17f
> > fork_exit() at fork_exit+0x11f
> > fork_trampoline() at fork_trampoline+0xe
> > ____
> >
> > A "where" on a process stuck in "*tcp", in this case "[swi4: clock]",
> > gave the somewhat similar:
> > ____
> >
> > sched_switch() at sched_switch+0xf1
> > mi_switch() at mi_switch+0x18f
> > turnstile_wait() at turnstile_wait+0x1cf
> > _rw_rlock() at _rw_rlock+0x8c
> > ipfw_chk() at ipfw_chk+0x3ab2
> > ipfw_check_out() at ipfw_check_out+0xb1
> > pfil_run_hooks() at pfil_run_hooks+0x9c
> > ip_output() at ip_output+0x367
> > syncache_respond() at syncache_respond+0x2fd
> > syncache_timer() at syncache_timer+0x15a
> > (...)
> > ____
> >
> > In this particular case, the fxp0 card is in a lagg with rl0, but this
> > problem can be triggered with either card on their own...
> >
> > The scheduler is SCHED_ULE.
> >
> > I'm not too sure how to give more useful information than this, I'm
> > afraid. It's a custom kernel, too... Do I need to supply information on
> > what code actually exists at the relevant addresses (I'm not at all
> > clued in on how to do this... Sorry!)? Should I chuck WITNESS,
> > INVARIANTS et al. in?
> >
> > I *think* every time this has been triggered there's been a "python2.5"
> > process in the "*tcp" state. This machine runs net-p2p/deluge and
> > generally has at least 100 TCP connections on the go at any given time.
> >
> > Can anyone give me a clue as to what I might do to track this down?
> > Appreciate any pointers.
> > -- 
> > Nick Withers
> > email: nick@nickwithers.com
> > Web: http://www.nickwithers.com
> > Mobile: +61 414 397 446
> >
-- 
Nick Withers
email: nick@nickwithers.com
Web: http://www.nickwithers.com
Mobile: +61 414 397 446

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?1236938184.1490.40.camel>
