Date: Thu, 15 Mar 2001 11:32:07 +1000 (EST)
From: Tony Griffiths <tonyg@OntheNet.com.au>
To: John Baldwin <jhb@FreeBSD.ORG>
Cc: Andrew Gallatin <gallatin@cs.duke.edu>, <alpha@FreeBSD.ORG>
Subject: Re: Deadlocks, whee!
Message-ID: <Pine.BSF.4.30.0103151055520.84004-100000@lancia.onthenet.com.au>
In-Reply-To: <XFMail.010314164809.jhb@FreeBSD.org>
On Wed, 14 Mar 2001, John Baldwin wrote:

>
> On 15-Mar-01 Andrew Gallatin wrote:
> >
> > John Baldwin writes:
> > > Hi all,
> > >
> > > I managed to deadlock my alpha yesterday with a -j 4 buildworld.  Previously it
> > > would die when it trapped with a raised IPL as a blockable mtx_lock() of lockmgr
> > > in trap().  I'm not sure if these two things are related or not.  I'll try a
> > > normal world without -j X today to see if it fairs better.  Just FYI for those
> > > running current that heavy load may deadlock right now. :(
> >
> > The machine is really deadlocked, or just one process is wedged and
> > the buildworld stalled?
>
> Well, no messages on the console, no ddb (I have vidconsole), no pings, etc.
> So interrupts aren't getting through, or if they are their threads aren't
> running, and since I use preemption on this alpha, that is very, very unlikely.
> I'm assuming it is genuinely deadlocked or possibly spinning somewhere with a
> raised IPL.

Looks like a "deadlock" to me!

Actually, I'm surprised that the 'fine-grained' SMP project in FreeBSD has
managed to get as far as it has without implementing some form of "sanity"
checking. I worked for DEC (Digital Equipment Corp) in the Networking Group
at the time Ultrix (BSD 4.2/4.3) was doing fine-grained SMP, and we had the
following sanity checks in the locking code as an aid to maintaining our own
sanity! ;-)

  1) Logging of request/release calls
  2) Lock hierarchy (i.e. take-out ordering)
  3) Spin-lock timeout (i.e. panic() after 5000000 failed attempts to gain
     the lock)
  4) something else that I can't remember 'cause it was too long ago!!!

The lock hierarchy was a BIG WIN in detecting/preventing deadlock conditions,
since it forced an order on lock acquisition, although it didn't stop
deadlocks from occurring when the locks were at the same level. The
spin-count-exceeded check picked those up. We also found a few problems on
tri/quad-cpu systems that didn't occur on dual-cpu systems.
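For anyone who hasn't seen checks (2) and (3) in practice, here is a rough user-space sketch in C11 of what they might look like. This is not the Ultrix code. The names (chk_lock, chk_acquire, chk_release), the "higher level is taken later" convention, and the MAX_HELD cap are all assumptions made for illustration, and the sketch returns an error code where a kernel would panic(). Call logging (check 1) is left out to keep it short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SPIN_LIMIT 5000000UL  /* attempt count quoted in the message */
#define MAX_HELD   16         /* hypothetical cap on locks held per thread */

enum chk_result { CHK_OK, CHK_ORDER_VIOLATION, CHK_SPIN_TIMEOUT, CHK_TOO_MANY };

struct chk_lock {
    atomic_flag busy;   /* set while the lock is held */
    int level;          /* assumed convention: higher levels are taken later */
    const char *name;   /* for diagnostics only */
};

/* Per-thread record of the locks this thread currently holds. */
static _Thread_local struct chk_lock *held[MAX_HELD];
static _Thread_local size_t nheld;

static void chk_lock_init(struct chk_lock *lk, int level, const char *name)
{
    atomic_flag_clear(&lk->busy);
    lk->level = level;
    lk->name = name;
}

static enum chk_result chk_acquire(struct chk_lock *lk)
{
    /* Hierarchy check: refuse to take a lock that sits above one we
     * already hold. Equal levels pass, so same-level deadlocks are
     * left for the spin limit to catch, as described in the message. */
    for (size_t i = 0; i < nheld; i++)
        if (held[i]->level > lk->level)
            return CHK_ORDER_VIOLATION;
    if (nheld == MAX_HELD)
        return CHK_TOO_MANY;

    /* Bounded spin: give up after SPIN_LIMIT failed attempts
     * (a kernel would panic() here instead of returning). */
    for (unsigned long tries = 0; tries < SPIN_LIMIT; tries++) {
        if (!atomic_flag_test_and_set_explicit(&lk->busy,
                                               memory_order_acquire)) {
            held[nheld++] = lk;
            return CHK_OK;
        }
    }
    return CHK_SPIN_TIMEOUT;
}

static void chk_release(struct chk_lock *lk)
{
    for (size_t i = 0; i < nheld; i++) {
        if (held[i] == lk) {
            held[i] = held[--nheld];  /* order of the record doesn't matter */
            atomic_flag_clear_explicit(&lk->busy, memory_order_release);
            return;
        }
    }
    /* Releasing a lock we don't hold is silently ignored in this sketch. */
}
```

With a "socket" lock at level 1 and a "tcp" lock at level 2, taking socket then tcp succeeds, taking tcp then socket is rejected at once by the ordering check, and a thread that tries to take tcp twice runs into the spin limit rather than hanging forever.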
Of course the amount of checking was a compile-time setting so that
production code didn't suffer too badly.

We learnt a lot of hard lessons on Ultrix, the main one being that we were
more ambitious in trying for a VERY FINE-GRAINED locking strategy (especially
in the networking code) than was warranted by any possible payback. Our
OSF/Tru-64 implementation was much cleaner, with pretty much a single lock at
each layer of the network code (e.g. socket, tcp/ip, driver). The locking
hierarchy caused a few problems between the socket layer and transport, but
we could get around them by using reference counts on objects that needed to
stick around even when there was no 'lock' on them!

Hope you have more 'fun' than we did (NOT) ...

Tony

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-alpha" in the body of the message