Date: Sun, 10 Dec 2006 13:08:36 +0000 (GMT)
From: Robert Watson
To: Tai-hwa Liang
Cc: Andrew Pantyukhin, freebsd-current@freebsd.org, yal
Subject: Re: CURRENT freezes on Laitude D520
Message-ID: <20061210084254.X9926@fledge.watson.org>
In-Reply-To: <0612101036232.41529@www.mmlab.cse.yzu.edu.tw>

On Sun, 10 Dec 2006, Tai-hwa Liang wrote:

>> which get a bit more to the heart of most problems.  debug.mpsafenet=0
>> really exists for the purposes of supporting components which are not
>> sufficiently locked to allow the stack to run MPSAFE, rather than as a
>> means of disabling direct dispatch and preemption, which speak to different
>> types of problems.
>> The main reason that I haven't removed the administrator
>> tunable to date is that I suspect it will be quite helpful when KAME IPSEC
>> locking happens, but since that appears not to have happened yet,
>> debug.mpsafenet as an option is likely causing more harm than good by being
>> available as a stand-in sysctl masking other problems, causing people to
>> not get to the point of properly identifying the actual cause (device
>> driver bugs, etc).
>
>   Can the aforementioned tricks (1/2/3) be applied to RELENG_6 as well?

WITNESS is available in RELENG_6, and should be used in combination with
INVARIANTS, DDB, KDB, and BREAK_TO_DEBUGGER to debug deadlocks.  In RELENG_6,
net.isr.direct is not enabled by default, so unless you've enabled it yourself
(or are using IP fast forwarding, which is functionally similar), that won't
apply.  In RELENG_6, PREEMPTION is in GENERIC and hence enabled by default,
and it can be disabled by removing it from your kernel configuration.  I'd
like it if we could add a run-time sysctl to disable preemption even if
PREEMPTION is compiled in, as it would make it easier to explore its stability
and performance impact.  However, this is also just a debugging step to see if
that quiesces the problem, and not a fix for the actual bug.

Right now, we're discussing removing the manual debug.mpsafenet configuration
flag from 7.x, and not 6.x.  I fully recognize the importance of having it in
place as a workaround for bugs in production, although it concerns me greatly
that we're not getting these problems debugged and fixed, and are instead
masking them.  Architectural changes are on the way that will require these
bugs to be fixed properly, not just masked.

>   We are using RELENG_6 as our production server (postfix, squid, pf
> firewall/NAT, FAST_IPSEC VPN, ...), which is a dual Athlon MP board with
> three NICs (two fxp cards and one onboard xl, connected to three different
> networks).
>
>   I haven't tried WITNESS yet; however, I'm very sure that net.isr.direct=0
> and that there is no PREEMPTION in the current kernel.  The problem is that,
> with debug.mpsafenet=1, we'll always run into a hard freeze w/o having any
> kdb> prompt on console.
>
>   Whilst turning debug.mpsafenet off only masks the real problem, I'm still
> wondering whether there is any less damaging way to track such a problem down
> in a _production_ environment.

It sounds like you need to follow the instructions for kernel debugging.
Depending on your tolerance of performance loss, downtime, etc., a good
starting point is to configure the kernel with INVARIANTS and WITNESS.
WITNESS is particularly important, if you can tolerate the performance hit, as
it warns of potential deadlocks, not just actual deadlocks.  Also, compile the
kernel with KDB, DDB, and BREAK_TO_DEBUGGER, and use a serial or firewire
console.  If the hang occurs, see if you can get into the debugger, in which
case the logged output from DDB for the following commands would be very
useful:

     show pcpu
     show allpcpu
     trace
     alltrace
     ps
     show locks
     show alllocks
     show lockedvnods
     show uma
     show malloc

Please open a PR that describes your configuration, includes your kernel
config (since it seems quite customized), any loader.conf settings, a detailed
description of the problem, and the output.  I'd be quite interested in
knowing, once the machine is in a hung state, whether the numlock light goes
on and off when you hit the numlock key on the keyboard.

Robert N M Watson
Computer Laboratory
University of Cambridge
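
For reference, the debugging options named in this message correspond to
kernel configuration lines roughly like the following.  This is a minimal
sketch, assuming a custom kernel config on RELENG_6; INVARIANT_SUPPORT is not
mentioned in the message and is listed here only because INVARIANTS depends on
it.  Everything else in your existing config is assumed to stay as it is.

```
# Debugging options for tracking down hangs and deadlocks (sketch only;
# merge into your own kernel config rather than using this as a whole file).
options 	KDB			# kernel debugger framework
options 	DDB			# interactive in-kernel debugger backend
options 	BREAK_TO_DEBUGGER	# a console break drops into the debugger
options 	INVARIANTS		# extra run-time consistency checks
options 	INVARIANT_SUPPORT	# support code required by INVARIANTS
options 	WITNESS			# lock order verification
```

WITNESS and INVARIANTS both cost performance, so on a production machine it
may be worth enabling them only for the period in which the hang is being
reproduced.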