From: Tai-hwa Liang
To: Robert Watson
Cc: Andrew Pantyukhin, freebsd-current@freebsd.org, yal
Date: Sun, 10 Dec 2006 11:11:40 +0800 (CST)
Subject: Re: CURRENT freezes on Laitude D520
Message-ID: <0612101036232.41529@www.mmlab.cse.yzu.edu.tw>
In-Reply-To: <20061209214233.L2273@fledge.watson.org>
References: <52944.192.168.1.110.1165679313.squirrel@yal.hopto.org>
 <20061209195519.B60055@mp2.macomnet.net>
 <20061209204924.N9926@fledge.watson.org>
 <20061209214233.L2273@fledge.watson.org>
List-Id: Discussions about the use of FreeBSD-current

On Sat, 9 Dec 2006, Robert Watson wrote:
[...]
> Right now, setting debug.mpsafenet=1 has three effects:
>
> (1) Place Giant over the network stack, creating a single lock that spans
>     the entire stack, preventing parallelism, as well as acting as a
>     "master" lock which implicitly prevents lock order-related deadlocks
>     in the stack.
>
> (2) Effectively disabling preemption in the network stack, as ithreads and
>     the netisr will be unable to start running until user threads exit the
>     stack, regardless of priority.
>
> (3) Effectively disable direct dispatch, as non-MPSAFE netisr handlers are
>     always deferred rather than executing in the ithread context.
>
> I suspect that many of the people setting debug.mpsafenet=1 and declaring
> the problem fixed are seeing the change due to (2) and (3), indirect rather
> than direct effects of (1).  I would much rather people experimented with:
>
> - Disabling direct dispatch (net.isr.direct=0)
>
> - Disabling preemption (compiling out options PREEMPTION)
>
> - Running with WITNESS, which reports lock order reversals.
>
> which get a bit more to the heart of most problems.  debug.mpsafenet=1
> really exists for the purposes of supporting components which are not
> sufficiently locked to allow the stack to run MPSAFE, rather than as a
> means of disabling direct dispatch and preemption, which speak to
> different types of problems.  The main reason that I haven't removed the
> administrator tunable to date is that I suspect it will be quite helpful
> when KAME IPSEC locking happens, but since that appears not to have
> happened yet, debug.mpsafenet as an option is likely causing more harm
> than good by being available as a stand-in sysctl masking other problems,
> causing people to not get to the point of properly identifying the actual
> cause (device driver bugs, etc).

   Can the aforementioned tricks (1/2/3) be applied to RELENG_6 as well?
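For reference, the three experiments suggested in the quoted text correspond roughly to the settings below. This is a sketch, not a tested configuration. net.isr.direct, PREEMPTION and WITNESS are the names used in the quoted text. KDB, DDB and WITNESS_SKIPSPIN are additions assumed from the stock -CURRENT GENERIC kernel configuration, and whether all of these behave the same way on RELENG_6 is exactly the open question.

```
# /etc/sysctl.conf (can also be changed at runtime with sysctl(8))
net.isr.direct=0                  # defer packets to the netisr; no direct dispatch

# Kernel configuration file (takes effect only after rebuilding the kernel)
#options    PREEMPTION            # commented out, i.e. preemption compiled out
options     WITNESS               # report lock order reversals
options     WITNESS_SKIPSPIN      # assumption: skip spin locks to reduce overhead
options     KDB                   # assumption: debugger framework for a kdb> prompt
options     DDB                   # assumption: interactive debugger backend
```

WITNESS adds noticeable locking overhead, so on a production machine it is probably best enabled only for a limited test window.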
We are running RELENG_6 on our production server (postfix, squid, pf
firewall/NAT, FAST_IPSEC VPN, ...), which is a dual Athlon MP board with
three NICs (two fxp cards and one onboard xl, connected to three different
networks).

   I haven't tried WITNESS yet; however, I'm quite sure that net.isr.direct=0
is set and that the current kernel is built without PREEMPTION.

   The problem is that, with debug.mpsafenet=1, we always run into a hard
freeze without getting any kdb> prompt on the console.  Whilst turning
debug.mpsafenet off only masks the real problem, I'm still wondering whether
there is a less disruptive way to track such a problem down in a _production_
environment.

-- 
Thanks,

Tai-hwa Liang