From: Guy Helmer
Date: Thu, 22 Feb 2007 15:54:24 -0600
To: Rink Springer
Cc: stable@freebsd.org, roel@qsp.nl
Subject: Re: Deadlock in state 'sysctl lock'

Rink Springer wrote:
> Hi people,
>
> At work, one of our SpamAssassin/ClamAV filtering machines just entered
> a deadlock state:
>
> FreeBSD/i386 (xxx.qsp.nl) (cuad0)
>
> login: root
> load: 0.00 cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
> load: 0.00 cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
> load: 0.00 cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
> load: 0.00 cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
> load: 0.00 cmd: login 683 [sysctl lock] 0.00u 0.00s 0% 148k
>
> After inspection, I believe the following code in
> kern/kern_sysctl.c:userland_sysctl() is the culprit:
>
>         SYSCTL_LOCK();
>
>         do {
>                 req.oldidx = 0;
>                 req.newidx = 0;
>                 error = sysctl_root(0, name, namelen, &req);
>         } while (error == EAGAIN);
>
>         if (req.lock == REQ_WIRED && req.validlen > 0)
>                 vsunlock(req.oldptr, req.validlen);
>
>         SYSCTL_UNLOCK();
>
> Clearly, if sysctl_root() kept returning EAGAIN, this loop would spin
> forever with the sysctl lock held, causing a serious deadlock condition.
> It appears this is possible.
>
> The only plausible place where a sysctl handler returns EAGAIN seems to
> be kern/kern_proc.c:sysctl_out_proc(). However, that code returns ESRCH
> if the process cannot be found in the first place. Since sysctl_root()
> calls the complete handler function on every iteration, the handler
> redoes the pfind() each time and returns ESRCH if it fails, not the
> EAGAIN it returns later on in the code path.
>
> The machine is a 6.0-STABLE SMP machine from 30-Mar-2006. No debugging
> options are in the kernel, as the machine is under considerable load.
> The only console messages were a large number of 'calcru' messages.
>
> Any help is very much appreciated. For now, I'd like to propose a change
> to kern/kern_sysctl.c:userland_sysctl() to ensure it can never keep
> looping on EAGAIN (preferably, it should trigger a panic, or at least a
> KASSERT, should such a condition occur). I know this is a band-aid for a
> problem we don't really understand yet, but it may ease debugging later
> on (especially as it will help us understand where exactly things go
> wrong).
>
> Any comments? It looks to me like this deadlock is quite rare (in fact,
> I've never seen it before), but I believe it is serious enough to be
> addressed, even with such a band-aid, until the real solution is
> presented by someone who knows the sysctl internals better than I do.
>

Interesting. Twice I have had a 6.2 system stuck where sendmail was
holding the sysctl lock while another process was holding the proctree
and/or allproc lock, if I remember correctly.

Guy