From: John Baldwin
To: freebsd-stable@freebsd.org
Cc: pluknet
Date: Tue, 16 Jun 2009 08:30:29 -0400
Subject: Re: 6.2 sporadically locks up
Message-Id: <200906160830.29721.jhb@freebsd.org>

On Tuesday 16 June 2009 6:23:47 am pluknet wrote:
> Hi all.
>
> This is one of livelocks we have on a weekly basis.
> Yes, we do still use ULE scheduler on 6.2 and not moved to 7 yet.
> Any thought?
>
> db> ps
>   pid  ppid  pgrp   uid  state  wmesg   wchan       cmd
> 70304 69700 69670     0  R                          sh
> 70303 70292 93818  3572  RL     CPU 2               chrsh
> 70302 70294 93818  3572  R                          crond
> 70299 93818 93818     0  R      CPU 1               crond
> 70298 93818 93818     0  R                          crond
> 70294 93818 93818  3572  S      piperd  0xd1d8d330  crond
> 70292 93818 93818  3572  R                          crond
> 70284 70279 70040 10229  S      biord   0xdbe2e4e8  perl5.8.8
> 70283 70278 93818 10229  SL     biord   0xdbd70710  exim-4.63-0
> 70279 70040 70040 10229  S      wait    0xc9005860  sh
> 70278 69996 93818 10229  S      wait    0xcaf4ac90  sh
> 70191  4680  4680  9738  S      select  0xc0a12944  httpd
> 70190  4796  4796 10008  R                          httpd
> 70188  5043  5043 30532  RL                         httpd
> 70043 69999 70043  3572  Ss     select  0xc0a12944  wget
> 70042 70000 70042  3572  Ss     select  0xc0a12944  wget
> 70041 70001 70041  3572  Ss     select  0xc0a12944  wget
> 70040 69996 70040 10229  Ss     piperd  0xca35e990  perl5.8.8
> 70039 70002 70039  3572  Ss     select  0xc0a12944  wget

This is not a full listing so one cannot assume it is a deadlock.

> db> show lockchain Giant
> thread -3420549 (pid 434, ) ??? (0xc099cb0c)

You would use 'show lock' or perhaps 'show turnstile' with specific lock
variables.  'show lockchain' needs a TID or PID.

> db> show allpcpu
> cpuid        = 0
> curthread    = 0xc7cfec80: pid 18 "swi4: clock sio"
>
> cpuid        = 1
> curthread    = 0xc99f9960: pid 70299 "crond"
>
> cpuid        = 2
> curthread    = 0xc99f9af0: pid 70303 "chrsh"
>
> cpuid        = 3
> curthread    = 0xd087d320: pid 69700 "sh"
>
> cpuid        = 4
> curthread    = 0xc98f84b0: pid 69604 "httpd"
>
> cpuid        = 5
> curthread    = 0xcaebe190: pid 69598 "httpd"
>
> cpuid        = 6
> curthread    = 0xc7cfe960: pid 27 "irq17: bce1 aacu0"
>
> cpuid        = 7
> curthread    = 0xc837fe10: pid 69711 "arcconf"

This is far more useful output than the truncated 'ps'.  From this, all of
the CPUs are busy (in at least some deadlocks, all the CPUs would be idle
instead).  There are several deadlocks fixed since 6.2 that I am aware of,
but this doesn't look like any of those.
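
As a rough illustration of the busy-versus-idle check described above, here
is a minimal Python sketch that reads pasted 'show allpcpu' output and lists
any CPUs whose current thread looks like an idle thread.  The function names
(parse_allpcpu, idle_cpus) are hypothetical, and the assumption that idle
threads have names starting with "idle" is mine, not something stated in
this thread; adjust the prefix to match what your kernel reports.

```python
import re

# Matches lines such as: curthread = 0xc7cfec80: pid 18 "swi4: clock sio"
CURTHREAD_RE = re.compile(
    r'curthread\s*=\s*0x[0-9a-fA-F]+:\s*pid\s+(\d+)\s+"([^"]*)"')
CPUID_RE = re.compile(r'cpuid\s*=\s*(\d+)')

# Sample copied from the 'show allpcpu' output quoted in this message.
SAMPLE = """\
> cpuid = 0
> curthread = 0xc7cfec80: pid 18 "swi4: clock sio"
>
> cpuid = 1
> curthread = 0xc99f9960: pid 70299 "crond"
>
> cpuid = 2
> curthread = 0xc99f9af0: pid 70303 "chrsh"
>
> cpuid = 3
> curthread = 0xd087d320: pid 69700 "sh"
>
> cpuid = 4
> curthread = 0xc98f84b0: pid 69604 "httpd"
>
> cpuid = 5
> curthread = 0xcaebe190: pid 69598 "httpd"
>
> cpuid = 6
> curthread = 0xc7cfe960: pid 27 "irq17: bce1 aacu0"
>
> cpuid = 7
> curthread = 0xc837fe10: pid 69711 "arcconf"
"""


def parse_allpcpu(text):
    """Return {cpuid: (pid, thread_name)} from 'show allpcpu' output."""
    cpus = {}
    cpu = None
    for line in text.splitlines():
        # Drop any mail quoting ("> ") and surrounding whitespace.
        line = line.lstrip('> ').strip()
        m = CPUID_RE.search(line)
        if m:
            cpu = int(m.group(1))
            continue
        m = CURTHREAD_RE.search(line)
        if m and cpu is not None:
            cpus[cpu] = (int(m.group(1)), m.group(2))
    return cpus


def idle_cpus(cpus, idle_prefix="idle"):
    """Return the sorted CPU ids whose current thread name starts with
    idle_prefix (assumed naming convention for idle threads)."""
    return sorted(c for c, (_, name) in cpus.items()
                  if name.startswith(idle_prefix))


if __name__ == "__main__":
    cpus = parse_allpcpu(SAMPLE)
    print("CPUs seen:", len(cpus))
    print("idle CPUs:", idle_cpus(cpus))
```

An empty idle list only says every CPU was running something when DDB was
entered; it does not say what those threads were doing.
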
I'm not sure why you aren't getting useful stack traces of running threads.
Perhaps DDB in 6.2 doesn't know to look in stoppcbs[].  Hmm, looks like 6.2
only does that if you are using KDB_STOP_NMI.  Are you using that kernel
option?  If not, you probably want to.

-- 
John Baldwin
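
For reference, a kernel option is normally enabled with an 'options' line in
the kernel configuration file, followed by a kernel rebuild and reboot.  The
fragment below is only a sketch: the option name comes from the message
above, and whether it is available for a given architecture in 6.2 is an
assumption that should be checked against the NOTES file for that platform.

```
# Option named in the message above; per that message, DDB in 6.2 only
# looks in stoppcbs[] when this is set.  Availability on your platform
# is an assumption; check NOTES first.
options 	KDB_STOP_NMI
```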