From: Guy Helmer
Date: Fri, 09 Jan 2009 08:49:42 -0600
To: Pete French, freebsd-stable@freebsd.org
Subject: Re: Big problems with 7.1 locking up :-(
Message-ID: <49676406.9050902@palisadesys.com>

Pete French wrote:
> I have a number of HP 1U servers, all of which were running 7.0
> perfectly happily. I have been testing 7.1 in its various incarnations
> for the last couple of months on our test server and it has performed
> perfectly.
>
> So the last two days I have been round upgrading all our servers, knowing
> that I had run the system stably on identical hardware for some time.
>
> Since then I have started seeing machines lock up. This always happens under
> heavy disc load. When I bring the machine back up then sometimes it fails
> to fsck due to a partially truncated inode. The lockups appear to
> be disc related - on my mysql master machine it will come back up with
> files somewhat shorter than those which have already been transmitted to
> the slave (i.e. some data was in memory, and claimed to have been written
> to the drive, but never made it onto the disc).
>
> The only time I have seen anything useful on the screen was during one lockup
> where I got a message about a spin lock being held too long and some
> comment in parentheses about it being a turnstile lock.
>
> Help! :-(
>
> I am now downgrading all the machines to 7.0 as fast as I can - though the
> machine I am trying to compile it on has locked up once during the compile
> so I haven't got anywhere so far.
>
> The machines are HP Proliant DL360 G5s - they have an embedded P400i
> RAID controller with a pair of mirrored drives connected. Each one has
> both ethernets connected, bundled using lagg and LACP.
>

I can't tell whether my situation is related, but I am seeing lockups on SMP Supermicro servers with both older (NetBurst-ish) and current Xeon CPUs. I have been dropping into the kernel debugger and getting lock information and process backtraces, but so far nothing has been conclusively identified.
I think the issue I'm seeing was introduced sometime between October 2 and November 24 in the RELENG_7 branch, and I suppose the next step is to do a binary search for the offending change.

Guy

--
Guy Helmer, Ph.D.
Chief System Architect
Palisade Systems, Inc.
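
For illustration, the binary search over RELENG_7 snapshot dates mentioned above could be driven by something like the following Python sketch. It is only a sketch under stated assumptions: `first_bad_date`, `steps_needed`, and the `is_bad` callback are hypothetical names, the year 2008 is inferred from the posting date, and `is_bad` stands in for the manual work of checking out the tree as of a given date, building it, and running the disc load until the machine either locks up or survives. It also assumes that every snapshot after the offending change misbehaves; with an intermittent lockup, a "good" verdict is only as reliable as the length of the test run.

```python
from datetime import date, timedelta
from typing import Callable


def first_bad_date(good: date, bad: date, is_bad: Callable[[date], bool]) -> date:
    """Return the earliest date in (good, bad] whose snapshot misbehaves.

    Assumes `good` is known good, `bad` is known bad, and that every
    snapshot after the offending change is bad (monotonic behaviour).
    """
    if good >= bad:
        raise ValueError("good must be earlier than bad")
    lo, hi = good, bad  # invariant: lo is known good, hi is known bad
    while (hi - lo).days > 1:
        mid = lo + timedelta(days=(hi - lo).days // 2)
        if is_bad(mid):
            hi = mid  # the change landed on or before mid
        else:
            lo = mid  # the change landed after mid
    return hi


def steps_needed(good: date, bad: date) -> int:
    """Upper bound on how many snapshots must be built and tested."""
    return max(0, ((bad - good).days - 1).bit_length())


if __name__ == "__main__":
    # Assumed year: the post is dated January 2009, so the window is taken as 2008.
    good, bad = date(2008, 10, 2), date(2008, 11, 24)
    print("snapshots to test, at most:", steps_needed(good, bad))

    # Stand-in predicate for demonstration only; the culprit date is made up.
    pretend_culprit = date(2008, 11, 3)
    print(first_bad_date(good, bad, lambda d: d >= pretend_culprit))
```

The point of the sketch is the cost estimate: the 53-day window narrows to a single day in at most six build-and-test cycles, after which the commits from that day can be examined individually.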