Date: Fri, 09 Jan 2009 08:49:42 -0600
From: Guy Helmer <ghelmer@palisadesys.com>
To: Pete French <petefrench@ticketswitch.com>, freebsd-stable@freebsd.org
Subject: Re: Big problems with 7.1 locking up :-(
Message-ID: <49676406.9050902@palisadesys.com>
In-Reply-To: <E1LL6dg-0007CN-DI@dilbert.ticketswitch.com>
References: <E1LL6dg-0007CN-DI@dilbert.ticketswitch.com>
Pete French wrote:
> I have a number of HP 1U servers, all of which were running 7.0
> perfectly happily. I have been testing 7.1 in its various incarnations
> for the last couple of months on our test server and it has performed
> perfectly.
>
> So the last two days I have been round upgrading all our servers, knowing
> that I had run the system stably on identical hardware for some time.
>
> Since then I have started seeing machines lock up. This always happens under
> heavy disc load. When I bring the machine back up then sometimes it fails
> to fsck due to a partially truncated inode. The lockups appear to
> be disc related - on my mysql master machine it will come back up with
> files somewhat shorter than those which have already been transmitted to
> the slave (i.e. some data was in memory, and claimed to have been written
> to the drive, but never made it onto the disc).
>
> The only time I have seen anything useful on the screen was during one lockup
> where I got a message about a spin lock being held too long and some
> comment in parentheses about it being a turnstile lock.
>
> Help! :-(
>
> I am now downgrading all the machines to 7.0 as fast as I can - though the
> machine I am trying to compile it on has locked up once during the compile
> so I haven't got anywhere so far.
>
> The machines are HP Proliant DL360 G5s - they have an embedded P400i
> RAID controller with a pair of mirrored drives connected. Each one has
> both ethernets connected, bundled using lagg and LACP.
>

I can't tell whether my situation is related, but I am seeing lockups on SMP Supermicro servers with both older (NetBurst-ish) and current Xeon CPUs. I have been dropping into the kernel debugger and getting lock information and process backtraces, but so far nothing has been conclusively identified.

I think the issue I'm seeing was introduced sometime between October 2 and November 24 in the RELENG_7 branch, and I suppose the next step is to do a binary search for the offending change.

Guy

--
Guy Helmer, Ph.D.
Chief System Architect
Palisade Systems, Inc.
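
For illustration only, the binary search over the October 2 to November 24 window mentioned above might be sketched as below. This is not a script from the thread. The year 2008 is inferred from the message date, and `locks_up` is a hypothetical stand-in for the manual step of building RELENG_7 as of a given date and checking whether the machine hangs under load. The sketch also assumes that every snapshot after the offending change shows the lockup.

```python
from datetime import date, timedelta
from typing import Callable


def first_bad_date(good: date, bad: date,
                   locks_up: Callable[[date], bool]) -> date:
    """Return the earliest snapshot date for which locks_up() is true.

    Assumes `good` is known not to lock up, `bad` is known to lock up,
    and every snapshot after the offending change also locks up.
    """
    if good >= bad:
        raise ValueError("good must be earlier than bad")
    # Halve the window until the two dates are adjacent days.
    while (bad - good).days > 1:
        mid = good + timedelta(days=(bad - good).days // 2)
        if locks_up(mid):
            bad = mid    # the change landed on or before mid
        else:
            good = mid   # the change landed after mid
    return bad


if __name__ == "__main__":
    # Hypothetical test predicate: pretend the change landed on 2008-11-03.
    # In practice this would be a build-and-stress-test cycle done by hand.
    pretend_change = date(2008, 11, 3)
    result = first_bad_date(date(2008, 10, 2), date(2008, 11, 24),
                            lambda d: d >= pretend_change)
    print(result)
```

With a 53-day window, this approach needs roughly six build-and-test cycles rather than one per day, provided the lockup reproduces reliably on each bad snapshot.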
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?49676406.9050902>