Date: Thu, 12 Feb 2009 12:16:09 -0600
From: Guy Helmer <ghelmer@palisadesys.com>
To: Pete French <petefrench@ticketswitch.com>, freebsd-stable@freebsd.org
Subject: Re: Big problems with 7.1 locking up :-(
Message-ID: <49946769.1040009@palisadesys.com>
In-Reply-To: <49676406.9050902@palisadesys.com>
References: <E1LL6dg-0007CN-DI@dilbert.ticketswitch.com> <49676406.9050902@palisadesys.com>

Guy Helmer wrote:
> Pete French wrote:
>> I have a number of HP 1U servers, all of which were running 7.0
>> perfectly happily. I have been testing 7.1 in its various incarnations
>> for the last couple of months on our test server and it has performed
>> perfectly.
>>
>> So the last two days I have been round upgrading all our servers,
>> knowing that I had run the system stably on identical hardware for
>> some time.
>>
>> Since then I have started seeing machines lock up. This always happens
>> under heavy disc load. When I bring the machine back up then sometimes
>> it fails to fsck due to a partially truncated inode. The lockups appear
>> to be disc related - on my mysql master machine it will come back up
>> with files somewhat shorter than those which have already been
>> transmitted to the slave (i.e. some data was in memory, and claimed to
>> have been written to the drive, but never made it onto the disc).
>>
>> The only time I have seen anything useful on the screen was during
>> one lockup where I got a message about a spin lock being held too long
>> and some comment in parentheses about it being a turnstile lock.
>>
>> Help! :-(
>>
>> I am now downgrading all the machines to 7.0 as fast as I can - though
>> the machine I am trying to compile it on has locked up once during the
>> compile so I haven't got anywhere so far.
>>
>> The machines are HP Proliant DL360 G5s - they have an embedded P400i
>> RAID controller with a pair of mirrored drives connected. Each one has
>> both ethernets connected, bundled using lagg and LACP.
>>
> I can't tell whether my situation is related, but I am seeing lockups
> on SMP Supermicro servers with both older (NetBurst-ish) and current
> Xeon CPUs. I have been dropping into the kernel debugger and getting
> lock information and process backtraces, but so far nothing has been
> conclusively identified.
> I think the issue I'm seeing was introduced
> sometime between October 2 and November 24 in the RELENG_7 branch, and
> I suppose the next step is to do a binary search for the offending
> change.
>
> Guy
>

FWIW, I think I have tracked down the changes just prior to 7.1-RELEASE
that are causing my Supermicro dual Xeon machines to wedge.

I did the binary search between 2008-10-02 and 2008-11-24 without
reproducing any lockups, and then I went on to search between 2008-11-24
and 2009-01-04. An SMP kernel built from 2008-12-22 (r186409) sources was
stable for over two weeks; a kernel built from 2008-12-29 (r186590)
sources wedged in under 24 hours under moderate load.

It appears that the significant changes between r186409 and r186590 were
r186552 (delphij - reverted ATA changes) and r186535/r186534 (delphij -
reverted bce changes). My machines don't have bce interfaces, so I
suspect the ATA changes.

Any thoughts?

Thanks,
Guy

--
Guy Helmer, Ph.D.
Chief System Architect
Palisade Systems, Inc.
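
For readers who want to repeat this kind of search, the binary search over revisions described above can be sketched roughly as follows. This is only an illustration, not the procedure Guy actually ran: the function name first_bad_revision and the is_bad callback are hypothetical, and the sketch assumes that once a revision wedges, every later revision wedges too. In practice is_bad would mean "build a kernel from this revision and run it under load", which, given that the r186409 kernel needed over two weeks to be judged stable, is slow and can miss an intermittent lockup.

```python
from typing import Callable, Sequence


def first_bad_revision(revisions: Sequence[int],
                       is_bad: Callable[[int], bool]) -> int:
    """Return the earliest revision for which is_bad() is true.

    Assumptions (hypothetical helper, not from the original mail):
    revisions is sorted ascending, revisions[0] is known good,
    revisions[-1] is known bad, and every revision after the first
    bad one is also bad.
    """
    if len(revisions) < 2:
        raise ValueError("need at least one good and one bad revision")
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # the breakage is at mid or earlier
        else:
            lo = mid  # the breakage is after mid
    return revisions[hi]


# Revisions named in the mail: r186409 was stable, r186590 wedged.
candidates = [186409, 186534, 186535, 186552, 186590]

# Stand-in predicate that merely encodes the suspicion about the ATA
# revert (r186552); a real test would boot and load-test each kernel.
suspected = 186552
print(first_bad_revision(candidates, lambda r: r >= suspected))
```

With five candidates the search needs only two test builds, which matters when each "good" verdict takes days of soak time.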
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?49946769.1040009>