Date: Mon, 7 Feb 2011 21:52:39 -0800
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Greg Bonett <greg@bonett.org>
Cc: freebsd-stable <freebsd-stable@freebsd.org>
Subject: Re: 8.1 amd64 lockup (maybe zfs or disk related)
Message-ID: <20110208055239.GA2557@icarus.home.lan>
In-Reply-To: <1297143276.9417.400.camel@ubuntu>
References: <1297026074.23922.8.camel@ubuntu> <20110207045501.GA15568@icarus.home.lan> <1297065041.754.12.camel@ubuntu> <20110207085537.GA20545@icarus.home.lan> <1297143276.9417.400.camel@ubuntu>
On Mon, Feb 07, 2011 at 09:34:36PM -0800, Greg Bonett wrote:
> Thank you for the help. I've implemented your
> suggested /boot/loader.conf and /etc/sysctl.conf tunings.
> Unfortunately, after implementing these settings, I experienced another
> lockup. And by "lockup" I mean nothing responding (sshd, keyboard, num
> lock) - had to reset.
>
> I'm trying to isolate the cause of these lockups. I rebooted the system
> and tried to simulate a high-load condition WITHOUT mounting my zfs pool.
> First I ran many instances of "dd if=/dev/random of=/dev/null bs=4m" to
> get high CPU load. The machine ran for many hours under this condition
> without lockup. Then I added a few "dd if=/dev/adX of=/dev/null bs=4m"
> to simulate some io load. After doing this it locked up immediately.
> Thinking I had figured out the source of the problem, I rebooted and
> tried to replicate this experience but was not able to. So far it has
> been running for two hours with six "dd if=/dev/adX" commands (one for
> each disk) and about a dozen "dd if=/dev/urandom" commands (to keep cpu
> near 100%). I'll let it keep running and see if it locks again without
> ever mounting zfs.
>
> Any ideas?

The NumLock LED not toggling is a pretty good indicator of a
hardware-level problem. An extra test would be to rebuild your kernel
with debugging enabled so that when the machine locks, you can try
pressing Ctrl-Alt-Esc at the VGA console and see if you drop to a db>
prompt. If so, that means the machine is actually alive (well, the
kernel anyway).
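
For reference, a debugging kernel of the kind described above could be configured roughly as follows. This is only a minimal sketch, assuming an amd64 system with sources in /usr/src and GENERIC as the base; the config name DEBUG is a hypothetical choice, and KDB and DDB are the stock FreeBSD debugger options, not anything specific to this machine.

```
# /usr/src/sys/amd64/conf/DEBUG   (hypothetical config name)
include GENERIC          # start from the stock kernel config
ident   DEBUG

options KDB              # kernel debugger framework
options DDB              # interactive in-kernel debugger (the db> prompt)
```

Building and installing would then be "make buildkernel KERNCONF=DEBUG" followed by "make installkernel KERNCONF=DEBUG" from /usr/src. After rebooting, "sysctl debug.kdb.enter=1" on a healthy system should drop to db> (type "c" to continue), which confirms the debugger is compiled in before waiting for the next lockup.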

As for causes: you could have bad memory (memtest86+ is a decent free
test, but not infallible), you could have a PSU that doesn't hold
decent voltage ranges on its 3.3V, 5V, or 12V lines, you could have a
PSU that doesn't provide enough power for all the devices connected to
it, you could have a bad motherboard, your CPU could be overheating,
you could be encountering a strange hardware/silicon bug, there could
be a small or thin slice of metal lying across a single trace on the
motherboard, etc... The list is enormous. Hardware problems often
require a person to spend a lot of time and money, replacing a single
part at a time, until the problem goes away.

The only thing we know for sure at this point is that your Western
Digital drive behaves erratically with regard to excessive load
cycling. That is almost certainly the reason for your READ_DMA48
errors. So, you may actually be experiencing two separate issues at
the same time. It's hard to tell at this point.

In the meantime, can you please provide output from "dmesg" after the
machine comes up? I'm curious to know what sort of hardware is in this
machine, especially with regard to its storage controller.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |