From: Jeremy Chadwick
To: Greg Bonett
Cc: freebsd-stable@freebsd.org
Date: Mon, 7 Feb 2011 22:46:33 -0800
Subject: Re: 8.1 amd64 lockup (maybe zfs or disk related)
Message-ID: <20110208064633.GA3367@icarus.home.lan>
In-Reply-To: <1297145806.9417.413.camel@ubuntu>

On Mon, Feb 07, 2011 at 10:16:46PM -0800, Greg Bonett wrote:
> ok, I will start trying to
> locate the cause of the problem. I've
> attached my dmesg output after boot. I'm currently downloading a liveCD
> to run memtest from. When you say "rebuild your kernel with debugging
> enabled" do you mean add the "makeoptions DEBUG=-g" option to my
> kernel config and rebuild?

No, but that would be a useful addition as well, assuming you have the
disk space on your root filesystem for modules/kernel with debugging
symbols.

These are the options you want to add to your kernel config:

# Debugging options
options         BREAK_TO_DEBUGGER       # Sending a serial BREAK drops to DDB
options         KDB                     # Enable kernel debugger support
options         KDB_TRACE               # Print stack trace automatically on panic
options         DDB                     # Support DDB
options         GDB                     # Support remote GDB

Documented here:

http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-options.html

> Also, I'll start logging my cpu temp and I'll see if it peaks before a
> lockup. (I have had one of six cores disabled thinking this might
> prevent overheating)

Unlikely. Present-day operating systems (including Windows for that
matter) are pretty good about halting processors (cores) which aren't in
use/aren't needed, which greatly helps with diminishing power usage and
temperatures. Each CPU model is different, so you'd have to find
someone with an AMD Phenom II X6 1075T CPU and compare thermals.

> Thank you for your help talking me through this.
>
> I've attached my dmesg output as dmesg.log.

Let's look at your storage controller setup:

atapci0: irq 18
atapci1: on atapci0
ata2: on atapci1
ata3: on atapci1
ata4: on atapci0
atapci2: irq 19
atapci2: AHCI v1.20 controller with 4 6Gbps ports, PM supported
ata5: on atapci2
ata6: on atapci2
ata7: on atapci2
ata8: on atapci2
atapci3: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xff00-0xff0f at device 20.1 on pci0
ata0: on atapci3
ata1: on atapci3

There have been recent discussions about "problems" on the ATI
IXP700/800 controllers.
I do not buy AMD systems, so I can't comment on these controllers'
reliability. Just an FYI point. Here's the thread:

http://lists.freebsd.org/pipermail/freebsd-stable/2011-February/thread.html#61348

I also tend to avoid JMicron controllers like the plague. I've seen too
many problem reports with them over the years, regardless of OS.

Now for the disk layout (I'm excluding da0, which is a USB flash disk of
some kind).

ad0: 953869MB at ata0-master UDMA133 SATA
ad1: 953869MB at ata0-slave UDMA133 SATA
ad4: 1430799MB at ata2-master UDMA100 SATA 3Gb/s
ad8: 15279MB at ata4-master UDMA66
acd0: CDRW at ata4-slave UDMA33
ad10: 953869MB at ata5-master UDMA100 SATA 3Gb/s
ad12: 953869MB at ata6-master UDMA100 SATA 3Gb/s
ad14: 953869MB at ata7-master UDMA100 SATA 3Gb/s
ad16: 953869MB at ata8-master UDMA100 SATA 3Gb/s

You have a very large number of hard disks in this machine, so I sure
hope you do have a decent enough PSU to handle it all.

If I had to make a recommendation, it would be to decrease the number of
hard disks in the machine. You have 8 of them -- one of which may be a
RAM drive or something similar -- and that isn't including your CDRW
drive. I would also try getting rid of the JMicron controller; I would
recommend investing in a Silicon Image controller to replace it,
specifically one driven by the 3124, 3132, or 3531 chips. Avoid the
3112, 3114, and 3512 chips:

http://en.wikipedia.org/wiki/Silicon_Image#Product_alerts

Next we have this:

> ad1: TIMEOUT - READ_DMA retrying (1 retry left) LBA=1
> GEOM: ad1: partition 1 does not start on a track boundary.
> GEOM: ad1: partition 1 does not end on a track boundary.
> GEOM: label/1TBdisk5: partition 1 does not start on a track boundary.
> GEOM: label/1TBdisk5: partition 1 does not end on a track boundary.

This doesn't look good, especially the READ_DMA timeout on ad1. That's
a different disk than the one you told me about before.
LBA 1 is literally the 2nd block on the disk, which is a little too
close to block 0 for comfort. I'd love to see "smartctl -a /dev/ad1"
output here.

> calcru: runtime went backwards from 82 usec to 70 usec for pid 20 (flowcleaner)
> calcru: runtime went backwards from 363 usec to 317 usec for pid 8 (pagedaemon)
> calcru: runtime went backwards from 111 usec to 95 usec for pid 7 (xpt_thrd)
> calcru: runtime went backwards from 1892 usec to 1629 usec for pid 1 (init)
> calcru: runtime went backwards from 6786 usec to 6591 usec for pid 0 (kernel)

This is a problem that has plagued FreeBSD for some time. It's usually
caused by EIST (est) being used, but that's on Intel platforms. AMD has
something similar called Cool'n'Quiet (see cpufreq(4) man page). Are
you running powerd(8) on this system? If so, try disabling that and see
if these go away.

> GEOM_ELI: Device label/1tbgreendisk.eli created.
> GEOM_ELI: Encryption: AES-CBC 256
> GEOM_ELI: Crypto: software
> {...}

There was no mention of geli(8) being used on this system until now.
There may be other complexities as a result of this; I don't know.

Good luck.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |