Date: Mon, 7 Feb 2011 22:46:33 -0800
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Greg Bonett <greg@bonett.org>
Cc: freebsd-stable@freebsd.org
Subject: Re: 8.1 amd64 lockup (maybe zfs or disk related)
Message-ID: <20110208064633.GA3367@icarus.home.lan>
In-Reply-To: <1297145806.9417.413.camel@ubuntu>
References: <1297026074.23922.8.camel@ubuntu> <20110207045501.GA15568@icarus.home.lan> <1297065041.754.12.camel@ubuntu> <20110207085537.GA20545@icarus.home.lan> <1297143276.9417.400.camel@ubuntu> <20110208055239.GA2557@icarus.home.lan> <1297145806.9417.413.camel@ubuntu>

On Mon, Feb 07, 2011 at 10:16:46PM -0800, Greg Bonett wrote:
> ok, I will start trying to locate the cause of the problem. I've
> attached my dmesg output after boot. I'm currently downloading a liveCD
> to run memtest from. When you say "rebuild your kernel with debugging
> enabled" do you mean add the "makeoptions DEBUG=-g" option to my
> kernel config and rebuild?

No, but that would be a useful addition as well, assuming you have the
disk space on your root filesystem for modules/kernel with debugging
symbols.

These are the options you want to add to your kernel config:

# Debugging options
options     BREAK_TO_DEBUGGER   # Sending a serial BREAK drops to DDB
options     KDB                 # Enable kernel debugger support
options     KDB_TRACE           # Print stack trace automatically on panic
options     DDB                 # Support DDB
options     GDB                 # Support remote GDB

Documented here:

http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-options.html

> Also, I'll start logging my cpu temp and I'll see if it peaks before a
> lockup. (I have had one of six cores disabled thinking this might
> prevent overheating)

Unlikely. Present-day operating systems (including Windows for that
matter) are pretty good about halting processors (cores) which aren't in
use/aren't needed, which greatly helps with diminishing power usage and
temperatures. Each CPU model is different, so you'd have to find
someone with an AMD Phenom II X6 1075T CPU and compare thermals.

> Thank you for your help talking me through this.
>
> I've attached my dmesg output as dmesg.log.
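
On the temperature logging mentioned in the quoted text: a minimal
sketch in Python, not a tested tool. It assumes a sensor driver such as
amdtemp(4) is loaded and that "sysctl dev.cpu" prints lines of the form
"dev.cpu.0.temperature: 45.0C"; neither is confirmed by the dmesg in
this thread. The function names and the log path are hypothetical. Each
reading is flushed and fsync'd so the last line has a chance of
surviving a hard lockup.

```python
import os
import re
import subprocess
import time

# Assumed sysctl line format: "dev.cpu.0.temperature: 45.0C"
TEMP_RE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*(-?\d+(?:\.\d+)?)C$")


def parse_temps(sysctl_output):
    """Return {core_index: degrees_C} for every temperature line found."""
    temps = {}
    for line in sysctl_output.splitlines():
        match = TEMP_RE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps


def log_temps(path="cputemp.log", interval=10):
    """Append one timestamped reading per interval until interrupted."""
    with open(path, "a") as log:
        while True:
            out = subprocess.run(
                ["sysctl", "dev.cpu"],
                capture_output=True, text=True, check=True,
            ).stdout
            temps = parse_temps(out)
            fields = " ".join(
                f"cpu{n}={t:.1f}C" for n, t in sorted(temps.items())
            )
            log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {fields}\n")
            # Push the line to disk now; a lockup would lose buffered data.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval)
```

Calling log_temps() from a root shell and reading the tail of the file
after the next lockup would show whether temperatures were climbing
just before the machine froze.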

Let's look at your storage controller setup:

atapci0: <JMicron JMB361 UDMA133 controller> irq 18
atapci1: <AHCI SATA controller> on atapci0
ata2: <ATA channel 0> on atapci1
ata3: <ATA channel 1> on atapci1
ata4: <ATA channel 0> on atapci0
atapci2: <ATI IXP700/800 SATA300 controller> irq 19
atapci2: AHCI v1.20 controller with 4 6Gbps ports, PM supported
ata5: <ATA channel 0> on atapci2
ata6: <ATA channel 1> on atapci2
ata7: <ATA channel 2> on atapci2
ata8: <ATA channel 3> on atapci2
atapci3: <ATI IXP700/800 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xff00-0xff0f at device 20.1 on pci0
ata0: <ATA channel 0> on atapci3
ata1: <ATA channel 1> on atapci3

There have been recent discussions about "problems" on the ATI
IXP700/800 controllers. I do not buy AMD systems, so I can't comment on
this controller's reliability. Just an FYI point. Here's the thread:

http://lists.freebsd.org/pipermail/freebsd-stable/2011-February/thread.html#61348

I also tend to avoid JMicron controllers like the plague. I've seen too
many problem reports with them over the years, regardless of OS.

Now for the disk layout (I'm excluding da0, which is a USB flash disk of
some kind).

ad0: 953869MB <WDC WD10EARS-00Y5B1 80.00A80> at ata0-master UDMA133 SATA
ad1: 953869MB <Seagate ST31000333AS CC1H> at ata0-slave UDMA133 SATA
ad4: 1430799MB <WL1500GSA6472 05.00F.1> at ata2-master UDMA100 SATA 3Gb/s
ad8: 15279MB <TRANSCEND 20091215> at ata4-master UDMA66
acd0: CDRW <NEC CD-RW NR-7900A/1.08> at ata4-slave UDMA33
ad10: 953869MB <WL1000GSA1672 05.00J05> at ata5-master UDMA100 SATA 3Gb/s
ad12: 953869MB <Seagate ST31000333AS CC1H> at ata6-master UDMA100 SATA 3Gb/s
ad14: 953869MB <SAMSUNG HD103UJ 1AA01118> at ata7-master UDMA100 SATA 3Gb/s
ad16: 953869MB <WL1000GSA1672 HA.00CHA> at ata8-master UDMA100 SATA 3Gb/s

You have a very large number of hard disks in this machine, so I sure
hope you do have a decent enough PSU to handle it all.
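
For anyone who wants to tally a disk list like the one above
mechanically, here is a rough sketch in Python. The regular expression
is my own guess at the ata(4) probe line format based only on the lines
shown here, and the helper names are hypothetical; the sample data is
copied from the dmesg in this thread.

```python
import re
from collections import defaultdict

# Disk probe lines copied from the dmesg output in this thread.
DMESG_DISKS = """\
ad0: 953869MB <WDC WD10EARS-00Y5B1 80.00A80> at ata0-master UDMA133 SATA
ad1: 953869MB <Seagate ST31000333AS CC1H> at ata0-slave UDMA133 SATA
ad4: 1430799MB <WL1500GSA6472 05.00F.1> at ata2-master UDMA100 SATA 3Gb/s
ad8: 15279MB <TRANSCEND 20091215> at ata4-master UDMA66
acd0: CDRW <NEC CD-RW NR-7900A/1.08> at ata4-slave UDMA33
ad10: 953869MB <WL1000GSA1672 05.00J05> at ata5-master UDMA100 SATA 3Gb/s
ad12: 953869MB <Seagate ST31000333AS CC1H> at ata6-master UDMA100 SATA 3Gb/s
ad14: 953869MB <SAMSUNG HD103UJ 1AA01118> at ata7-master UDMA100 SATA 3Gb/s
ad16: 953869MB <WL1000GSA1672 HA.00CHA> at ata8-master UDMA100 SATA 3Gb/s
"""

# Assumed format: "adN: <size>MB <model> at ataN-master|slave ..."
# The optical drive (acd0) has no size field and is deliberately skipped.
DISK_RE = re.compile(r"^(ad\d+): (\d+)MB <([^>]*)> at (ata\d+)-(master|slave)\b")


def parse_disks(dmesg_text):
    """Return one dict per adN disk line found in the text."""
    disks = []
    for line in dmesg_text.splitlines():
        match = DISK_RE.match(line)
        if match:
            dev, size_mb, model, channel, role = match.groups()
            disks.append({
                "dev": dev,
                "size_mb": int(size_mb),
                "model": model,
                "channel": channel,
                "role": role,
            })
    return disks


def by_channel(disks):
    """Group device names by the ATA channel they are attached to."""
    channels = defaultdict(list)
    for disk in disks:
        channels[disk["channel"]].append(disk["dev"])
    return dict(channels)


disks = parse_disks(DMESG_DISKS)
total_mb = sum(d["size_mb"] for d in disks)
print(len(disks), "disks,", total_mb, "MB total")
for channel, devs in sorted(by_channel(disks).items()):
    print(channel, devs)
```

Grouping by channel also makes it easy to see which disks share a
channel, for example ad0 and ad1 as master and slave on ata0.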

If I had to make a recommendation, it would be to decrease the number of
hard disks in the machine. You have 8 of them -- one of which may be a
RAM drive or something similar -- and that isn't including your CDRW
drive.

I would also try getting rid of the JMicron controller; I would
recommend investing in a Silicon Image controller to replace it,
specifically one driven by the 3124, 3132, or 3531 chips. Avoid the
3112, 3114, and 3512 chips:

http://en.wikipedia.org/wiki/Silicon_Image#Product_alerts

Next we have this:

> ad1: TIMEOUT - READ_DMA retrying (1 retry left) LBA=1
> GEOM: ad1: partition 1 does not start on a track boundary.
> GEOM: ad1: partition 1 does not end on a track boundary.
> GEOM: label/1TBdisk5: partition 1 does not start on a track boundary.
> GEOM: label/1TBdisk5: partition 1 does not end on a track boundary.

This doesn't look good, especially the READ_DMA timeout on ad1. That's
a different disk than the one you told me about before. LBA 1 is
literally the 2nd block on the disk, which is a little too close to
block 0 for comfort. I'd love to see "smartctl -a /dev/ad1" output
here.

> calcru: runtime went backwards from 82 usec to 70 usec for pid 20 (flowcleaner)
> calcru: runtime went backwards from 363 usec to 317 usec for pid 8 (pagedaemon)
> calcru: runtime went backwards from 111 usec to 95 usec for pid 7 (xpt_thrd)
> calcru: runtime went backwards from 1892 usec to 1629 usec for pid 1 (init)
> calcru: runtime went backwards from 6786 usec to 6591 usec for pid 0 (kernel)

This is a problem that has plagued FreeBSD for some time. It's usually
caused by EIST (est) being used, but that's on Intel platforms. AMD has
something similar called Cool'n'Quiet (see cpufreq(4) man page). Are
you running powerd(8) on this system? If so, try disabling that and see
if these go away.

> GEOM_ELI: Device label/1tbgreendisk.eli created.
> GEOM_ELI: Encryption: AES-CBC 256
> GEOM_ELI: Crypto: software
> {...}

There was no mention of geli(8) being used on this system until now.
There may be other complexities as a result of this; I don't know.

Good luck.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |