From owner-freebsd-stable@FreeBSD.ORG Wed Feb 9 07:07:25 2011
Return-Path:
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id 981B71065693
    for ; Wed, 9 Feb 2011 07:07:25 +0000 (UTC)
    (envelope-from greg@bonett.org)
Received: from bonett.org (bonett.org [66.249.7.150])
    by mx1.freebsd.org (Postfix) with ESMTP id 54F128FC0C
    for ; Wed, 9 Feb 2011 07:07:25 +0000 (UTC)
Received: from [192.168.1.216] (unknown [76.91.19.169])
    by bonett.org (Postfix) with ESMTPSA id 891FB124098;
    Wed, 9 Feb 2011 07:07:22 +0000 (UTC)
From: Greg Bonett
To: Jeremy Chadwick
In-Reply-To: <20110208064633.GA3367@icarus.home.lan>
References: <1297026074.23922.8.camel@ubuntu>
    <20110207045501.GA15568@icarus.home.lan>
    <1297065041.754.12.camel@ubuntu>
    <20110207085537.GA20545@icarus.home.lan>
    <1297143276.9417.400.camel@ubuntu>
    <20110208055239.GA2557@icarus.home.lan>
    <1297145806.9417.413.camel@ubuntu>
    <20110208064633.GA3367@icarus.home.lan>
Content-Type: multipart/mixed; boundary="=-tk2Zc8zBx+cz/jsOhEgH"
Date: Tue, 08 Feb 2011 23:07:21 -0800
Message-ID: <1297235241.4729.35.camel@ubuntu>
Mime-Version: 1.0
X-Mailer: Evolution 2.30.3
Cc: freebsd-stable@freebsd.org
Subject: Re: 8.1 amd64 lockup (maybe zfs or disk related)
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 09 Feb 2011 07:07:25 -0000

--=-tk2Zc8zBx+cz/jsOhEgH
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit

ok, I think you're right - there is more than one problem with this
system, but I think I'm starting to isolate them and make some progress.

> # Debugging options
> options BREAK_TO_DEBUGGER  # Sending a serial BREAK drops to DDB
> options KDB                # Enable kernel debugger support
> options KDB_TRACE          # Print stack trace automatically on panic
> options DDB                # Support DDB
> options GDB                # Support remote GDB
>
> Documented here:
> http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-options.html

I rebuilt my kernel with the debug options, but thankfully I think I've
learned how to avoid the lockup for the time being. I think I am asking
too much of my 650 watt power supply. I unplugged one hard drive and
disabled another CPU core (now running 4 of 6). I'm sad to lose the
horsepower, but I was able to complete an entire zpool scrub and other
high-load tasks without a lockup.

> Let's look at your storage controller setup:
>
> atapci0: irq 18
> atapci1: on atapci0
> ata2: on atapci1
> ata3: on atapci1
> ata4: on atapci0
> atapci2: irq 19
> atapci2: AHCI v1.20 controller with 4 6Gbps ports, PM supported
> ata5: on atapci2
> ata6: on atapci2
> ata7: on atapci2
> ata8: on atapci2
> atapci3: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xff00-0xff0f at device 20.1 on pci0
> ata0: on atapci3
> ata1: on atapci3
>
> There have been recent discussions about "problems" on the ATI
> IXP700/800 controllers. I do not buy AMD systems, so I can't comment on
> this controllers' reliability. Just a FYI point. Here's the thread:
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2011-February/thread.html#61348
>
> I also tend to avoid JMicron controllers like the plague. I've seen too
> many problem reports with them over the years, regardless of OS.

I'll look into this. I think the controller is the source of the
"FAILURE - READ_LMA48" errors. I switched the disk/SATA port pairing and
the error stayed with the SATA port, not the disk.

> Now for the disk layout (I'm excluding da0, which is a USB flash disk of
> some kind).
>
> ad0: 953869MB at ata0-master UDMA133 SATA
> ad1: 953869MB at ata0-slave UDMA133 SATA
> ad4: 1430799MB at ata2-master UDMA100 SATA 3Gb/s
> ad8: 15279MB at ata4-master UDMA66
> acd0: CDRW at ata4-slave UDMA33
> ad10: 953869MB at ata5-master UDMA100 SATA 3Gb/s
> ad12: 953869MB at ata6-master UDMA100 SATA 3Gb/s
> ad14: 953869MB at ata7-master UDMA100 SATA 3Gb/s
> ad16: 953869MB at ata8-master UDMA100 SATA 3Gb/s
>
> You have a very large number of hard disks in this machine, so I sure
> hope you do have a decent enough PSU to handle it all.
>
> If I had to make a recommendation, it would be to decrease the number of
> hard disks in the machine. You have 8 of them -- one of which may be a
> RAM drive or something similar -- and that isn't including your CDRW
> drive.

Yes, I think this is the problem. For clarification, though, there are
only 6 spindle disks in the machine. ad4 is an external drive over eSATA
(with its own power), and ad8 is a CF drive.

> I would also try getting rid of the JMicron controller; I would
> recommend investing in a Silicon Image controller to replace it,
> specifically one driven by the 3124, 3132, or 3531 chips. Avoid the
> 3112, 3114, and 3512 chips:
> http://en.wikipedia.org/wiki/Silicon_Image#Product_alerts

Thanks for the recommendation. I'll probably pick one of these up along
with a new power supply.

> Next we have this:
>
> > ad1: TIMEOUT - READ_DMA retrying (1 retry left) LBA=1
> > GEOM: ad1: partition 1 does not start on a track boundary.
> > GEOM: ad1: partition 1 does not end on a track boundary.
> > GEOM: label/1TBdisk5: partition 1 does not start on a track boundary.
> > GEOM: label/1TBdisk5: partition 1 does not end on a track boundary.
>
> This doesn't look good, especially the READ_DMA timeout on ad1. That's
> a different disk than the one you told me about before. LBA 1 is
> literally the 2nd block on the disk, which is a little too close to
> block 0 for comfort. I'd love to see "smartctl -a /dev/ad1" output
> here.

I've attached the output of "smartctl -a /dev/ad1". I don't think this
error is being caused by the disk, though. As I said above, I changed the
SATA port / drive pairing and this error stays with the SATA port, not
the drive. (So, as you said, time for a new controller.)

> > calcru: runtime went backwards from 82 usec to 70 usec for pid 20 (flowcleaner)
> > calcru: runtime went backwards from 363 usec to 317 usec for pid 8 (pagedaemon)
> > calcru: runtime went backwards from 111 usec to 95 usec for pid 7 (xpt_thrd)
> > calcru: runtime went backwards from 1892 usec to 1629 usec for pid 1 (init)
> > calcru: runtime went backwards from 6786 usec to 6591 usec for pid 0 (kernel)
>
> This is a problem that has plagued FreeBSD for some time. It's usually
> caused by EIST (est) being used, but that's on Intel platforms. AMD has
> something similar called Cool'n'Quiet (see cpufreq(4) man page). Are
> you running powerd(8) on this system? If so, try disabling that and see
> if these go away.

Sadly, I don't know if I'm running powerd. "ps aux | grep power" gives
nothing, so I guess not. As far as I can tell, this error is the least
of my problems right now, but I would like to fix it.

> > GEOM_ELI: Device label/1tbgreendisk.eli created.
> > GEOM_ELI: Encryption: AES-CBC 256
> > GEOM_ELI: Crypto: software
> > {...}
>
> There was no mention of geli(8) being used on this system until now.
> There may be other complexities as a result of this; I don't know.

Yeah, geli is being used on this system; sorry, I forgot to mention that.

> Good luck.
>

Thanks for the help, I'm at least able to keep the machine online now.
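
For anyone following the thread, the port/disk swap test described above
can be written down as a small sketch. This is only an illustration of
the reasoning, not a diagnostic tool: the function name and the "A"/"B"
disk labels are made up for the example, and ata0-slave is used simply
because that is where ad1 appears in the disk listing quoted earlier.

```python
# Hypothetical helper illustrating the swap test; not part of any real tool.
def error_follows(before, after):
    """Say whether an error tracks the port or the disk across a swap.

    `before` and `after` are dicts with the keys "port" and "disk",
    naming where the error was observed before and after the swap.
    """
    same_port = before["port"] == after["port"]
    same_disk = before["disk"] == after["disk"]
    if same_port and not same_disk:
        return "port"          # error stayed on the port: suspect controller/cabling
    if same_disk and not same_port:
        return "disk"          # error moved with the drive: suspect the drive
    return "inconclusive"      # nothing changed, or both changed


# Assumed observation: the timeout is seen on ata0-slave with disk "A",
# and after swapping drives it is still seen on ata0-slave, now with disk "B".
seen_before = {"port": "ata0-slave", "disk": "A"}
seen_after = {"port": "ata0-slave", "disk": "B"}
print(error_follows(seen_before, seen_after))  # prints "port"
```

A single swap only narrows things down to "something on that port's
path", which still includes the cable and the controller itself.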
--=-tk2Zc8zBx+cz/jsOhEgH
Content-Disposition: attachment; filename="ad1.smart"
Content-Type: text/plain; name="ad1.smart"; charset="UTF-8"
Content-Transfer-Encoding: 7bit

smartctl 5.40 2010-10-16 r3189 [FreeBSD 8.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31000333AS
Serial Number:    9TE1MB10
Firmware Version: CC1H
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Feb 8 07:41:31 2011 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 617) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 208) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   120   099   006    Pre-fail  Always       -       243069120
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       84
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       83902794
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       14308
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       84
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   093   000    Old_age   Always       -       56
189 High_Fly_Writes         0x003a   017   017   000    Old_age   Always       -       83
190 Airflow_Temperature_Cel 0x0022   076   051   045    Old_age   Always       -       24 (Min/Max 24/24)
194 Temperature_Celsius     0x0022   024   049   000    Old_age   Always       -       24 (0 17 0 0)
195 Hardware_ECC_Recovered  0x001a   050   019   000    Old_age   Always       -       243069120
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       96619584305016
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       412576321
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2438661969

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         13221       -
# 2  Extended offline    Interrupted (host reset)      90%         13216       -
# 3  Short offline       Completed without error       00%         13207       -
# 4  Extended offline    Interrupted (host reset)      50%         13199       -
# 5  Extended offline    Completed without error       00%         13134       -
# 6  Conveyance offline  Completed without error       00%         13131       -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

--=-tk2Zc8zBx+cz/jsOhEgH--
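
As a reading aid for the attribute table in the ad1.smart attachment,
here is a minimal sketch that parses a few of its rows and lists any
attribute whose WORST value has reached its threshold. The rows are
copied from the attachment; the function names and the "threshold of 0
means no threshold" convention are assumptions made for the example, and
raw values are vendor-specific, so the sketch reports them without
interpreting them.

```python
# A few rows copied from the ad1.smart attachment (whitespace-separated columns).
SMART_ROWS = """\
  1 Raw_Read_Error_Rate     0x000f   120   099   006    Pre-fail  Always       -       243069120
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
188 Command_Timeout         0x0032   100   093   000    Old_age   Always       -       56
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def parse_rows(text):
    """Parse attribute rows into a dict keyed by attribute name."""
    rows = {}
    for line in text.splitlines():
        # Ten columns; the last (RAW_VALUE) may itself contain spaces.
        parts = line.split(None, 9)
        if len(parts) < 10 or not parts[0].isdigit():
            continue  # skip headers and anything that is not an attribute row
        rows[parts[1]] = {
            "id": int(parts[0]),
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "raw": parts[9].strip(),
        }
    return rows


def below_threshold(rows):
    """Names of attributes whose WORST is at or below a nonzero THRESH."""
    return [name for name, r in rows.items()
            if r["thresh"] > 0 and r["worst"] <= r["thresh"]]


rows = parse_rows(SMART_ROWS)
print(below_threshold(rows))            # prints []
print(rows["Command_Timeout"]["raw"])   # prints 56
```

For these rows nothing is at or below threshold, and the only nonzero
raw count among the last three is Command_Timeout.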