From owner-freebsd-stable@FreeBSD.ORG Fri Jan 25 20:05:51 2008
Return-Path:
Delivered-To: freebsd-stable@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id 7801816A419;
    Fri, 25 Jan 2008 20:05:51 +0000 (UTC)
    (envelope-from joe@boulder.swri.edu)
Received: from mail.boulder.swri.edu (mail.boulder.swri.edu [65.241.78.2])
    by mx1.freebsd.org (Postfix) with ESMTP id DA71113C467;
    Fri, 25 Jan 2008 20:05:50 +0000 (UTC)
    (envelope-from joe@boulder.swri.edu)
Received: from [10.0.3.98] (antares.boulder.swri.edu [10.0.3.98])
    (authenticated bits=0)
    by mail.boulder.swri.edu (8.13.5/8.13.5) with ESMTP id m0PJFace019646
    (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT);
    Fri, 25 Jan 2008 12:15:36 -0700
Message-ID: <479A356E.2030900@boulder.swri.edu>
Date: Fri, 25 Jan 2008 12:15:58 -0700
From: Joe Peterson
User-Agent: Thunderbird 2.0.0.9 (X11/20071119)
MIME-Version: 1.0
To: Jeremy Chadwick
References: <479A0731.6020405@skyrush.com>
    <20080125162940.GA38494@eos.sc1.parodius.com>
In-Reply-To: <20080125162940.GA38494@eos.sc1.parodius.com>
Content-Type: multipart/mixed;
    boundary="------------050608070403050000040900"
Cc: freebsd-stable@FreeBSD.org
Subject: Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Fri, 25 Jan 2008 20:05:51 -0000

This is a multi-part message in MIME format.
--------------050608070403050000040900
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Jeremy Chadwick wrote:
> What you've shown is usually the sign of a disk-related problem.  It's
> very obvious when it's just one disk reporting DMA errors.  You use ZFS,
> so chances are you have more than one disk in a pool/volume -- there's
> no indication ad1, ad4, ad6, etc.
> are failing, so this seems to indicate
> something specific to ad0.

Jeremy, thanks for the response - I have tried to answer all of your
questions below...

In my case, I am using only one disk (ad0) for FreeBSD, and I am only
using one partition on this disk in my ZFS pool.  So, in this case,
unfortunately, it's not possible to tell from the fact that only ad0 is
listed that it is specific to this drive.

> Manufacturers pick very passive (non-aggressive) thresholds for error
> conditions on disks, so disks which are failing very commonly show
> "PASSED" during SMART analysis.  To make matters worse, most users I
> know read SMART stats incorrectly (they're easy to misinterpret).

Yep, I am also always skeptical of smart reports.  That's one reason I
am very interested in ZFS.  I don't trust the drive to be completely
reliable, and the fact that ZFS does end-to-end data integrity is very
intriguing.

> Can you please provide output of the following:
>
> * smartctl -a /dev/ad0

OK, I've attached this to the end of this email.

> * atacontrol cap ad0

Protocol              ATA/ATAPI revision 7
device model          ST3500630A
serial number         9QG0DG03
firmware revision     3.AAE
cylinders             16383
heads                 16
sectors/track         63
lba supported         268435455 sectors
lba48 supported       976773168 sectors
dma supported
overlap not supported

Feature                      Support  Enable    Value           Vendor
write cache                    yes      yes
read ahead                     yes      yes
Tagged Command Queuing (TCQ)   no       no      0/0x00
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no      65278/0xFEFE
automatic acoustic management  no       no      0/0x00          208/0xD0

> * atacontrol info

Master:  ad0 ATA/ATAPI revision 7
Slave:   ad1 ATA/ATAPI revision 7

(but note that ad1 is not used by FreeBSD)

> * Relevant dmesg output that indicates what kind of ATA controller
>   these disks are attached to.  Start with output from 'ad0:' and
>   work backwards.
> For example, ad0 on this machine is using an Intel
> ICH6 controller:
> atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
> ata0: on atapci0
> ad0: 238475MB at ata0-master SATA150

atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.1 on pci0
ata0: on atapci0
ata0: [ITHREAD]
ad0: 476940MB at ata0-master UDMA100

> SMART stats which are labelled "Offline" are only updated when a short
> or long offline test is performed.  Have you tried using "smartctl -t
> short /dev/ad0" and "smartctl -t long /dev/ad0" to see if any of the raw
> values on the far right column increment?

I just tried one:

# 1  Short offline       Completed without error       00%      5252         -
# 2  Short offline       Completed without error       00%      5252         -

Also, none of the numbers that were zero incremented, esp:

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

Also, no more errors were reported in the system log during the self-tests.

> Have you tried using "zpool scrub" on the ZFS pool, then "zpool status"
> to see if READ/WRITE/CHKSUM counters increment or if the "scrub" line
> states there were errors?

OK, I started a scrub, and it will take some more time to complete...
But I get the following with status.  Could this be due to the timeouts
and failures?  I suspect so, so maybe this is not surprising.  I'd also
guess that this doesn't necessarily point to the drive, but to anything
in the chain of events...  I do not have a mirror or RAID-Z, so I guess
the reason there was "no data loss" (yet) is because the checksum passed,
and maybe it just had to retry...?  Anyway, here's the output so far:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 2.50% done, 1h58m to go
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       1     3     0
          ad0s1d    ONLINE       1     3     0

errors: No known data errors

> Other things which have fixed problems in the past for others:
>
> * BIOS updates
> * Change of motherboards (sometimes replacing board with same model,
>   other times going with a completely different vendor (implies weird
>   implementation issues or BIOS problems))

I've been using this same motherboard/BIOS for a long time (as well as
this drive), so no changes have happened to the HW recently.  The BIOS
is the newest available, I believe (it's a Tyan Trinity S2099, so it's a
few years old).

> * Changing SATA cables

I'm using regular 80-conductor ATA cables.  Also, these seem to have
been working fine for quite a while now.  But, yes, I have also
witnessed bad cable issues on older systems in the past.  I certainly
could try a new cable and see if it helps.

> * Getting a larger power supply (usually when lots of disk are involved)

I only have two drives, so I think the PS has enough capacity in my case.

Anyway, thanks for the reply and further questions.  Let me know if
anything I've sent back is helpful!

                        Thanks, Joe

--------------050608070403050000040900
Content-Type: text/plain;
 name="smartctl.out"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="smartctl.out"

smartctl version 5.37 [i386-portbld-freebsd7.0] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10 family
Device Model:     ST3500630A
Serial Number:    9QG0DG03
Firmware Version: 3.AAE
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Jan 25 09:55:13 2008 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 163) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       56
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       286126605
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       5250
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       59
187 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 Unknown_Attribute       0x003a   100   100   000    Old_age   Always       -       0
190 Temperature_Celsius     0x0022   065   056   045    Old_age   Always       -       605749283
194 Temperature_Celsius     0x0022   035   044   000    Old_age   Always       -       35 (Lifetime Min/Max 0/15)
195 Hardware_ECC_Recovered  0x001a   063   046   000    Old_age   Always       -       166181300
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

--------------050608070403050000040900--
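
The thread notes that SMART stats are easy to misread, so here is a minimal Python sketch of one way to read the attribute table in the attached smartctl output. It is not part of the original message and not a diagnostic tool. The function names, the regular expression, and the WATCH_RAW list (attributes whose raw value is treated here as a plain count) are assumptions made for illustration. The sample rows are copied from the attachment. The sketch compares each normalized VALUE against its THRESH, and separately reports nonzero raw counts for the watched attributes, since raw values of other attributes can be vendor-encoded and are not necessarily counts.

```python
import re

# Sample rows copied from the attached smartctl output.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   114   071   006    Pre-fail  Always       -       82422948
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       1
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

# One attribute row: ID, name, flag, VALUE, WORST, THRESH, type,
# updated, when_failed, then the leading integer of RAW_VALUE.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)

# Assumption for this sketch: these IDs carry raw values that are plain counts.
WATCH_RAW = {5, 197, 198, 199}


def parse_attributes(text):
    """Return a list of dicts, one per attribute row found in text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m["id"]),
                "name": m["name"],
                "value": int(m["value"]),
                "worst": int(m["worst"]),
                "thresh": int(m["thresh"]),
                "raw": int(m["raw"]),
            })
    return rows


def findings(rows):
    """List attributes at/below threshold, plus nonzero watched raw counts."""
    notes = []
    for r in rows:
        # A threshold of 0 means the attribute can never trip, so skip it.
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            notes.append(
                f"{r['name']}: normalized value {r['value']} "
                f"at or below threshold {r['thresh']}"
            )
        if r["id"] in WATCH_RAW and r["raw"] > 0:
            notes.append(f"{r['name']}: raw count {r['raw']}")
    return notes


if __name__ == "__main__":
    for note in findings(parse_attributes(SAMPLE)):
        print(note)
```

On the sample rows this reports only the single reallocated sector. That matches the attachment: every normalized value is above its threshold, and the pending, offline-uncorrectable and CRC counts are all zero.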