From owner-freebsd-current@FreeBSD.ORG Mon Sep 14 15:21:42 2009 Return-Path: Delivered-To: current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 734AA106566B; Mon, 14 Sep 2009 15:21:42 +0000 (UTC) (envelope-from 000.fbsd@quip.cz) Received: from elsa.codelab.cz (elsa.codelab.cz [94.124.105.4]) by mx1.freebsd.org (Postfix) with ESMTP id 0F6098FC08; Mon, 14 Sep 2009 15:21:41 +0000 (UTC) Received: from localhost (localhost.codelab.cz [127.0.0.1]) by elsa.codelab.cz (Postfix) with ESMTP id 22C7A19E027; Mon, 14 Sep 2009 17:21:40 +0200 (CEST) Received: from [192.168.1.2] (r5bb235.net.upc.cz [86.49.61.235]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by elsa.codelab.cz (Postfix) with ESMTPSA id 1EA7019E019; Mon, 14 Sep 2009 17:21:36 +0200 (CEST) Message-ID: <4AAE5F7E.4050103@quip.cz> Date: Mon, 14 Sep 2009 17:21:34 +0200 From: Miroslav Lachman <000.fbsd@quip.cz> User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 X-Accept-Language: cz, cs, en, en-us MIME-Version: 1.0 To: Alexandre Sunny References: <4AAD4E51.5060908@FreeBSD.org> <4AAD5365.5000902@FreeBSD.org> <4AAD5DD2.4030104@FreeBSD.org> <20090914100941.0adc00aa@Nokia-N810-43-7> In-Reply-To: <20090914100941.0adc00aa@Nokia-N810-43-7> Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Cc: Kris Kennaway , Alexander Motin , FreeBSD Current Subject: Re: ata timeouts under load X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 14 Sep 2009 15:21:42 -0000 Alexandre Sunny wrote: > On Sun, 13 Sep 2009 22:02:10 +0100 > Kris Kennaway wrote: > > >>Alexander Motin wrote: >> >>>Kris Kennaway wrote: [...] >>>There are two different kinds of timeouts we can see: >>> - first one, "ad4: WARNING - ..." is just a queue waiting timeout. >>>It is not the reason, but consequence of the problem. And I have >>>doubts that it is reasonable to do it. >>> - second one, "TIMEOUT - WRITE_DMA48 ..." is a real command >>>execution timeout. I don't know whether this is result of some >>>improper error recovery, or you drive indeed lost required servo >>>information near LBA=344052040 and tries to find it too long. You >>>can try to read that sector and nearby ones with dd. >>> >> >>It's always that sequence (with setfeatures timing out first, then >>the dma later)...and the block number varies widely, also whether >>it's read/write. The disk itself & the data it contains appears to >>be OK as far as I have been able to determine so far. > > > Does smartctl -A /dev/ad4 report "Seek Error Rate" and/or "ECC Error > Rate", and, if so, do those values change while errors are being > reported? > > "Replaced Sector Count" or something similar might give some insight > too. I have very similar problem with one disk in gmirror, but it is on 7.2 not current. Sep 14 04:48:29 jimi kernel: ad6: timeout waiting to issue command Sep 14 04:48:29 jimi kernel: ad6: error issuing FLUSHCACHE command Sep 14 04:48:29 jimi kernel: ad6: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=447001516 Sep 14 04:48:29 jimi kernel: ad6: FAILURE - READ_DMA48 status=51 error=4 LBA=447001516 Sep 14 04:48:29 jimi kernel: GEOM_MIRROR: Request failed (error=5). ad6[READ(offset=228864776192, length=2048)] Sep 14 04:48:29 jimi kernel: GEOM_MIRROR: Device gm0: provider ad6 disconnected. But no errors in SMART log: Device Model: Hitachi HDP725050GLA360 Firmware Version: GM4OA52A User Capacity: 500,107,862,016 bytes SMART overall-health self-assessment test result: PASSED SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 1 2 Throughput_Performance 0x0005 130 130 054 Pre-fail Offline - 151 3 Spin_Up_Time 0x0007 116 116 024 Pre-fail Always - 312 (Average 350) 4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 23 5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0 7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0 8 Seek_Time_Performance 0x0005 129 129 020 Pre-fail Offline - 30 9 Power_On_Hours 0x0012 099 099 000 Old_age Always - 13911 10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 23 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 545 193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 545 194 Temperature_Celsius 0x0002 240 240 000 Old_age Always - 25 (Lifetime Min/Max 20/34) 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0 As it was discussed many times - it should be fixed by increasing the hardcoded timouts. Is it time to make the ATA timeout sysctl tunables? There were patches from FreeNAS and some PRs about longer timeouts. kern/136182: [ata] Heavy disk writes (e.g. ZFS resilver to a drive) can cause "adX: TIMEOUT - FLUSHCACHE retrying (1 retry left)" on console. http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/136182 kern/111023: [ata] [request] [patch] please expand ata timeouts http://www.freebsd.org/cgi/query-pr.cgi?pr=111023 ATA/SATA DMA timeout issues http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting#line-53 HowTo: Fix SATA DMA timeout issues on FreeBSD http://linux-bsd-sharing.blogspot.com/2009/03/howto-fix-sata-dma-timeout-issues-on.html Western Digital hard disks and ATA timeouts http://www.mail-archive.com/freebsd-hardware@freebsd.org/msg03135.html ata FLUSHCACHE timeout errors? [patch] http://lists.freebsd.org/pipermail/freebsd-current/2009-April/005939.html And I am sure, you can find many more reports floating around. Miroslav Lachman