From owner-freebsd-current@FreeBSD.ORG  Mon Sep 14 15:21:42 2009
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 734AA106566B;
	Mon, 14 Sep 2009 15:21:42 +0000 (UTC)
	(envelope-from 000.fbsd@quip.cz)
Received: from elsa.codelab.cz (elsa.codelab.cz [94.124.105.4])
	by mx1.freebsd.org (Postfix) with ESMTP id 0F6098FC08;
	Mon, 14 Sep 2009 15:21:41 +0000 (UTC)
Received: from localhost (localhost.codelab.cz [127.0.0.1])
	by elsa.codelab.cz (Postfix) with ESMTP id 22C7A19E027;
	Mon, 14 Sep 2009 17:21:40 +0200 (CEST)
Received: from [192.168.1.2] (r5bb235.net.upc.cz [86.49.61.235])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by elsa.codelab.cz (Postfix) with ESMTPSA id 1EA7019E019;
	Mon, 14 Sep 2009 17:21:36 +0200 (CEST)
Message-ID: <4AAE5F7E.4050103@quip.cz>
Date: Mon, 14 Sep 2009 17:21:34 +0200
From: Miroslav Lachman <000.fbsd@quip.cz>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US;
	rv:1.7.12) Gecko/20050915
X-Accept-Language: cz, cs, en, en-us
MIME-Version: 1.0
To: Alexandre Sunny <gaijin.k@gmail.com>
References: <4AAD4E51.5060908@FreeBSD.org>
	<4AAD5365.5000902@FreeBSD.org>	<4AAD5DD2.4030104@FreeBSD.org>
	<20090914100941.0adc00aa@Nokia-N810-43-7>
In-Reply-To: <20090914100941.0adc00aa@Nokia-N810-43-7>
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Kris Kennaway <kris@FreeBSD.org>, Alexander Motin <mav@FreeBSD.org>,
	FreeBSD Current <current@freebsd.org>
Subject: Re: ata timeouts under load
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 14 Sep 2009 15:21:42 -0000

Alexandre Sunny wrote:
> On Sun, 13 Sep 2009 22:02:10 +0100
> Kris Kennaway <kris@FreeBSD.org> wrote:
> 
> 
>>Alexander Motin wrote:
>>
>>>Kris Kennaway wrote:

[...]

>>>There are two different kinds of timeouts we can see:
>>> - first one, "ad4: WARNING - ..." is just a queue waiting timeout.
>>>It is not the reason, but consequence of the problem. And I have
>>>doubts that it is reasonable to do it.
>>> - second one, "TIMEOUT - WRITE_DMA48 ..." is a real command
>>>execution timeout. I don't know whether this is result of some
>>>improper error recovery, or you drive indeed lost required servo
>>>information near LBA=344052040 and tries to find it too long. You
>>>can try to read that sector and nearby ones with dd.
>>>
>>
>>It's always that sequence (with setfeatures timing out first, then
>>the dma later)...and the block number varies widely, also whether
>>it's read/write.  The disk itself & the data it contains appears to
>>be OK as far as I have been able to determine so far.
> 
> 
> Does smartctl -A /dev/ad4 report "Seek Error Rate" and/or "ECC Error
> Rate", and, if so, do those values change while errors are being
> reported?
> 
> "Replaced Sector Count" or something similar might give some insight
> too.


I have very similar problem with one disk in gmirror, but it is on 7.2 
not current.

Sep 14 04:48:29 jimi kernel: ad6: timeout waiting to issue command
Sep 14 04:48:29 jimi kernel: ad6: error issuing FLUSHCACHE command
Sep 14 04:48:29 jimi kernel: ad6: TIMEOUT - READ_DMA48 retrying (1 retry 
left) LBA=447001516
Sep 14 04:48:29 jimi kernel: ad6: FAILURE - READ_DMA48 
status=51<READY,DSC,ERROR> error=4<ABORTED> LBA=447001516
Sep 14 04:48:29 jimi kernel: GEOM_MIRROR: Request failed (error=5). 
ad6[READ(offset=228864776192, length=2048)]
Sep 14 04:48:29 jimi kernel: GEOM_MIRROR: Device gm0: provider ad6 
disconnected.


But no errors in SMART log:

Device Model:     Hitachi HDP725050GLA360
Firmware Version: GM4OA52A
User Capacity:    500,107,862,016 bytes

SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE 
UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail 
Always       -       1
   2 Throughput_Performance  0x0005   130   130   054    Pre-fail 
Offline      -       151
   3 Spin_Up_Time            0x0007   116   116   024    Pre-fail 
Always       -       312 (Average 350)
   4 Start_Stop_Count        0x0012   100   100   000    Old_age 
Always       -       23
   5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail 
Always       -       0
   7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail 
Always       -       0
   8 Seek_Time_Performance   0x0005   129   129   020    Pre-fail 
Offline      -       30
   9 Power_On_Hours          0x0012   099   099   000    Old_age 
Always       -       13911
  10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail 
Always       -       0
  12 Power_Cycle_Count       0x0032   100   100   000    Old_age 
Always       -       23
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always 
       -       545
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always 
       -       545
194 Temperature_Celsius     0x0002   240   240   000    Old_age   Always 
       -       25 (Lifetime Min/Max 20/34)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always 
       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always 
       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age 
Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always 
       -       0


As it was discussed many times - it should be fixed by increasing the 
hardcoded timouts.
Is it time to make the ATA timeout sysctl tunables?
There were patches from FreeNAS and some PRs about longer timeouts.

kern/136182: [ata] Heavy disk writes (e.g. ZFS resilver to a drive) can 
cause "adX: TIMEOUT - FLUSHCACHE retrying (1 retry left)" on console.
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/136182

kern/111023: [ata] [request] [patch] please expand ata timeouts
http://www.freebsd.org/cgi/query-pr.cgi?pr=111023

ATA/SATA DMA timeout issues
http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting#line-53

HowTo: Fix SATA DMA timeout issues on FreeBSD
http://linux-bsd-sharing.blogspot.com/2009/03/howto-fix-sata-dma-timeout-issues-on.html

Western Digital hard disks and ATA timeouts
http://www.mail-archive.com/freebsd-hardware@freebsd.org/msg03135.html

ata FLUSHCACHE timeout errors? [patch]
http://lists.freebsd.org/pipermail/freebsd-current/2009-April/005939.html

And I am sure, you can find many more reports floating around.

Miroslav Lachman