From owner-freebsd-stable@FreeBSD.ORG  Tue Feb 23 17:35:53 2010
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id EED23106566B
	for <freebsd-stable@freebsd.org>; Tue, 23 Feb 2010 17:35:52 +0000 (UTC)
	(envelope-from mavbsd@gmail.com)
Received: from mail-fx0-f223.google.com (mail-fx0-f223.google.com
	[209.85.220.223])
	by mx1.freebsd.org (Postfix) with ESMTP id 79B598FC08
	for <freebsd-stable@freebsd.org>; Tue, 23 Feb 2010 17:35:52 +0000 (UTC)
Received: by fxm23 with SMTP id 23so20920fxm.3
	for <freebsd-stable@freebsd.org>; Tue, 23 Feb 2010 09:35:46 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=domainkey-signature:received:received:sender:message-id:date:from
	:user-agent:mime-version:to:cc:subject:references:in-reply-to
	:x-enigmail-version:content-type:content-transfer-encoding;
	bh=BCy/3iJ1GKX7MmlXyHH2NCcGqVJqH78MKp3LC7PU94g=;
	b=EtUE5Z93QHF/OMyvCajuZWSNQmZlenw6qmjDe9GRuy6zdTIHRyrV/+7szrgHPqdupY
	8xhbFHkX88FXzaWHmsuYVv3imODz0Rwtk+Yr78O3Zc4325kJObCJrQ5ufQYSYrT3iTEI
	5P2r8izlB6UmnwBvu6x6USgW2UYwpDiXmcNo8=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject
	:references:in-reply-to:x-enigmail-version:content-type
	:content-transfer-encoding;
	b=UzXfol0aLT/+PMiV5vgtrcq3I3V5WWqvpEj8FmDnKxHr3VZdqGQmeobA/mUfi3G/Yg
	FIIW1GlKociUBUmBTxiNe6xcOY13AZAy97iYgEGAnk9ANRnPwEnWDg3M5zi7TXgOvp2z
	u69iMpROLUk2pMzHT/GyagrhvOyFPpRB5sBpk=
Received: by 10.223.17.155 with SMTP id s27mr7574913faa.13.1266946546322;
	Tue, 23 Feb 2010 09:35:46 -0800 (PST)
Received: from mavbook.mavhome.dp.ua (pc.mavhome.dp.ua [212.86.226.226])
	by mx.google.com with ESMTPS id 15sm2644037fxm.8.2010.02.23.09.35.45
	(version=SSLv3 cipher=RC4-MD5); Tue, 23 Feb 2010 09:35:45 -0800 (PST)
Sender: Alexander Motin <mavbsd@gmail.com>
Message-ID: <4B8411EE.5030909@FreeBSD.org>
Date: Tue, 23 Feb 2010 19:35:42 +0200
From: Alexander Motin <mav@FreeBSD.org>
User-Agent: Thunderbird 2.0.0.23 (X11/20091212)
MIME-Version: 1.0
To: Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
References: <1266934981.00222684.1266922202@10.7.7.3>
	<4B83EFD4.8050403@FreeBSD.org> <4B83FD62.2020407@omnilan.de>
	<4B83FFEF.7010509@FreeBSD.org> <4B840C54.3010304@omnilan.de>
In-Reply-To: <4B840C54.3010304@omnilan.de>
X-Enigmail-Version: 0.96.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Cc: freebsd-stable@FreeBSD.org
Subject: Re: ahcich timeouts, only with ahci, not with ataahci
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 23 Feb 2010 17:35:53 -0000

Harald Schmalzbauer wrote:
> Alexander Motin schrieb am 23.02.2010 17:18 (localtime):
> ...
>>> I guess if it's a HDD firmware issue with NCQ the hang shouldn't happen
>>> when NCQ is disabled.
>>
>> Just for case of real I/O timeout, run full surface test with SMART.
> 
> Unfortunately I couldn't find new firmware from Samsung, although one
> drive shows version 1AG01113 while the other two have 1AG01118. But the
> timeout happened at different channels, so it's not one certain disk...
> 
> One understanding question: If the drive doesn't complete a command,
> regardless if it's due to a firmware bug, a disk surface error or
> whatever, is there no way for the driver to terminate the request and
> take the drive offline after some time? This would be a very important
> behaviour for me. It doesn't make sense building RAIDz storage when a
> failing drive hangs the complete machine, even if the system partitions
> are on a complete different SSD.

That's what timeouts are for. When a timeout is detected, the driver
resets the device and reports an error to the upper layer. After
receiving the error, CAM reinitializes the device. If the device is
completely dead, reinitialization will fail and the device will be
dropped immediately. If the device is still alive, reinitialization
succeeds and CAM retries the command. If all retries fail, the error is
reported to the GEOM layer and then possibly to the file system. I have
no idea how RAIDZ behaves in such a case. Maybe after a few such errors
it should drop that device from the array.
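
As a rough illustration of that flow, here is a toy Python model. It is
only a sketch, not the actual ahci(4) or CAM code: FakeDisk, run_command,
the Outcome names and the retry count of 4 are all made up for the
example.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    COMPLETED = "command completed"
    DEVICE_DROPPED = "reinit failed, device dropped"
    ERROR_TO_GEOM = "retries exhausted, error passed up to GEOM"


@dataclass
class FakeDisk:
    """Hypothetical stand-in for a disk; not a real FreeBSD structure."""
    alive: bool = True      # whether reinitialization can succeed
    timeouts_left: int = 0  # how many upcoming commands will time out

    def execute(self) -> bool:
        """Return True if the command completes, False if it times out."""
        if not self.alive:
            return False
        if self.timeouts_left > 0:
            self.timeouts_left -= 1
            return False
        return True

    def reinit(self) -> bool:
        """Model the reinitialization that follows a reset."""
        return self.alive


def run_command(disk: FakeDisk, max_retries: int = 4) -> Outcome:
    """One initial attempt plus up to max_retries retries (assumed count)."""
    for _ in range(1 + max_retries):
        if disk.execute():
            return Outcome.COMPLETED
        # Timeout: the driver resets the device and reports an error,
        # then the upper layer tries to reinitialize it.
        if not disk.reinit():
            return Outcome.DEVICE_DROPPED
        # Device answered the reinit, so the command is retried.
    return Outcome.ERROR_TO_GEOM
```

In this model a completely dead disk (alive=False) is dropped after the
first timeout, while a disk that keeps answering the reinit but never
completes the command costs one full timeout per attempt before the
error finally goes up the stack.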

A timeout is the worst possible case for any device, as it takes too
much time and doesn't give any recovery information. A half-dead device
is the worst possible case of timeout. It is difficult to say which way
is better: drop the last drive from a degraded array and lose all info,
or retry forever. There is probably no right answer.

-- 
Alexander Motin