From: Alexander Motin
Date: Tue, 23 Feb 2010 19:35:42 +0200
Message-ID: <4B8411EE.5030909@FreeBSD.org>
To: Harald Schmalzbauer
Cc: freebsd-stable@FreeBSD.org
Subject: Re: ahcich timeouts, only with ahci, not with ataahci
In-Reply-To: <4B840C54.3010304@omnilan.de>

Harald Schmalzbauer wrote:
> Alexander Motin wrote on 23.02.2010 17:18 (localtime):
> ...
>>> I guess if it's a HDD firmware issue with NCQ the hang shouldn't happen
>>> when NCQ is disabled.
>>
>> Just for case of real I/O timeout, run full surface test with SMART.
>
> Unfortunately I couldn't find new firmware from Samsung, although one
> drive shows version 1AG01113 while the other two have 1AG01118. But the
> timeout happened at different channels, so it's not one certain disk...
>
> One understanding question: If the drive doesn't complete a command,
> regardless if it's due to a firmware bug, a disk surface error or
> whatever, is there no way for the driver to terminate the request and
> take the drive offline after some time? This would be a very important
> behaviour for me. It doesn't make sense building RAIDz storage when a
> failing drive hangs the complete machine, even if the system partitions
> are on a complete different SSD.

That's what timeouts are for. When a timeout is detected, the driver
resets the device and reports an error to the upper layer. After
receiving the error, CAM reinitializes the device. If the device is
completely dead, reinitialization fails and the device is dropped
immediately. If the device is still alive, reinitialization succeeds and
CAM retries the command.
If all retries fail, the error is reported to the GEOM layer and then
possibly to the file system. I have no idea how RAIDZ behaves in that
case. Maybe after a few such errors it should drop the device from the
array.

A timeout is the worst possible case for any device, as it takes too
much time and gives no recovery information. The half-dead case is the
worst possible kind of timeout. It is difficult to say which way is
better: drop the last drive from a degraded array and lose all
information, or retry forever. There is probably no right answer.

--
Alexander Motin
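
The recovery sequence described above can be sketched as a small
simulation. This is only an illustrative model, not the ahci(4) or CAM
code: the function name, the callables, and the retry limit
(max_retries) are all hypothetical, since the message does not say how
many retries CAM performs.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"            # command finished, possibly after a retry
    DEVICE_DROPPED = "device dropped"  # reinitialization failed, device removed
    ERROR_TO_GEOM = "error to GEOM"    # every attempt timed out


def handle_command(command_times_out, reinit_succeeds, max_retries=3):
    """Model the timeout recovery path for a single command.

    command_times_out: callable(attempt) -> bool, True if that attempt times out.
    reinit_succeeds:   callable(attempt) -> bool, True if the device comes back.
    max_retries:       hypothetical retry limit (an assumption, not a real value).
    """
    # One initial attempt plus up to max_retries retries.
    for attempt in range(max_retries + 1):
        if not command_times_out(attempt):
            return Outcome.COMPLETED
        # Timeout: the driver resets the device and reports an error upward,
        # then the upper layer tries to reinitialize the device.
        if not reinit_succeeds(attempt):
            return Outcome.DEVICE_DROPPED
    # The device kept coming back but never completed the command.
    return Outcome.ERROR_TO_GEOM
```

In this model the half-dead device is the one for which reinit_succeeds
always returns True while command_times_out also always returns True:
every attempt costs a full timeout before the error finally reaches the
GEOM layer.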