From: Scott Long
Date: Tue, 14 Feb 2012 16:42:47 -0700
To: Victor Balada Diaz
Cc: Harald Schmalzbauer, Alexander Motin, freebsd-stable@freebsd.org, Jeremy Chadwick, Claudius Herder
Subject: Re: problems with AHCI on FreeBSD 8.2
Message-Id: <6D5E973B-6D98-41D7-B5E9-64A497F0F9F5@samsco.org>
In-Reply-To: <20120214233420.GU2010@equilibrium.bsdes.net>

On Feb 14, 2012, at 4:34 PM, Victor Balada Diaz wrote:

> On Tue, Feb 14, 2012 at 03:09:58PM -0800, Jeremy Chadwick wrote:
>> On Tue, Feb 14, 2012 at 11:15:27PM +0100, Victor Balada Diaz wrote:
>>> On Tue, Feb 14, 2012 at 06:17:19PM +0100, Harald Schmalzbauer wrote:
>>>> Jeremy Chadwick wrote on 14.02.2012 17:50 (localtime):
>>>>> On Tue, Feb 14, 2012 at 04:55:10PM +0100, Claudius Herder wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I have got a quite similar problem with AHCI on FreeBSD 8.2 and it still
>>>>>> persists on FreeBSD 9.0 release.
>>>>>>
>>>>>> Switching from ahci to ataahci resolved the problem for me too.
>>>>>>
>>>>>> I'm using gmirror for swap, system is on a zpool and the problem first
>>>>>> occurred during a zpool scrub, but it is easily reproducible with dd.
>>>>>>
>>>>>> The timeouts only occur when writing to disks, dd if=/dev/ada{0|1}
>>>>>> of=/dev/null is not an issue.
>>>>>> Sometimes I need to power off the server because after a reboot one disk
>>>>>> is still missing.
>>>>>>
>>>>>> I really would like to help in this issue, so let me know if you need
>>>>>> any more information.
>>>>> I find it interesting that, at least so far, the only people reporting
>>>>> problems of this type with the ahci.ko driver are people using Samsung
>>>>> disks. The only difference is that your models are F1s while the OP's
>>>>> are F2s.
>>>>
>>>> I saw such timeouts long ago and mav@ had a look at my postings and he
>>>> mentioned it could be an NCQ problem.
>>>> I suspected the disks' firmware.
>>>> I never tracked it down further, because replacing the Samsung (F3
>>>> in that case) disks with Hitachi ones solved all my problems and gave a
>>>> big performance kick as well (with zfs).
>>>> You can find the discussion here:
>>>> http://lists.freebsd.org/pipermail/freebsd-stable/2010-February/055374.html
>>>>
>>>
>>> You gave me a good idea: try to disable NCQ and see if that's the fault. So
>>> I went and applied the attached patch. After it, I can no longer reproduce
>>> the issue with the ahci driver.
>>>
>>> I know this is not a solution because it disables NCQ at controller level
>>> instead of disk level, but at least we know for sure where the problem is.
>>>
>>> I think the solution would be to add a new quirk ADA_Q_NONCQ in
>>> sys/cam/ata/ata_da.c.
>>> Quirks infrastructure is already built, so adding a new quirk for this seems
>>> easy.
>>>
>>> Is someone interested? Do you think there is a better solution?
>>>
>>> If someone is interested I can build a patch to add the ADA_Q_NONCQ quirk
>>> and add my drives to it.
>>
>> I took a stab at this, but I don't feel confident this is the proper
>> solution/method. I worry there's some sort of chicken-or-the-egg
>> condition here (quirk setup/matching comes *after* SATA capabilities
>> detection), or that it makes the code messier. Need mav@'s
>> recommendations on this.
>>
>> Below is for RELENG_8. I should note I haven't tested if this works, or
>> even compiles -- normally I don't provide such patches without testing
>> so I apologise in advance / user beware.
>
> You're amazingly fast. Thanks for all your help :)
>
> You start applying the quirks before
>
>	snprintf(announce_buf, sizeof(announce_buf),
>	    "kern.cam.ada.%d.quirks", periph->unit_number);
>	quirks = softc->quirks;
>	TUNABLE_INT_FETCH(announce_buf, &quirks);
>
> So you're breaking quirk setting at boot time.
>
> See my attached patch. I can confirm it works for me.
>
> Regards.
>

I don't think that disabling NCQ entirely is the right solution.
It's a tag starvation issue in the firmware, not a complete failure, and it
can be dealt with in the CAM XPT scheduler fairly efficiently. Alexander and
I talked about this recently, and though we differ on the details, a tag
hack is not in order, IMHO. In the short term, try just using
"camcontrol tags ada0 -N 1" to limit the concurrent commands to 1.

Scott
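
For anyone who wants to apply that short-term workaround to more than one
disk, here is a minimal sketch. It assumes the affected disks show up as
ada0 and ada1 (adjust to your system); limit_tags and DRY_RUN are
hypothetical names made up for this example, not part of FreeBSD. By
default it only prints the camcontrol(8) commands it would run.

```shell
#!/bin/sh
# Sketch only: cap the number of concurrent tagged commands per disk.
# limit_tags and DRY_RUN are illustrative names, not FreeBSD interfaces.

# DRY_RUN=1 (the default here) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

limit_tags() {
    dev="$1"            # CAM device name, e.g. ada0
    ntags="${2:-1}"     # number of tags to allow; defaults to 1
    if [ "$DRY_RUN" = "1" ]; then
        echo "camcontrol tags $dev -N $ntags"
    else
        camcontrol tags "$dev" -N "$ntags"
    fi
}

# Assumed device names; replace with the disks that show the timeouts.
for disk in ada0 ada1; do
    limit_tags "$disk" 1
done
```

Run it with DRY_RUN=0 as root to actually change the setting. This limits
queue depth for the listed devices only and leaves NCQ enabled in the
driver, which is the point of the suggestion above.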