From owner-freebsd-stable@freebsd.org Wed Mar 2 20:25:59 2016 Return-Path: Delivered-To: freebsd-stable@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8AC41AC2C86 for ; Wed, 2 Mar 2016 20:25:59 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 6C6011C13 for ; Wed, 2 Mar 2016 20:25:59 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from ralph.baldwin.cx (c-73-231-226-104.hsd1.ca.comcast.net [73.231.226.104]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 80BF1B96B; Wed, 2 Mar 2016 15:25:58 -0500 (EST) From: John Baldwin To: Harry Schmalzbauer Cc: FreeBSD Stable , Mark Saad Subject: Re: ahci-timeout regression in beta3 Date: Wed, 02 Mar 2016 09:32:41 -0800 Message-ID: <10965531.zQdbXDLmAc@ralph.baldwin.cx> User-Agent: KMail/4.14.3 (FreeBSD/10.2-STABLE; KDE/4.14.3; amd64; ; ) In-Reply-To: <56D48DEF.5020304@omnilan.de> References: <56D350B6.6090906@omnilan.de> <56D48DEF.5020304@omnilan.de> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 02 Mar 2016 15:25:58 -0500 (EST) X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Mar 2016 20:25:59 -0000 On Monday, February 29, 2016 07:29:03 PM Harry Schmalzbauer wrote: > Bez=C3=BCglich Harry Schmalzbauer's Nachricht vom 28.02.2016 20:55 (= localtime): > > Hello, > > > > I have a remote machine with a probably defective ODD, but until r2= 94989 > > (from Jan 28th) I could boot with just these warnings: > > (cd1:ahcich1:0:0:0): READ(10). CDB: 28 00 00 38 85 e0 00 00 01 00 > > (cd1:ahcich1:0:0:0): CAM status: SCSI Status Error > > (cd1:ahcich1:0:0:0): SCSI status: Check Condition > > (cd1:ahcich1:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered= read > > error) > > (cd1:ahcich1:0:0:0): Error 5, Unretryable error > > (cd1:ahcich1:0:0:0): cddone: got error 0x5 back > > =E2=80=A6 > > > > beta3 doesn't boot anymore, it's hanging with ahci-timeouts: > > ahcich2: Timeout on slot 11 port 0 > > ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00000800 tfd 40 der= r > > 00000000 cmd 0004cb17 > > (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 ae a3 50 40 5d = 01 00 > > 00 00 00 > > ... > > (aprobe0:ahcich2:0:0:0) ATA_IDENTIFY. ACB eec 00 00 00 00 40 00 00 = 00 00 > > 00 00 > > (aprobe0:ahcich2:0:0:0) CAM status: Command timeout > > (aprobe0:ahcich2:0:0:0) Error 5, Retry was blocked > > ada1 detached > > ... > > The numbers (first ACB) and also the channel varies from time to ti= me >=20 > I could narrow it down to r295480 > (https://svnweb.freebsd.org/base?view=3Drevision&revision=3D295480) >=20 > Reverting that lets the machine boot again. >=20 > I captured verbose boot messages, finding out that problem relaxes wi= th > verbose-booting, since ahci seems to recover: > =E2=80=A6 > TSC timecounter discards lower 1 bit(s) > Timecounter "TSC-low" frequency 1746033500 Hz quality -100 > ahcich2: Timeout on slot 12 port 0 > ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00001000 tfd 40 serr > 00000000 cmd 0004cc17 > ahcich2: AHCI reset... > (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 71 a3 50 40 5d 01= 00 > 00 00 00 > (ada1:ahcich2:0:0:0): CAM status: Command timeout > (ada1:ahcich2:0:0:0): Retrying command > ahcich2: SATA connect time=3D100us status=3D00000123 > ahcich2: AHCI reset: device found > ahcich2: AHCI reset: device ready after 100ms > ahcich1: SNTF 0x0001 > ahcich1: SNTF 0x0001 > =E2=80=A6 >=20 > I have checked twice that r295480 introduces boot failure here. >=20 > I have absolutely no idea where/how/why/what race happens... >=20 > Thanks for any hints, That is most bizarre. Does HEAD boot fine on this machine? The change= in question probably alters the timing of startup a bit since the rando= m kthread is placed on the run queue later which might affect the relativ= e order of kthreads as they start executing, but that would just mean it = is exposting a race in some other part of the system. --=20 John Baldwin