From owner-freebsd-stable@freebsd.org Mon Feb 29 18:29:08 2016 Return-Path: Delivered-To: freebsd-stable@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A223AAB8C9D for ; Mon, 29 Feb 2016 18:29:08 +0000 (UTC) (envelope-from freebsd@omnilan.de) Received: from mx0.gentlemail.de (mx0.gentlemail.de [IPv6:2a00:e10:2800::a130]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 257901E44; Mon, 29 Feb 2016 18:29:07 +0000 (UTC) (envelope-from freebsd@omnilan.de) Received: from mh0.gentlemail.de (mh0.gentlemail.de [78.138.80.135]) by mx0.gentlemail.de (8.14.5/8.14.5) with ESMTP id u1TIT4SB079906; Mon, 29 Feb 2016 19:29:04 +0100 (CET) (envelope-from freebsd@omnilan.de) Received: from titan.inop.mo1.omnilan.net (titan.inop.mo1.omnilan.net [IPv6:2001:a60:f0bb:1::3:1]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mh0.gentlemail.de (Postfix) with ESMTPSA id A94AB968; Mon, 29 Feb 2016 19:29:03 +0100 (CET) Message-ID: <56D48DEF.5020304@omnilan.de> Date: Mon, 29 Feb 2016 19:29:03 +0100 From: Harry Schmalzbauer Organization: OmniLAN User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; de-DE; rv:1.9.2.8) Gecko/20100906 Lightning/1.0b2 Thunderbird/3.1.2 MIME-Version: 1.0 To: FreeBSD Stable CC: Mark Saad , John Baldwin Subject: Re: ahci-timeout regression in beta3 References: <56D350B6.6090906@omnilan.de> In-Reply-To: <56D350B6.6090906@omnilan.de> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Greylist: ACL 119 matched, not delayed by milter-greylist-4.2.7 (mx0.gentlemail.de [78.138.80.130]); Mon, 29 Feb 2016 19:29:04 +0100 (CET) X-Milter: Spamilter (Reciever: mx0.gentlemail.de; Sender-ip: 78.138.80.135; Sender-helo: mh0.gentlemail.de; ) X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 29 Feb 2016 18:29:08 -0000 Bezüglich Harry Schmalzbauer's Nachricht vom 28.02.2016 20:55 (localtime): > Hello, > > I have a remote machine with a probably defective ODD, but until r294989 > (from Jan 28th) I could boot with just these warnings: > (cd1:ahcich1:0:0:0): READ(10). CDB: 28 00 00 38 85 e0 00 00 01 00 > (cd1:ahcich1:0:0:0): CAM status: SCSI Status Error > (cd1:ahcich1:0:0:0): SCSI status: Check Condition > (cd1:ahcich1:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read > error) > (cd1:ahcich1:0:0:0): Error 5, Unretryable error > (cd1:ahcich1:0:0:0): cddone: got error 0x5 back > … > > beta3 doesn't boot anymore, it's hanging with ahci-timeouts: > ahcich2: Timeout on slot 11 port 0 > ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00000800 tfd 40 derr > 00000000 cmd 0004cb17 > (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 ae a3 50 40 5d 01 00 > 00 00 00 > ... > (aprobe0:ahcich2:0:0:0) ATA_IDENTIFY. ACB eec 00 00 00 00 40 00 00 00 00 > 00 00 > (aprobe0:ahcich2:0:0:0) CAM status: Command timeout > (aprobe0:ahcich2:0:0:0) Error 5, Retry was blocked > ada1 detached > ... > The numbers (first ACB) and also the channel varies from time to time I could narrow it down to r295480 (https://svnweb.freebsd.org/base?view=revision&revision=295480) Reverting that lets the machine boot again. I captured verbose boot messages, finding out that problem relaxes with verbose-booting, since ahci seems to recover: … TSC timecounter discards lower 1 bit(s) Timecounter "TSC-low" frequency 1746033500 Hz quality -100 ahcich2: Timeout on slot 12 port 0 ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00001000 tfd 40 serr 00000000 cmd 0004cc17 ahcich2: AHCI reset... (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 71 a3 50 40 5d 01 00 00 00 00 (ada1:ahcich2:0:0:0): CAM status: Command timeout (ada1:ahcich2:0:0:0): Retrying command ahcich2: SATA connect time=100us status=00000123 ahcich2: AHCI reset: device found ahcich2: AHCI reset: device ready after 100ms ahcich1: SNTF 0x0001 ahcich1: SNTF 0x0001 … I have checked twice that r295480 introduces boot failure here. I have absolutely no idea where/how/why/what race happens... Thanks for any hints, -Harry