From owner-freebsd-stable@freebsd.org  Wed Mar  2 20:25:59 2016
Return-Path: <owner-freebsd-stable@freebsd.org>
Delivered-To: freebsd-stable@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8AC41AC2C86
 for <freebsd-stable@mailman.ysv.freebsd.org>;
 Wed,  2 Mar 2016 20:25:59 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 6C6011C13
 for <freebsd-stable@freebsd.org>; Wed,  2 Mar 2016 20:25:59 +0000 (UTC)
 (envelope-from jhb@freebsd.org)
Received: from ralph.baldwin.cx (c-73-231-226-104.hsd1.ca.comcast.net
 [73.231.226.104])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id 80BF1B96B;
 Wed,  2 Mar 2016 15:25:58 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: Harry Schmalzbauer <freebsd@omnilan.de>
Cc: FreeBSD Stable <freebsd-stable@freebsd.org>,
 Mark Saad <nonesuch@longcount.org>
Subject: Re: ahci-timeout regression in beta3
Date: Wed, 02 Mar 2016 09:32:41 -0800
Message-ID: <10965531.zQdbXDLmAc@ralph.baldwin.cx>
User-Agent: KMail/4.14.3 (FreeBSD/10.2-STABLE; KDE/4.14.3; amd64; ; )
In-Reply-To: <56D48DEF.5020304@omnilan.de>
References: <56D350B6.6090906@omnilan.de> <56D48DEF.5020304@omnilan.de>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Wed, 02 Mar 2016 15:25:58 -0500 (EST)
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-stable>, 
 <mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable/>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
 <mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Mar 2016 20:25:59 -0000

On Monday, February 29, 2016 07:29:03 PM Harry Schmalzbauer wrote:
>  Bez=C3=BCglich Harry Schmalzbauer's Nachricht vom 28.02.2016 20:55 (=
localtime):
> >  Hello,
> >
> > I have a remote machine with a probably defective ODD, but until r2=
94989
> > (from Jan 28th) I could boot with just these warnings:
> > (cd1:ahcich1:0:0:0): READ(10). CDB: 28 00 00 38 85 e0 00 00 01 00
> > (cd1:ahcich1:0:0:0): CAM status: SCSI Status Error
> > (cd1:ahcich1:0:0:0): SCSI status: Check Condition
> > (cd1:ahcich1:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered=
 read
> > error)
> > (cd1:ahcich1:0:0:0): Error 5, Unretryable error
> > (cd1:ahcich1:0:0:0): cddone: got error 0x5 back
> > =E2=80=A6
> >
> > beta3 doesn't boot anymore, it's hanging with ahci-timeouts:
> > ahcich2: Timeout on slot 11 port 0
> > ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00000800 tfd 40 der=
r
> > 00000000 cmd 0004cb17
> > (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 ae a3 50 40 5d =
01 00
> > 00 00 00
> > ...
> > (aprobe0:ahcich2:0:0:0) ATA_IDENTIFY. ACB eec 00 00 00 00 40 00 00 =
00 00
> > 00 00
> > (aprobe0:ahcich2:0:0:0) CAM status: Command timeout
> > (aprobe0:ahcich2:0:0:0) Error 5, Retry was blocked
> > ada1 detached
> > ...
> > The numbers (first ACB) and also the channel varies from time to ti=
me
>=20
> I could narrow it down to r295480
> (https://svnweb.freebsd.org/base?view=3Drevision&revision=3D295480)
>=20
> Reverting that lets the machine boot again.
>=20
> I captured verbose boot messages, finding out that problem relaxes wi=
th
> verbose-booting, since ahci seems to recover:
> =E2=80=A6
> TSC timecounter discards lower 1 bit(s)
> Timecounter "TSC-low" frequency 1746033500 Hz quality -100
> ahcich2: Timeout on slot 12 port 0
> ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00001000 tfd 40 serr
> 00000000 cmd 0004cc17
> ahcich2: AHCI reset...
> (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 71 a3 50 40 5d 01=
 00
> 00 00 00
> (ada1:ahcich2:0:0:0): CAM status: Command timeout
> (ada1:ahcich2:0:0:0): Retrying command
> ahcich2: SATA connect time=3D100us status=3D00000123
> ahcich2: AHCI reset: device found
> ahcich2: AHCI reset: device ready after 100ms
> ahcich1: SNTF 0x0001
> ahcich1: SNTF 0x0001
> =E2=80=A6
>=20
> I have checked twice that r295480 introduces boot failure here.
>=20
> I have absolutely no idea where/how/why/what race happens...
>=20
> Thanks for any hints,

That is most bizarre.  Does HEAD boot fine on this machine?  The change=

in question probably alters the timing of startup a bit since the rando=
m
kthread is placed on the run queue later which might affect the relativ=
e
order of kthreads as they start executing, but that would just mean it =
is
exposting a race in some other part of the system.

--=20
John Baldwin