Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 02 Mar 2016 09:32:41 -0800
From:      John Baldwin <jhb@freebsd.org>
To:        Harry Schmalzbauer <freebsd@omnilan.de>
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>, Mark Saad <nonesuch@longcount.org>
Subject:   Re: ahci-timeout regression in beta3
Message-ID:  <10965531.zQdbXDLmAc@ralph.baldwin.cx>
In-Reply-To: <56D48DEF.5020304@omnilan.de>
References:  <56D350B6.6090906@omnilan.de> <56D48DEF.5020304@omnilan.de>

next in thread | previous in thread | raw e-mail | index | archive | help
On Monday, February 29, 2016 07:29:03 PM Harry Schmalzbauer wrote:
>  Bez=C3=BCglich Harry Schmalzbauer's Nachricht vom 28.02.2016 20:55 (=
localtime):
> >  Hello,
> >
> > I have a remote machine with a probably defective ODD, but until r2=
94989
> > (from Jan 28th) I could boot with just these warnings:
> > (cd1:ahcich1:0:0:0): READ(10). CDB: 28 00 00 38 85 e0 00 00 01 00
> > (cd1:ahcich1:0:0:0): CAM status: SCSI Status Error
> > (cd1:ahcich1:0:0:0): SCSI status: Check Condition
> > (cd1:ahcich1:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered=
 read
> > error)
> > (cd1:ahcich1:0:0:0): Error 5, Unretryable error
> > (cd1:ahcich1:0:0:0): cddone: got error 0x5 back
> > =E2=80=A6
> >
> > beta3 doesn't boot anymore, it's hanging with ahci-timeouts:
> > ahcich2: Timeout on slot 11 port 0
> > ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00000800 tfd 40 der=
r
> > 00000000 cmd 0004cb17
> > (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 ae a3 50 40 5d =
01 00
> > 00 00 00
> > ...
> > (aprobe0:ahcich2:0:0:0) ATA_IDENTIFY. ACB eec 00 00 00 00 40 00 00 =
00 00
> > 00 00
> > (aprobe0:ahcich2:0:0:0) CAM status: Command timeout
> > (aprobe0:ahcich2:0:0:0) Error 5, Retry was blocked
> > ada1 detached
> > ...
> > The numbers (first ACB) and also the channel varies from time to ti=
me
>=20
> I could narrow it down to r295480
> (https://svnweb.freebsd.org/base?view=3Drevision&revision=3D295480)
>=20
> Reverting that lets the machine boot again.
>=20
> I captured verbose boot messages, finding out that problem relaxes wi=
th
> verbose-booting, since ahci seems to recover:
> =E2=80=A6
> TSC timecounter discards lower 1 bit(s)
> Timecounter "TSC-low" frequency 1746033500 Hz quality -100
> ahcich2: Timeout on slot 12 port 0
> ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00001000 tfd 40 serr
> 00000000 cmd 0004cc17
> ahcich2: AHCI reset...
> (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 71 a3 50 40 5d 01=
 00
> 00 00 00
> (ada1:ahcich2:0:0:0): CAM status: Command timeout
> (ada1:ahcich2:0:0:0): Retrying command
> ahcich2: SATA connect time=3D100us status=3D00000123
> ahcich2: AHCI reset: device found
> ahcich2: AHCI reset: device ready after 100ms
> ahcich1: SNTF 0x0001
> ahcich1: SNTF 0x0001
> =E2=80=A6
>=20
> I have checked twice that r295480 introduces boot failure here.
>=20
> I have absolutely no idea where/how/why/what race happens...
>=20
> Thanks for any hints,

That is most bizarre.  Does HEAD boot fine on this machine?  The change=

in question probably alters the timing of startup a bit since the rando=
m
kthread is placed on the run queue later which might affect the relativ=
e
order of kthreads as they start executing, but that would just mean it =
is
exposting a race in some other part of the system.

--=20
John Baldwin



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?10965531.zQdbXDLmAc>