Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 29 Feb 2016 19:29:03 +0100
From:      Harry Schmalzbauer <freebsd@omnilan.de>
To:        FreeBSD Stable <freebsd-stable@freebsd.org>
Cc:        Mark Saad <nonesuch@longcount.org>, John Baldwin <jhb@freebsd.org>
Subject:   Re: ahci-timeout regression in beta3
Message-ID:  <56D48DEF.5020304@omnilan.de>
In-Reply-To: <56D350B6.6090906@omnilan.de>
References:  <56D350B6.6090906@omnilan.de>

next in thread | previous in thread | raw e-mail | index | archive | help
 Bezüglich Harry Schmalzbauer's Nachricht vom 28.02.2016 20:55 (localtime):
>  Hello,
>
> I have a remote machine with a probably defective ODD, but until r294989
> (from Jan 28th) I could boot with just these warnings:
> (cd1:ahcich1:0:0:0): READ(10). CDB: 28 00 00 38 85 e0 00 00 01 00
> (cd1:ahcich1:0:0:0): CAM status: SCSI Status Error
> (cd1:ahcich1:0:0:0): SCSI status: Check Condition
> (cd1:ahcich1:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read
> error)
> (cd1:ahcich1:0:0:0): Error 5, Unretryable error
> (cd1:ahcich1:0:0:0): cddone: got error 0x5 back
> …
>
> beta3 doesn't boot anymore, it's hanging with ahci-timeouts:
> ahcich2: Timeout on slot 11 port 0
> ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00000800 tfd 40 derr
> 00000000 cmd 0004cb17
> (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 ae a3 50 40 5d 01 00
> 00 00 00
> ...
> (aprobe0:ahcich2:0:0:0) ATA_IDENTIFY. ACB eec 00 00 00 00 40 00 00 00 00
> 00 00
> (aprobe0:ahcich2:0:0:0) CAM status: Command timeout
> (aprobe0:ahcich2:0:0:0) Error 5, Retry was blocked
> ada1 detached
> ...
> The numbers (first ACB) and also the channel varies from time to time

I could narrow it down to r295480
(https://svnweb.freebsd.org/base?view=revision&revision=295480)

Reverting that lets the machine boot again.

I captured verbose boot messages, finding out that problem relaxes with
verbose-booting, since ahci seems to recover:
…
TSC timecounter discards lower 1 bit(s)
Timecounter "TSC-low" frequency 1746033500 Hz quality -100
ahcich2: Timeout on slot 12 port 0
ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00001000 tfd 40 serr
00000000 cmd 0004cc17
ahcich2: AHCI reset...
(ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 71 a3 50 40 5d 01 00
00 00 00
(ada1:ahcich2:0:0:0): CAM status: Command timeout
(ada1:ahcich2:0:0:0): Retrying command
ahcich2: SATA connect time=100us status=00000123
ahcich2: AHCI reset: device found
ahcich2: AHCI reset: device ready after 100ms
ahcich1: SNTF 0x0001
ahcich1: SNTF 0x0001
…

I have checked twice that r295480 introduces boot failure here.

I have absolutely no idea where/how/why/what race happens...

Thanks for any hints,

-Harry




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?56D48DEF.5020304>