Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 05 Mar 2016 13:11:13 +0100
From:      Harry Schmalzbauer <freebsd@omnilan.de>
To:        John Baldwin <jhb@freebsd.org>
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>, Mark Saad <nonesuch@longcount.org>
Subject:   Re: ahci-timeout regression in beta3
Message-ID:  <56DACCE1.80606@omnilan.de>
In-Reply-To: <10965531.zQdbXDLmAc@ralph.baldwin.cx>
References:  <56D350B6.6090906@omnilan.de> <56D48DEF.5020304@omnilan.de> <10965531.zQdbXDLmAc@ralph.baldwin.cx>

next in thread | previous in thread | raw e-mail | index | archive | help
Bezüglich John Baldwin's Nachricht vom 02.03.2016 18:32 (localtime):
> On Monday, February 29, 2016 07:29:03 PM Harry Schmalzbauer wrote:
>>  Bezüglich Harry Schmalzbauer's Nachricht vom 28.02.2016 20:55 (localtime):
>>>  Hello,
>>>
>>> I have a remote machine with a probably defective ODD, but until r294989
>>> (from Jan 28th) I could boot with just these warnings:
>>> (cd1:ahcich1:0:0:0): READ(10). CDB: 28 00 00 38 85 e0 00 00 01 00
>>> (cd1:ahcich1:0:0:0): CAM status: SCSI Status Error
>>> (cd1:ahcich1:0:0:0): SCSI status: Check Condition
>>> (cd1:ahcich1:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read
>>> error)
>>> (cd1:ahcich1:0:0:0): Error 5, Unretryable error
>>> (cd1:ahcich1:0:0:0): cddone: got error 0x5 back
>>> …
>>>
>>> beta3 doesn't boot anymore, it's hanging with ahci-timeouts:
>>> ahcich2: Timeout on slot 11 port 0
>>> ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00000800 tfd 40 derr
>>> 00000000 cmd 0004cb17
>>> (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 ae a3 50 40 5d 01 00
>>> 00 00 00
>>> ...
>>> (aprobe0:ahcich2:0:0:0) ATA_IDENTIFY. ACB eec 00 00 00 00 40 00 00 00 00
>>> 00 00
>>> (aprobe0:ahcich2:0:0:0) CAM status: Command timeout
>>> (aprobe0:ahcich2:0:0:0) Error 5, Retry was blocked
>>> ada1 detached
>>> ...
>>> The numbers (first ACB) and also the channel varies from time to time
>>
>> I could narrow it down to r295480
>> (https://svnweb.freebsd.org/base?view=revision&revision=295480)
>>
>> Reverting that lets the machine boot again.
>>
>> I captured verbose boot messages, finding out that problem relaxes with
>> verbose-booting, since ahci seems to recover:
>> …
>> TSC timecounter discards lower 1 bit(s)
>> Timecounter "TSC-low" frequency 1746033500 Hz quality -100
>> ahcich2: Timeout on slot 12 port 0
>> ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00001000 tfd 40 serr
>> 00000000 cmd 0004cc17
>> ahcich2: AHCI reset...
>> (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 71 a3 50 40 5d 01 00
>> 00 00 00
>> (ada1:ahcich2:0:0:0): CAM status: Command timeout
>> (ada1:ahcich2:0:0:0): Retrying command
>> ahcich2: SATA connect time=100us status=00000123
>> ahcich2: AHCI reset: device found
>> ahcich2: AHCI reset: device ready after 100ms
>> ahcich1: SNTF 0x0001
>> ahcich1: SNTF 0x0001
>> …
>>
>> I have checked twice that r295480 introduces boot failure here.
>>
>> I have absolutely no idea where/how/why/what race happens...
>>
>> Thanks for any hints,
> 
> That is most bizarre.  Does HEAD boot fine on this machine?  The change
> in question probably alters the timing of startup a bit since the random
> kthread is placed on the run queue later which might affect the relative
> order of kthreads as they start executing, but that would just mean it is
> exposting a race in some other part of the system.


Bizarre it is...
HEAD (r295683, 02/17/2016) boots fine.

BETA3 fails.

This time I checked with vendor-ISO images, while before it was a custom
setup rollout with local patches and special hw adaptions. Now I'm sure
it's not site-related.

With BETA3-iso, where booting fails, "random: unblocking device."
happens after timecounter initialization and before attaching ses0/cdX/adaX.
With HEAD-iso, where booting succeeds, "random: unblocking device."
happens way after ses0/adaX/cdX attached, right before rc.

On HEAD, ahci-devices attach in the same order as with -stable pre-r295480.
Since r295480, cdX attaches before adaX on -stable and while searching
for the cluprit, I had observed that attaching-order was a clear
indicator whether machine boots or not.

Sorry, I can't provide more useful info at this time, I just can
describe simple symptoms :-(


While playing with that machine I remember having had such a bizarre
problem before, arising with r284665. Here's what I found in my kernel
config:
# Don't build kernel with RACCT by default, which was enabled with r284665,
# since ahci(4) fails when using MSI with ahcichX timeout!
nooptions               RACCT           # Resource accounting framework
nooptions               RACCT_DEFAULT_TO_DISABLED # Set
kern.racct.enable=0 by default
nooptions               RCTL            # Resource limits

Perhpas it's related?!
https://lists.freebsd.org/pipermail/freebsd-stable/2015-July/082706.html

Thanks,

-Harry



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?56DACCE1.80606>