Date:      Sat, 05 Mar 2016 23:41:03 +0100
From:      Harry Schmalzbauer <freebsd@omnilan.de>
To:        John Baldwin <jhb@freebsd.org>
Cc:        FreeBSD Stable <freebsd-stable@freebsd.org>, Mark Saad <nonesuch@longcount.org>
Subject:   Re: ahci-timeout regression in beta3
Message-ID:  <56DB607F.8010205@omnilan.de>
In-Reply-To: <6707098.Q23tZsJLSN@ralph.baldwin.cx>
References:  <56D350B6.6090906@omnilan.de> <10965531.zQdbXDLmAc@ralph.baldwin.cx> <56DACCE1.80606@omnilan.de> <6707098.Q23tZsJLSN@ralph.baldwin.cx>

Regarding John Baldwin's message of 05.03.2016 22:50 (localtime):
> On Saturday, March 05, 2016 01:11:13 PM Harry Schmalzbauer wrote:
>> Regarding John Baldwin's message of 02.03.2016 18:32 (localtime):
…
>> With BETA3-iso, where booting fails, "random: unblocking device."
>> happens after timecounter initialization and before attaching ses0/cdX/adaX.
>> With HEAD-iso, where booting succeeds, "random: unblocking device."
>> happens way after ses0/adaX/cdX attached, right before rc.
> 
> Yes, HEAD's /dev/random has many more changes than were put into 10 for
> BETA3.
> 
>> On HEAD, ahci-devices attach in the same order as with -stable pre-r295480.
>> Since r295480, cdX attaches before adaX on -stable and while searching
>> for the culprit, I had observed that attaching-order was a clear
>> indicator whether machine boots or not.
…
>> Perhaps it's related?!
>> https://lists.freebsd.org/pipermail/freebsd-stable/2015-July/082706.html
> 
> I think it's related in the sense that there is a timing race in ahci and
> that the /dev/random and RACCT changes alter the timing enough to trigger
> the race simply by changing the relative order of SYSINIT's during boot
> (and/or the amount of time between the ahci driver doing its initial
> probe and the second probe that is run for the interrupt config hooks that
> actually probes the attached SATA devices).


Thanks for your comment. I had that kind of race in mind, but I don't
have the skills to debug it myself, neither then nor now, and unfortunately
I don't have the time for an upgrade either ;-)

Meanwhile, though, I have deployed 10.3-RC1 without reverting r295480 (and
I also removed "nooptions RACCT" (+ RCTL), since the ineffective
»kern.racct.enable« was fixed some time after that problem hit me).

The good news is that these ahci timeouts haven't shown up anywhere else
yet: I've updated several _very_ similar setups (C200 chipsets, but none
with a suspected faulty ODD).

So it's clearly not a show stopper for 10.3.

But there is still a timing race to find, one that affects ahci timeouts,
the nastiest ones I have ever fought... And it is not very welcome to find
that a remote machine has stopped booting because of a faulty ODD one
wasn't aware of, since it boots fine with the previous FreeBSD release and
with other OSs.

Tell me if I can help out with my skills.

Thanks,

-Harry



