Date: Sat, 05 Mar 2016 23:41:03 +0100
From: Harry Schmalzbauer <freebsd@omnilan.de>
To: John Baldwin <jhb@freebsd.org>
Cc: FreeBSD Stable <freebsd-stable@freebsd.org>, Mark Saad <nonesuch@longcount.org>
Subject: Re: ahci-timeout regression in beta3
Message-ID: <56DB607F.8010205@omnilan.de>
In-Reply-To: <6707098.Q23tZsJLSN@ralph.baldwin.cx>
References: <56D350B6.6090906@omnilan.de> <10965531.zQdbXDLmAc@ralph.baldwin.cx> <56DACCE1.80606@omnilan.de> <6707098.Q23tZsJLSN@ralph.baldwin.cx>

Regarding John Baldwin's message of 05.03.2016 22:50 (localtime):
> On Saturday, March 05, 2016 01:11:13 PM Harry Schmalzbauer wrote:
>> Regarding John Baldwin's message of 02.03.2016 18:32 (localtime):
…
>> With BETA3-iso, where booting fails, "random: unblocking device."
>> happens after timecounter initialization and before attaching ses0/cdX/adaX.
>> With HEAD-iso, where booting succeeds, "random: unblocking device."
>> happens way after ses0/adaX/cdX attached, right before rc.
>
> Yes, HEAD's /dev/random has many more changes than were put into 10 for
> BETA3.
>
>> On HEAD, ahci-devices attach in the same order as with -stable pre-r295480.
>> Since r295480, cdX attaches before adaX on -stable and while searching
>> for the culprit, I had observed that attaching-order was a clear
>> indicator whether machine boots or not.
…
>> Perhaps it's related?!
>> https://lists.freebsd.org/pipermail/freebsd-stable/2015-July/082706.html
>
> I think it's related in the sense that there is a timing race in ahci and
> that the /dev/random and RACCT changes alter the timing enough to trigger
> the race simply by changing the relative order of SYSINIT's during boot
> (and/or the amount of time between the ahci driver doing its initial
> probe and the second probe that is run for the interrupt config hooks that
> actually probes the attached SATA devices).

Thanks for your comment. I had that kind of race in mind, but I don't
have the skills to debug it myself, neither then nor now, and
unfortunately also not the time for an upgrade ;-)

Meanwhile, I have deployed 10.3-RC1 without reverting r295480 (and with
"nooptions RACCT" (+ RCTL) removed as well, since the ineffective
"kern.racct.enable" was corrected some time after that problem hit me).

The good news is that these ahci-timeouts haven't shown up elsewhere
yet: I've updated several _very_ similar setups (C200 chipsets, but none
with a suspected faulty ODD). So it's clearly not a show stopper for 10.3.

But there's still a timing race to find, which affects ahci-timeouts.
The nastiest ones I ever fought... And it's not very welcome to find
that a remote machine has stopped booting because of a faulty ODD one
wasn't aware of, since it boots the previous FreeBSD release and other
OSs without problems.

Tell me if I can help out with my skills.

Thanks,

-Harry