Date:      Thu, 23 Jun 2011 15:51:36 +0300
From:      Andriy Gapon <avg@FreeBSD.org>
To:        "Kenneth D. Merry" <ken@FreeBSD.org>
Cc:        Andrey Chernov <ache@FreeBSD.org>, current@FreeBSD.org, Eir Nym <eirnym@gmail.com>, Kostik Belousov <kostikbel@gmail.com>, "Justin T. Gibbs" <gibbs@FreeBSD.org>, will@FreeBSD.org
Subject:   Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
Message-ID:  <4E0336D8.80300@FreeBSD.org>
In-Reply-To: <20110622200919.GA72504@nargothrond.kdm.org>
References:  <20110619232307.GA57530@vniz.net> <20110620001912.GA60252@vniz.net>	<4DFEAD4F.1040603@FreeBSD.org> <20110620070222.GA74009@vniz.net>	<20110620080146.GF48734@deviant.kiev.zoral.com.ua>	<20110620114656.GA83524@vniz.net>	<20110621161719.GA16166@nargothrond.kdm.org>	<20110621204934.GB9877@vniz.net>	<20110622035404.GA38834@nargothrond.kdm.org>	<20110622041325.GA13754@vniz.net> <20110622200919.GA72504@nargothrond.kdm.org>

on 22/06/2011 23:09 Kenneth D. Merry said the following:
> The GEOM event thread is stuck sleeping in the mtx_sleep() call above.  So
> that tells me that one of several things is going on:
> 
>  - There is a path in the cd(4) driver where it can call cam_periph_hold()
>    but not cam_periph_unhold().
> 
>  - There is another thread in the system that has called cam_periph_hold(),
>    and has gotten stuck before it can call cam_periph_unhold().
> 
>  - The hold/unhold logic is broken, and there is a case where a thread
>    waiting for the lock can miss the wakeup.  After looking at the code, I
>    don't think this is the case, but I may have missed something.
> 
> So it is probably one of the first two cases.  From the dmesg, I only see
> cd1 listed, not cd0.  So it is possible that cd0 is stuck in the probe code
> somewhere, and the geom code just gets stuck trying to open it when the
> probe hasn't completed.
> 
> Seeing the stack trace for the taskq thread that is running on CPU 0
> (process 100014) might be enlightening, it's hard to say.  That may or may
> not show the issue.
> 
> It's possible that this issue is directly related to the commit in
> question; perhaps there is an error being returned that wasn't returned
> before and it isn't being handled right in the cd(4) driver.  (The cd(4)
> driver wasn't touched in the commit.)
> 
> It's also possible that the commit in question just changed the timing and
> your system is hitting a race that was there previously.

I suspect that this last possibility, a pre-existing race exposed by changed
timing, is actually the case.
More than once I've seen the kernel boot get stuck non-deterministically in the
cd driver under qemu.  Other people have bumped into this too.
For example, here is one of the reports I found with a search; it is not exactly
the same as what ache has reported, but it is somewhat similar:
http://lists.freebsd.org/pipermail/freebsd-current/2010-October/020336.html

-- 
Andriy Gapon
