Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 23 Jun 2011 23:18:40 -0600
From:      Scott Long <scottl@samsco.org>
To:        gibbs@freebsd.org
Cc:        current@freebsd.org, "Kenneth D. Merry" <ken@freebsd.org>, Eir Nym <eirnym@gmail.com>, Kostik Belousov <kostikbel@gmail.com>, will@freebsd.org
Subject:   Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
Message-ID:  <55FDA4B1-CA5E-4304-9239-3AAF0FC6FF5F@samsco.org>
In-Reply-To: <4E03FDFD.70203@FreeBSD.org>
References:  <20110619232307.GA57530@vniz.net> <20110620001912.GA60252@vniz.net> <4DFEAD4F.1040603@FreeBSD.org> <20110620070222.GA74009@vniz.net> <20110620080146.GF48734@deviant.kiev.zoral.com.ua> <20110620114656.GA83524@vniz.net> <20110621161719.GA16166@nargothrond.kdm.org> <20110621204934.GB9877@vniz.net> <20110622035404.GA38834@nargothrond.kdm.org> <20110622041325.GA13754@vniz.net> <20110622200919.GA72504@nargothrond.kdm.org> <4E03FDFD.70203@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help

On Jun 23, 2011, at 9:01 PM, Justin T. Gibbs wrote:

> On 6/22/11 4:09 PM, Kenneth D. Merry wrote:
>> On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote:
>> > On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote:
>> >> These two are interesting:
>> >>
>> >>> http://img825.imageshack.us/img825/1249/21062011014m.jpg
>> >>> http://img839.imageshack.us/img839/3791/21062011015.jpg
>> >>
>> >> It looks like the GEOM event thread is stuck inside the cd(4)=20
> driver. The
>> >> cd(4) driver is trying to acquire the peripheral lock, and is =
sleeping
>> >> until it gets it.
>> >>
>> >> What isn't clear is who is holding it.
>=20
> ...
>=20
>> The GEOM event thread is stuck sleeping in the mtx_sleep() call =
above. So
>> that tells me that one of several things is going on:
>>=20
>> - There is a path in the cd(4) driver where it can call =
cam_periph_hold()
>> but not cam_periph_unhold().
>>=20
>> - There is another thread in the system that has called =
cam_periph_hold(),
>> and has gotten stuck before it can call cam_periph_unhold().
>>=20
>> - The hold/unhold logic is broken, and there is a case where a thread
>> waiting for the lock can miss the wakeup. After looking at the code, =
I
>> don't think this is the case, but I may have missed something.
>>=20
>> So it is probably one of the first two cases.
>=20
> ...
>=20
> I have a theory for the cause of this hang.
>=20
> The commit that triggers this problem added calls to g_access() during =
the
> geom_dev probe.  I believe this hit a race in cdregister() where
> the periph hold lock is dropped around the changer probe code.  Why =
the
> periph hold lock is dropped there, I do not know as I haven't fully
> reviewed the changer probe code.
>=20

Are you talking about this?

        disk_create(softc->disk, DISK_VERSION);
        cam_periph_lock(periph);
        cam_periph_unhold(periph);
[...]
        if (((cgd->ccb_h.target_lun > 0)
          && ((softc->quirks & CD_Q_NO_CHANGER) =3D=3D 0))
         || ((softc->quirks & CD_Q_CHANGER) !=3D 0)) {

The unhold there compliments the hold that was done prior to =
disk_create().  The hold/unhold is done as a hack around the need to =
drop the periph/sim mutex while calling disk_create(), due to the =
later's insistence on using blocking calls.  I've wanted to re-think how =
that pattern is done (it's the same gross hack in nearly all of the =
periph drivers), but haven't gotten around to it.  If the 'hold' =
semaphore needs to be held longer to prevent the race that you're =
theorizing, then it should be possible to simply extend its coverage in =
the code block, but I'm not sure if it'll result in an unintended =
deadlock with the changer enumeration/matching code.  I _think_ that =
it'll be ok, but the density of magic in the code is a bit overwhelming =
at this time of night =3D-)

Scott






Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?55FDA4B1-CA5E-4304-9239-3AAF0FC6FF5F>