Date:      Thu, 23 Jun 2011 16:54:44 +0400
From:      Andrey Chernov <ache@FreeBSD.ORG>
To:        "Kenneth D. Merry" <ken@FreeBSD.ORG>, will@FreeBSD.ORG
Cc:        Kostik Belousov <kostikbel@gmail.com>, Eir Nym <eirnym@gmail.com>, "Justin T. Gibbs" <gibbs@FreeBSD.ORG>, current@FreeBSD.ORG, will@FreeBSD.ORG
Subject:   Re: Exactly that commit (was Re: Latest -current 100% hang at the late boot stage)
Message-ID:  <20110623125443.GA42879@vniz.net>
In-Reply-To: <20110622200919.GA72504@nargothrond.kdm.org>
References:  <20110620001912.GA60252@vniz.net> <4DFEAD4F.1040603@FreeBSD.org> <20110620070222.GA74009@vniz.net> <20110620080146.GF48734@deviant.kiev.zoral.com.ua> <20110620114656.GA83524@vniz.net> <20110621161719.GA16166@nargothrond.kdm.org> <20110621204934.GB9877@vniz.net> <20110622035404.GA38834@nargothrond.kdm.org> <20110622041325.GA13754@vniz.net> <20110622200919.GA72504@nargothrond.kdm.org>

Apparently there is another problem, related to plain ATA CD/DVD drives. With
r223443 the nature of the hang has changed: I no longer see anything waiting in
the "caplck" state, just xpt_thrd waiting in the "ccb_scan" state forever, and
these messages repeating:
run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
run_interrupt_driven_hooks: still waiting after 120 seconds for xpt_config
...
and so on.

On Wed, Jun 22, 2011 at 02:09:19PM -0600, Kenneth D. Merry wrote:
> On Wed, Jun 22, 2011 at 08:13:25 +0400, Andrey Chernov wrote:
> > On Tue, Jun 21, 2011 at 09:54:04PM -0600, Kenneth D. Merry wrote:
> > > These two are interesting:
> > > 
> > > > http://img825.imageshack.us/img825/1249/21062011014m.jpg
> > > > http://img839.imageshack.us/img839/3791/21062011015.jpg
> > > 
> > > It looks like the GEOM event thread is stuck inside the cd(4) driver.  The
> > > cd(4) driver is trying to acquire the peripheral lock, and is sleeping
> > > until it gets it.
> > > 
> > > What isn't clear is who is holding it.  The ps output shows an idle thread
> > > running on CPU 1, and thread 100014 (taskq) running on CPU 0.
> > > Unfortunately I don't see a stack trace for that.  (I might have missed
> > > it.)
> > > 
> > > Do you happen to have the image with the stack trace for that thread?
> > 
> > I don't have the image because no disks are mounted at that stage and the 
> > swap slice is not attached. But I can issue more specific DDB commands to 
> > narrow it down, just say what you need in detail.
> > 
> > BTW, the machine has 2 DVD drives, both attached to the Marvell IDE plain
> > ATA interface; they always worked before.
> > 
> > Are you sure that something is holding the lock? 'show lock' shows
> > absolutely nothing, it is empty.
> 
> Well, after looking at the code a little more, it looks like the "lock"
> that is being held is the periph lock, which is really just a flag.
> So 'show lock' wouldn't show anything relevant.  Here's cam_periph_hold():
> 
> int
> cam_periph_hold(struct cam_periph *periph, int priority)
> {
> 	int error;
> 
> 	/*
> 	 * Increment the reference count on the peripheral
> 	 * while we wait for our lock attempt to succeed
> 	 * to ensure the peripheral doesn't disappear out
> 	 * from under us while we sleep.
> 	 */
> 
> 	if (cam_periph_acquire(periph) != CAM_REQ_CMP)
> 		return (ENXIO);
> 
> 	mtx_assert(periph->sim->mtx, MA_OWNED);
> 	while ((periph->flags & CAM_PERIPH_LOCKED) != 0) {
> 		periph->flags |= CAM_PERIPH_LOCK_WANTED;
> 		if ((error = mtx_sleep(periph, periph->sim->mtx, priority,
> 		     "caplck", 0)) != 0) {
> 			cam_periph_release_locked(periph);
> 			return (error);
> 		}
> 	}
> 
> 	periph->flags |= CAM_PERIPH_LOCKED;
> 	return (0);
> }
> 
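
To make the flag-style "lock" concrete, here is a minimal user-space sketch
modeled on the quoted cam_periph_hold(). It is only an illustration: pthreads
stand in for the SIM mutex and for mtx_sleep()/wakeup(), the names
periph_model, model_init, model_hold and model_unhold are hypothetical, and
the unhold side is an assumption about what the counterpart has to do, not a
copy of the kernel source.

```c
#include <assert.h>
#include <pthread.h>

#define MODEL_LOCKED      0x01	/* stands in for CAM_PERIPH_LOCKED */
#define MODEL_LOCK_WANTED 0x02	/* stands in for CAM_PERIPH_LOCK_WANTED */

/* Hypothetical user-space stand-in for struct cam_periph. */
struct periph_model {
	pthread_mutex_t	mtx;		/* stands in for periph->sim->mtx */
	pthread_cond_t	cv;		/* stands in for the sleep channel */
	int		flags;
	int		refcount;
};

static void
model_init(struct periph_model *p)
{
	pthread_mutex_init(&p->mtx, NULL);
	pthread_cond_init(&p->cv, NULL);
	p->flags = 0;
	p->refcount = 0;
}

/* Caller must own p->mtx.  Sleeps while the flag is set, then sets it. */
static void
model_hold(struct periph_model *p)
{
	p->refcount++;
	while ((p->flags & MODEL_LOCKED) != 0) {
		p->flags |= MODEL_LOCK_WANTED;
		/* Drops p->mtx while asleep and retakes it on wakeup. */
		pthread_cond_wait(&p->cv, &p->mtx);
	}
	p->flags |= MODEL_LOCKED;
}

/* Caller must own p->mtx.  Clears the flag and wakes any waiters. */
static void
model_unhold(struct periph_model *p)
{
	assert((p->flags & MODEL_LOCKED) != 0);
	p->flags &= ~MODEL_LOCKED;
	if ((p->flags & MODEL_LOCK_WANTED) != 0) {
		p->flags &= ~MODEL_LOCK_WANTED;
		pthread_cond_broadcast(&p->cv);
	}
	p->refcount--;
}
```

The point of the sketch is that the "held" state lives only in a flag bit
protected by the mutex. A sleeping waiter does not own the mutex, so a tool
that lists owned locks has nothing to report for it.
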
> The GEOM event thread is stuck sleeping in the mtx_sleep() call above.  So
> that tells me that one of several things is going on:
> 
>  - There is a path in the cd(4) driver where it can call cam_periph_hold()
>    but not cam_periph_unhold().
> 
>  - There is another thread in the system that has called cam_periph_hold(),
>    and has gotten stuck before it can call cam_periph_unhold().
> 
>  - The hold/unhold logic is broken, and there is a case where a thread
>    waiting for the lock can miss the wakeup.  After looking at the code, I
>    don't think this is the case, but I may have missed something.
> 
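
The first case in the list above (a path that holds but never unholds) can be
sketched in plain C. This is a hypothetical model only: hold_flag,
flag_try_hold, open_buggy and open_fixed are made-up names, and nothing here
claims that cd(4) actually contains such a path. The try-hold helper reports
"would sleep" instead of sleeping, so the effect can be checked without
threads.

```c
#include <assert.h>
#include <stdbool.h>

#define FLAG_LOCKED 0x01	/* stands in for CAM_PERIPH_LOCKED */

/* Hypothetical stand-in for the flags word of a peripheral. */
struct hold_flag {
	int flags;
};

/* True if a real hold attempt would have to sleep right now. */
static bool
flag_would_sleep(const struct hold_flag *p)
{
	return ((p->flags & FLAG_LOCKED) != 0);
}

/* Non-blocking model of hold: returns -1 where the kernel would sleep. */
static int
flag_try_hold(struct hold_flag *p)
{
	if (flag_would_sleep(p))
		return (-1);
	p->flags |= FLAG_LOCKED;
	return (0);
}

static void
flag_unhold(struct hold_flag *p)
{
	assert((p->flags & FLAG_LOCKED) != 0);
	p->flags &= ~FLAG_LOCKED;
}

/* Hypothetical open routine whose error path forgets to unhold. */
static int
open_buggy(struct hold_flag *p, int error)
{
	if (flag_try_hold(p) != 0)
		return (-1);
	if (error != 0)
		return (error);	/* BUG: returns with the flag still set */
	flag_unhold(p);
	return (0);
}

/* Same routine with a single exit that always unholds. */
static int
open_fixed(struct hold_flag *p, int error)
{
	if (flag_try_hold(p) != 0)
		return (-1);
	flag_unhold(p);
	return (error);
}
```

After one failed open_buggy() call the flag stays set, so every later hold
attempt on that peripheral would sleep forever. That would look the same from
the outside as a thread parked in "caplck".
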
> So it is probably one of the first two cases.  From the dmesg, I only see
> cd1 listed, not cd0.  So it is possible that cd0 is stuck in the probe code
> somewhere, and the geom code just gets stuck trying to open it when the
> probe hasn't completed.
> 
> Seeing the stack trace for the taskq thread that is running on CPU 0
> (thread 100014) might be enlightening, it's hard to say.  That may or may
> not show the issue.
> 
> It's possible that this issue is directly related to the commit in
> question; perhaps there is an error being returned that wasn't returned
> before and it isn't being handled right in the cd(4) driver.  (The cd(4)
> driver wasn't touched in the commit.)
> 
> It's also possible that the commit in question just changed the timing and
> your system is hitting a race that was there previously.
> 
> Ken
> -- 
> Kenneth Merry
> ken@FreeBSD.ORG


-- 
http://ache.vniz.net/


