Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 26 Sep 2012 21:03:26 +0930
From:      Benjamin Close <Benjamin.Close@clearchain.com>
To:        Pawel Jakub Dawidek <pjd@freebsd.org>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org, "Kenneth D. Merry" <ken@freebsd.org>, jdp@freebsd.org, phk@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r240822 - head/sys/geom
Message-ID:  <5062E806.8000707@clearchain.com>
In-Reply-To: <20120926072005.GH1391@garage.freebsd.pl>
References:  <201209221241.q8MCfnhJ067937@svn.freebsd.org> <20120925233712.GA26920@nargothrond.kdm.org> <20120926072005.GH1391@garage.freebsd.pl>

next in thread | previous in thread | raw e-mail | index | archive | help
On 26/09/12 4:50 PM, Pawel Jakub Dawidek wrote:
> On Tue, Sep 25, 2012 at 05:37:12PM -0600, Kenneth D. Merry wrote:
>> On Sat, Sep 22, 2012 at 12:41:49 +0000, Pawel Jakub Dawidek wrote:
>>> Log:
>>>    Use the topology lock to protect list of providers while withering them.
>>>    It is possible that provider is destroyed while we are iterating over the
>>>    list.
> [...]
>>> --- head/sys/geom/geom_disk.c	Sat Sep 22 12:40:52 2012	(r240821)
>>> +++ head/sys/geom/geom_disk.c	Sat Sep 22 12:41:49 2012	(r240822)
>>> @@ -635,10 +635,13 @@ disk_gone(struct disk *dp)
>>>   	struct g_geom *gp;
>>>   	struct g_provider *pp;
>>>   
>>> +	g_topology_lock();
>>>   	gp = dp->d_geom;
>>> -	if (gp != NULL)
>>> +	if (gp != NULL) {
>>>   		LIST_FOREACH(pp, &gp->provider, provider)
>>>   			g_wither_provider(pp, ENXIO);
>>> +	}
>>> +	g_topology_unlock();
>>>   }
> [...]
>> This breaks devices going away in CAM.
>>
>> When the da(4) driver calls disk_gone(), it is necessarily holding the SIM
>> lock, which is a regular MTX_DEF mutex.  The GEOM topology lock is an sx
>> lock, and of WITNESS blows up because of that:
> [...]
>> disk_gone() needs to be callable from an interrupt context.  So it cannot
>> acquire the topology lock.
> I can reword what you said above a bit: disk_gone() modifies GEOM
> topology, so it has to hold the topology lock, thus it cannot be called
> from an interrupt context.
>
> We might be able to change the topology lock to LIST_FOREACH_SAFE(), as
> g_wither_provider() can only destroy current provider. This is because
> CAM own the geom and nobody should be able to mess with its provider's
> list, apart from CAM itself. So I'd need to know how CAM ensures that
> two disk_gone() cannot be called for one geom. The answer might be that
> those geoms have always only one provider, but then I cannot explain why
> we have loop there. Maybe jdp@ will remember why.
>
> I also read original commit message and I don't fully understand.
>
> It was 7 years ago so I'll quote entire commit message:
>
> 	Fix a bug that caused some /dev entries to continue to exist after
> 	the underlying drive had been hot-unplugged from the system.  Here
> 	is a specific example.  Filesystem code had opened /dev/da1s1e.
> 	Subsequently, the drive was hot-unplugged.  This (correctly) caused
> 	all of the associated /dev/da1* entries to be deleted.  When the
> 	filesystem later realized that the drive was gone it closed the
> 	device, reducing the write-access counts to 0 on the geom providers
> 	for da1s1e, da1s1, and da1.  This caused geom to re-taste the
> 	providers, resulting in the devices being created again.  When the
> 	drive was hot-plugged back in, it resulted in duplicate /dev entries
> 	for da1s1e, da1s1, and da1.
>
> 	This fix adds a new disk_gone() function which is called by CAM when a
> 	drive goes away.  It orphans all of the providers associated with the
> 	drive, setting an error condition of ENXIO in each one.  In addition,
> 	we prevent a re-taste on last close for writing if an error condition
> 	has been set in the provider.
>
> If disk is gone it should orphan its provider. This should cause
> orphaning depended providers, including da1s1 and then da1s1e.
> Orphaning sets provider's error and providers with the error set are not
> retasted. This is from g_access():
>
> 		else if (pp->acw != 0 && pp->acw == -dcw && pp->error == 0 &&
> 		    !(pp->geom->flags & G_GEOM_WITHER))
> 			g_post_event(g_new_provider_event, pp, M_WAITOK,
> 			    pp, NULL);
>
> My question is: is disk_geom() really needed? Maybe there is some small
> bug somewhere, but I think GEOM should handle all this without disk_gone().
>
I suspect the issue is with CAM/SIM and the geom commit has just exposed 
the problem.
I've been hitting hangs with device removal which appear very related to 
this:

http://lists.freebsd.org/pipermail/freebsd-usb/2012-September/011489.html





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?5062E806.8000707>