From owner-svn-src-head@FreeBSD.ORG  Wed Sep 26 11:45:44 2012
Return-Path: <owner-svn-src-head@FreeBSD.ORG>
Delivered-To: svn-src-head@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 115EA106564A;
	Wed, 26 Sep 2012 11:45:44 +0000 (UTC)
	(envelope-from Benjamin.Close@clearchain.com)
Received: from mail.clearchain.com (leo.clearchain.com [199.73.29.74])
	by mx1.freebsd.org (Postfix) with ESMTP id BC0BC8FC19;
	Wed, 26 Sep 2012 11:45:43 +0000 (UTC)
Received: from Lynx.local ([101.113.33.231]) (authenticated bits=0)
	by mail.clearchain.com (8.14.5/8.14.5) with ESMTP id q8QBWQAx005470
	(version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO);
	Wed, 26 Sep 2012 21:02:37 +0930 (CST)
	(envelope-from Benjamin.Close@clearchain.com)
Message-ID: <5062E806.8000707@clearchain.com>
Date: Wed, 26 Sep 2012 21:03:26 +0930
From: Benjamin Close <Benjamin.Close@clearchain.com>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7;
	rv:15.0) Gecko/20120907 Thunderbird/15.0.1
MIME-Version: 1.0
To: Pawel Jakub Dawidek <pjd@freebsd.org>
References: <201209221241.q8MCfnhJ067937@svn.freebsd.org>
	<20120925233712.GA26920@nargothrond.kdm.org>
	<20120926072005.GH1391@garage.freebsd.pl>
In-Reply-To: <20120926072005.GH1391@garage.freebsd.pl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(mail.clearchain.com [199.73.29.74]);
	Wed, 26 Sep 2012 21:02:41 +0930 (CST)
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org,
	"Kenneth D. Merry" <ken@freebsd.org>, jdp@freebsd.org,
	phk@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r240822 - head/sys/geom
X-BeenThere: svn-src-head@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SVN commit messages for the src tree for head/-current
	<svn-src-head.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-head>,
	<mailto:svn-src-head-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-head>
List-Post: <mailto:svn-src-head@freebsd.org>
List-Help: <mailto:svn-src-head-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-head>,
	<mailto:svn-src-head-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Sep 2012 11:45:44 -0000

On 26/09/12 4:50 PM, Pawel Jakub Dawidek wrote:
> On Tue, Sep 25, 2012 at 05:37:12PM -0600, Kenneth D. Merry wrote:
>> On Sat, Sep 22, 2012 at 12:41:49 +0000, Pawel Jakub Dawidek wrote:
>>> Log:
>>>    Use the topology lock to protect list of providers while withering them.
>>>    It is possible that provider is destroyed while we are iterating over the
>>>    list.
> [...]
>>> --- head/sys/geom/geom_disk.c	Sat Sep 22 12:40:52 2012	(r240821)
>>> +++ head/sys/geom/geom_disk.c	Sat Sep 22 12:41:49 2012	(r240822)
>>> @@ -635,10 +635,13 @@ disk_gone(struct disk *dp)
>>>   	struct g_geom *gp;
>>>   	struct g_provider *pp;
>>>   
>>> +	g_topology_lock();
>>>   	gp = dp->d_geom;
>>> -	if (gp != NULL)
>>> +	if (gp != NULL) {
>>>   		LIST_FOREACH(pp, &gp->provider, provider)
>>>   			g_wither_provider(pp, ENXIO);
>>> +	}
>>> +	g_topology_unlock();
>>>   }
> [...]
>> This breaks devices going away in CAM.
>>
>> When the da(4) driver calls disk_gone(), it is necessarily holding the SIM
>> lock, which is a regular MTX_DEF mutex.  The GEOM topology lock is an sx
>> lock, and of WITNESS blows up because of that:
> [...]
>> disk_gone() needs to be callable from an interrupt context.  So it cannot
>> acquire the topology lock.
> I can reword what you said above a bit: disk_gone() modifies GEOM
> topology, so it has to hold the topology lock, thus it cannot be called
> from an interrupt context.
>
> We might be able to change the topology lock to LIST_FOREACH_SAFE(), as
> g_wither_provider() can only destroy current provider. This is because
> CAM own the geom and nobody should be able to mess with its provider's
> list, apart from CAM itself. So I'd need to know how CAM ensures that
> two disk_gone() cannot be called for one geom. The answer might be that
> those geoms have always only one provider, but then I cannot explain why
> we have loop there. Maybe jdp@ will remember why.
>
> I also read original commit message and I don't fully understand.
>
> It was 7 years ago so I'll quote entire commit message:
>
> 	Fix a bug that caused some /dev entries to continue to exist after
> 	the underlying drive had been hot-unplugged from the system.  Here
> 	is a specific example.  Filesystem code had opened /dev/da1s1e.
> 	Subsequently, the drive was hot-unplugged.  This (correctly) caused
> 	all of the associated /dev/da1* entries to be deleted.  When the
> 	filesystem later realized that the drive was gone it closed the
> 	device, reducing the write-access counts to 0 on the geom providers
> 	for da1s1e, da1s1, and da1.  This caused geom to re-taste the
> 	providers, resulting in the devices being created again.  When the
> 	drive was hot-plugged back in, it resulted in duplicate /dev entries
> 	for da1s1e, da1s1, and da1.
>
> 	This fix adds a new disk_gone() function which is called by CAM when a
> 	drive goes away.  It orphans all of the providers associated with the
> 	drive, setting an error condition of ENXIO in each one.  In addition,
> 	we prevent a re-taste on last close for writing if an error condition
> 	has been set in the provider.
>
> If disk is gone it should orphan its provider. This should cause
> orphaning depended providers, including da1s1 and then da1s1e.
> Orphaning sets provider's error and providers with the error set are not
> retasted. This is from g_access():
>
> 		else if (pp->acw != 0 && pp->acw == -dcw && pp->error == 0 &&
> 		    !(pp->geom->flags & G_GEOM_WITHER))
> 			g_post_event(g_new_provider_event, pp, M_WAITOK,
> 			    pp, NULL);
>
> My question is: is disk_geom() really needed? Maybe there is some small
> bug somewhere, but I think GEOM should handle all this without disk_gone().
>
I suspect the issue is with CAM/SIM and the geom commit has just exposed 
the problem.
I've been hitting hangs with device removal which appear very related to 
this:

http://lists.freebsd.org/pipermail/freebsd-usb/2012-September/011489.html