Date: Wed, 26 Sep 2012 21:03:26 +0930 From: Benjamin Close <Benjamin.Close@clearchain.com> To: Pawel Jakub Dawidek <pjd@freebsd.org> Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, "Kenneth D. Merry" <ken@freebsd.org>, jdp@freebsd.org, phk@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r240822 - head/sys/geom Message-ID: <5062E806.8000707@clearchain.com> In-Reply-To: <20120926072005.GH1391@garage.freebsd.pl> References: <201209221241.q8MCfnhJ067937@svn.freebsd.org> <20120925233712.GA26920@nargothrond.kdm.org> <20120926072005.GH1391@garage.freebsd.pl>
next in thread | previous in thread | raw e-mail | index | archive | help
On 26/09/12 4:50 PM, Pawel Jakub Dawidek wrote: > On Tue, Sep 25, 2012 at 05:37:12PM -0600, Kenneth D. Merry wrote: >> On Sat, Sep 22, 2012 at 12:41:49 +0000, Pawel Jakub Dawidek wrote: >>> Log: >>> Use the topology lock to protect list of providers while withering them. >>> It is possible that provider is destroyed while we are iterating over the >>> list. > [...] >>> --- head/sys/geom/geom_disk.c Sat Sep 22 12:40:52 2012 (r240821) >>> +++ head/sys/geom/geom_disk.c Sat Sep 22 12:41:49 2012 (r240822) >>> @@ -635,10 +635,13 @@ disk_gone(struct disk *dp) >>> struct g_geom *gp; >>> struct g_provider *pp; >>> >>> + g_topology_lock(); >>> gp = dp->d_geom; >>> - if (gp != NULL) >>> + if (gp != NULL) { >>> LIST_FOREACH(pp, &gp->provider, provider) >>> g_wither_provider(pp, ENXIO); >>> + } >>> + g_topology_unlock(); >>> } > [...] >> This breaks devices going away in CAM. >> >> When the da(4) driver calls disk_gone(), it is necessarily holding the SIM >> lock, which is a regular MTX_DEF mutex. The GEOM topology lock is an sx >> lock, and of WITNESS blows up because of that: > [...] >> disk_gone() needs to be callable from an interrupt context. So it cannot >> acquire the topology lock. > I can reword what you said above a bit: disk_gone() modifies GEOM > topology, so it has to hold the topology lock, thus it cannot be called > from an interrupt context. > > We might be able to change the topology lock to LIST_FOREACH_SAFE(), as > g_wither_provider() can only destroy current provider. This is because > CAM own the geom and nobody should be able to mess with its provider's > list, apart from CAM itself. So I'd need to know how CAM ensures that > two disk_gone() cannot be called for one geom. The answer might be that > those geoms have always only one provider, but then I cannot explain why > we have loop there. Maybe jdp@ will remember why. > > I also read original commit message and I don't fully understand. > > It was 7 years ago so I'll quote entire commit message: > > Fix a bug that caused some /dev entries to continue to exist after > the underlying drive had been hot-unplugged from the system. Here > is a specific example. Filesystem code had opened /dev/da1s1e. > Subsequently, the drive was hot-unplugged. This (correctly) caused > all of the associated /dev/da1* entries to be deleted. When the > filesystem later realized that the drive was gone it closed the > device, reducing the write-access counts to 0 on the geom providers > for da1s1e, da1s1, and da1. This caused geom to re-taste the > providers, resulting in the devices being created again. When the > drive was hot-plugged back in, it resulted in duplicate /dev entries > for da1s1e, da1s1, and da1. > > This fix adds a new disk_gone() function which is called by CAM when a > drive goes away. It orphans all of the providers associated with the > drive, setting an error condition of ENXIO in each one. In addition, > we prevent a re-taste on last close for writing if an error condition > has been set in the provider. > > If disk is gone it should orphan its provider. This should cause > orphaning depended providers, including da1s1 and then da1s1e. > Orphaning sets provider's error and providers with the error set are not > retasted. This is from g_access(): > > else if (pp->acw != 0 && pp->acw == -dcw && pp->error == 0 && > !(pp->geom->flags & G_GEOM_WITHER)) > g_post_event(g_new_provider_event, pp, M_WAITOK, > pp, NULL); > > My question is: is disk_geom() really needed? Maybe there is some small > bug somewhere, but I think GEOM should handle all this without disk_gone(). > I suspect the issue is with CAM/SIM and the geom commit has just exposed the problem. I've been hitting hangs with device removal which appear very related to this: http://lists.freebsd.org/pipermail/freebsd-usb/2012-September/011489.html
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?5062E806.8000707>