Date: Sun, 18 Jan 2015 12:54:10 +0200 From: Konstantin Belousov <kostikbel@gmail.com> To: Hans Petter Selasky <hps@selasky.org> Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org Subject: Re: svn commit: r277199 - in head/sys: fs/devfs kern Message-ID: <20150118105410.GN42409@kib.kiev.ua> In-Reply-To: <54B8D31B.9030805@selasky.org> References: <201501142207.t0EM7Dfn041543@svn.freebsd.org> <20150115033109.GM42409@kib.kiev.ua> <54B76F2B.8040106@selasky.org> <20150115093841.GX42409@kib.kiev.ua> <54B79B25.3070707@selasky.org> <20150115115123.GA42409@kib.kiev.ua> <54B7AF2F.3080802@selasky.org> <20150116080332.GE42409@kib.kiev.ua> <54B8D31B.9030805@selasky.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, Jan 16, 2015 at 10:00:11AM +0100, Hans Petter Selasky wrote: > Hi Konstantin, > > On 01/16/15 09:03, Konstantin Belousov wrote: > > My opinion is that you should have tried to handle the issue at the driver > > level, instead of making this devfs issue. I.e., if you already have > > cdev node with the correct name, driver should have replaced the private > > data to point to new device. > > I think this way cannot be implemented in a clean way, because of > locking order reversal. And if you try to avoid the LOR you end up with > the race. Chess mate sort of ;-) Let me explain: > > Thread 1: > usb_sx_lock(); > cdev = Look in freelist for existing device(); > else > cdev = make_dev(); > usb_sx_unlock(); > > Thread 2: > usb_sx_lock(); > put cdev on freelist > usb_sx_unlock(); > > Thread 3: > usb_sx_lock(); > cdev = remove first entry in freelist > usb_sx_unlock(); > > /* > * XXX because USB needs to call destroy_dev() unlocked we > * are racing with Thread 1 again > */ > destroy_dev(cdev); I do not understand the 'freelist' part. You have some usb (or whatever other kind) device, which is represented by some data structure. This data structure references cdev. In reverse, cdev points to this data structure by si_drv1 or similar driver-owned member of struct cdev. The structures naming usb devices have some mapping to/from usb bus addresses. When device is destroyed or created due to external plug event, there is some mechanism that ensures mapping consistence. It could be lock, it could be single-threaded process which discover the bus, or something else, I do not know. While the structure notes that device with some address goes out and comes in, it can look up the corresponding cdev and just change the si_drv1 pointer to point to the new device. I find it very strange that you need sleepless operation which can _both_ destroy and create cdev. At least one operation of creation or destruction must sleep, at least in the current devfs design. It could be made sleepless, by making VFS visible part even less connected to the cdev part, but this is not how it (can) currently work. > > > > > This would also close a window where /dev node is non-existent or operate > > erronously. > > I'm not saying I plan so, but I think "cdevs" at some point need to > understand mutexes and locks. That means like with other API's in the > kernel we can associate a lock with the "cdev", and this lock is then > used to ensure an atomic shutdown of the system in a non-blocking > fashion. In my past experience multithreaded APIs should be high level > implemented like this: > > NON-BLOCKING methods: > lock(); ** > xxx_start(); > xxx_stop(); > unlock(); > > BLOCKING methods: > setup(); // init > unsetup(); // drain > > Any callbacks should always be called locked ** > > In devfs there was no non-blocking stop before I added the delist_dev() > function. > > > > > You do not understand my point. > > > > I object against imposing one additional global reference on all cdevs > > just to cope with the delist hack. See the patch at the end of the message. > > It's fine by me. > > > > WRT destroy_dev_sched_cb() calling delist_dev(), even after calling > > delist_dev(), the node still exists in the /dev. It is only removed > > after populate loop is run sometime later. dev_sched() KPI is inheritly > > racy, drivers must handle the races for other reasons. > > The populate loop is all running under the dev_lock() from what I can > see and make_dev() is also keeping the same lock when inserting and > removing new cdevs. The populate loop should always give a consistent No, populate loop is not run under dev_mtx. It would not be able to allocate memory, even in M_NOWAIT fashion. The dev_mtx is after the vm and vnode locks. Only iteration over the cdevp_list is protected by dev_mtx, which is dropped right after something which require an action, is found on the list. Populate() needs some way to avoid reading inconsistent data from the cdev layer, but there is not guarantee that we are always up to date. > view of the character devices available, and I don't see how "cdev" > structures without the CDP_ACTIVE flag can appear with recently created > ones, even if the name is the same. It cannot simply because it is not synchronized with the device creation/destruction. > > > > >> > >>> Entry can be random, since after the populate loop is ran, your code in > >>> other thread could start and create duplicate entry. There is a window > >>> in the lookup where both directory vnode lock and mount point sx locks > >>> are dropped. So running the populate does not guarantee anything. > >> > >> If there is such a race, it is already there! My patch changes nothing > >> in that area: > >> > >> Thread1: > >> Calls destroy_dev() and clears CDP_ACTIVE, after dev_unlock() it goes > >> waiting for some refs for xxx milliseconds. > >> Thread2: > >> Tries to create create a new character device having the same name like > >> the one in thread1. Device name duplication check is missed because > >> CDP_ACTIVE is cleared. Still thread1 is waiting. > >> Thread3: > >> Tries to open character device created by thread2 while thread1 is still > >> waiting for some ref held by a userspace app to go away. > >> > >> This can happen already before my patches! What do you think? > > Possibly. > > > > > >> At what level do you mean duplicate names, I don't get this fully? At > >> the directory level (DE nodes)? Or inside the list of character devices > >> (LIST_XXX)? > > It does not matter, dup at either one directory level, or dup of full > > names in the global list are equivalent (bad) things. > > Like I write above I don't see where the problem is. At the cdev level, > we are protecting the cdev's LIST with dev_lock() and only one entry > will exist having CDP_ACTIVE bit set per unique cdev name and path. Else > we will hit a panic in make_dev() and friends. > > In the directory entry level the populate loop will also ensure a > consistent view, and hence the cdev's LIST is consistent, the view > presented to userspace will also be consistent. Lookup (or any other VFS-level code) is not protected by dev_mtx, it is only dm lock which ensures that populate loop is not run in parallel, modifying direntries. > > That system functions can still call into the dangling read/write/ioctl > functions is another story, and that is why I tell, that in order to > simplify this teardown, we possibly should associate a client selectable > lock with each CDEV, for teardown purposes. Like done for several years > in the callout and USB APIs and possibly many other places. There is d_purge method which provides the tear-down semantic for drivers which needs it. Only tty uses the method AFAIK. Devfs ensures that no client code is active when cdev is removed for real. > > > > >> > >>> Let me summarize: > >>> - the extra reference on arbitrary cdev should be eliminated. The > >>> delist_dev_locked() may add the ref and set some CDP_ flag to > >>> indicate to destroy_dev() that it should do dev_rel(). > >> > >> It is possible to do this. I thought about this before doing my patch, > >> but decided to try to avoid adding yet another cdev flag. > >> > >>> - the existence of the duplicated entries should be either eliminated > >>> (I am not sure it is possible with your code), or we must ensure > >>> that only one name with CDP_ACTIVE flag set and given name exists. > >>> Then, all lookup code must be audited to take CDP_ACTIVE into account > >>> when accessing names. I see at least devfs_find() and devfs_mknod() > >>> which must be changed. I did not performed full audit. > >> > >> I will check this path out aswell. > >> > > > > It seems it is simpler for me to try to clean up after the commit. > > The patch was only lightly tested. I post the patch for discussion, > > not for you to committing it. I will expedite the change into HEAD > > after the consensus on it is made and adequate testing is performed. > > I don't see any problems about your patch, except it adds a bit more > code to the kernel.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20150118105410.GN42409>