From owner-svn-src-all@FreeBSD.ORG Sun Jan 18 12:54:27 2015 Return-Path: Delivered-To: svn-src-all@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 7F6DDAF7; Sun, 18 Jan 2015 12:54:27 +0000 (UTC) Received: from mail.turbocat.net (heidi.turbocat.net [88.198.202.214]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 1FA2E979; Sun, 18 Jan 2015 12:54:26 +0000 (UTC) Received: from laptop015.home.selasky.org (cm-176.74.213.204.customer.telag.net [176.74.213.204]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.turbocat.net (Postfix) with ESMTPSA id D2D821FE023; Sun, 18 Jan 2015 13:54:23 +0100 (CET) Message-ID: <54BBAD31.701@selasky.org> Date: Sun, 18 Jan 2015 13:55:13 +0100 From: Hans Petter Selasky User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Konstantin Belousov Subject: Re: svn commit: r277199 - in head/sys: fs/devfs kern References: <20150115033109.GM42409@kib.kiev.ua> <54B76F2B.8040106@selasky.org> <20150115093841.GX42409@kib.kiev.ua> <54B79B25.3070707@selasky.org> <20150115115123.GA42409@kib.kiev.ua> <54B7AF2F.3080802@selasky.org> <20150116080332.GE42409@kib.kiev.ua> <54B8D31B.9030805@selasky.org> <20150118105410.GN42409@kib.kiev.ua> <54BB942A.5070200@selasky.org> <20150118113150.GQ42409@kib.kiev.ua> In-Reply-To: <20150118113150.GQ42409@kib.kiev.ua> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 18 Jan 2015 12:54:27 -0000 Hi Konstantin, On 01/18/15 12:31, Konstantin Belousov wrote: > On Sun, Jan 18, 2015 at 12:08:26PM +0100, Hans Petter Selasky wrote: >> Hi Konstantin, >> >> On 01/18/15 11:54, Konstantin Belousov wrote: >>> On Fri, Jan 16, 2015 at 10:00:11AM +0100, Hans Petter Selasky wrote: >>>> Hi Konstantin, >> >> What about events for "devd" that a new character devices is present in >> the system or being destroyed for user-space applications? > It is up to the driver to decide. If wanted, it could post the pair of > destroy/create itself (e.g., if there is some state which require > the userspace to reset), or it could decide not to. The later is > reasonable if the destroy/create is believed to be caused by a glitch > rather than the actual replug. I would really prefer if that path is chosen, that kern_conf.c exposes an API for this. Consider the following case: You have a daemon that is started and stopped based upon "devd" DEVFS create/destroy events. The daemon is started when the create event is received. Then the kernel side calls destroy_dev(), because the application still keeps the file handle opened, the destroy_dev() never sends the destroy event, and the application then never receives the destroy event --- another example of a totally invisible deadlock. >>> >>> I find it very strange that you need sleepless operation which can >>> _both_ destroy and create cdev. At least one operation of creation or My simple argument against sleepable operation inside devfs functions which are used to register and unregister a character devices is simply that it leads to the fact we cannot make an atomic shutdown of a kernel application which involves devfs and other kernel subsystems. Let me explain. It is not enough that devfs works alone, it needs to work with together with other kernel APIs aswell. Let's make up an example. A kernel module is using three different APIs, lets say USB, the callout subsystem and devfs. Devfs is calling USB to start and stop USB transfers. The callout subsystem is calling selwakeup() and waking up devfs. The USB subsystem is calling selwakeup() and waking up devfs aswell. The USB subsystem is also calling destroy_dev(). This is a very high-level architectural thought and I hope you get it: How can you tear down a kernel application using three different subsystems in the kernel _atomically_ without having lingering callbacks? To make the shutdown sequence atomic, we might want to use a mutex. Now iff destroy_dev() is sleepable we need to drop that mutex when destroying the character device, and then woops - the shutdown sequence is no longer atomic. That's the root of the problem, and please think beyond devfs alone. The problem happens when devfs is used together with other APIs and applications in the kernel. >>> destruction must sleep, at least in the current devfs design. It could >>> be made sleepless, by making VFS visible part even less connected to the >>> cdev part, but this is not how it (can) currently work. >> >> I think it would be good practice that all resources needed for creating >> a character device can be allocated prior to registration. That means we >> first should allocate all resources needed, but not register as a single >> function. >> >> For example: >> >> Once make_dev() has returned, we can get an "d_open" callback. But >> "si_drv1/2" is always set by drivers _after_ that "make_dev()" has >> returned! This causes a race we could simply avoid by splitting the >> make_dev() like I suggest. Instead of putting more logic and code inside >> the drivers to handle the race! > I wanted to tunnel the si_drv values through the make_dev(). > I.e. there could appear yet another variant of make_dev() which takes > the initialization parameters for the si_* driver vars. > > I just did not have time/motivation to do the pass. I might be interested in giving you a hand there, but not right now. >> >>>> >>>>> >>>>> This would also close a window where /dev node is non-existent or operate >>>>> erronously. >>>> >>>> I'm not saying I plan so, but I think "cdevs" at some point need to >>>> understand mutexes and locks. That means like with other API's in the >>>> kernel we can associate a lock with the "cdev", and this lock is then >>>> used to ensure an atomic shutdown of the system in a non-blocking >>>> fashion. In my past experience multithreaded APIs should be high level >>>> implemented like this: >>>> >>>> NON-BLOCKING methods: >>>> lock(); ** >>>> xxx_start(); >>>> xxx_stop(); >>>> unlock(); >>>> >>>> BLOCKING methods: >>>> setup(); // init >>>> unsetup(); // drain >>>> >>>> Any callbacks should always be called locked ** >>>> >>>> In devfs there was no non-blocking stop before I added the delist_dev() >>>> function. >>>> >>>>> >>>>> You do not understand my point. >>>>> >>>>> I object against imposing one additional global reference on all cdevs >>>>> just to cope with the delist hack. See the patch at the end of the message. >>>> >>>> It's fine by me. >>>>> >>>>> WRT destroy_dev_sched_cb() calling delist_dev(), even after calling >>>>> delist_dev(), the node still exists in the /dev. It is only removed >>>>> after populate loop is run sometime later. dev_sched() KPI is inheritly >>>>> racy, drivers must handle the races for other reasons. >>>> >>>> The populate loop is all running under the dev_lock() from what I can >>>> see and make_dev() is also keeping the same lock when inserting and >>>> removing new cdevs. The populate loop should always give a consistent >>> No, populate loop is not run under dev_mtx. >> >> Are you sure: > Yes, I am sure. I explained it in the next sentences which you break from > the first one in the citation. "devfs_populate()" is called exclusivly locked by "dm_lock" and doesn't return until an atomic TAILQ_FOREACH() of the "devp_list" under "dev_lock()" is consistent - right? So the VFS node tree will not be inconsistent under "dm_lock" either - right? Or is there something which I don't see? void devfs_populate(struct devfs_mount *dm) { unsigned gen; sx_assert(&dm->dm_lock, SX_XLOCKED); gen = devfs_generation; if (dm->dm_generation == gen) return; >>> It would not be able to >>> allocate memory, even in M_NOWAIT fashion. The dev_mtx is after the >>> vm and vnode locks. Only iteration over the cdevp_list is protected by >>> dev_mtx, which is dropped right after something which require an >>> action, is found on the list. Populate() needs some way to avoid >>> reading inconsistent data from the cdev layer, but there is not >>> guarantee that we are always up to date. > ^^^^^^^^^ Are you taking the sleepable "dm_lock" into account? --HPS