From owner-svn-src-head@FreeBSD.ORG  Sun Jan 18 12:54:27 2015
Return-Path: <owner-svn-src-head@FreeBSD.ORG>
Delivered-To: svn-src-head@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 7F6DDAF7;
 Sun, 18 Jan 2015 12:54:27 +0000 (UTC)
Received: from mail.turbocat.net (heidi.turbocat.net [88.198.202.214])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 1FA2E979;
 Sun, 18 Jan 2015 12:54:26 +0000 (UTC)
Received: from laptop015.home.selasky.org
 (cm-176.74.213.204.customer.telag.net [176.74.213.204])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 by mail.turbocat.net (Postfix) with ESMTPSA id D2D821FE023;
 Sun, 18 Jan 2015 13:54:23 +0100 (CET)
Message-ID: <54BBAD31.701@selasky.org>
Date: Sun, 18 Jan 2015 13:55:13 +0100
From: Hans Petter Selasky <hps@selasky.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:31.0) Gecko/20100101 Thunderbird/31.3.0
MIME-Version: 1.0
To: Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: svn commit: r277199 - in head/sys: fs/devfs kern
References: <20150115033109.GM42409@kib.kiev.ua>
 <54B76F2B.8040106@selasky.org> <20150115093841.GX42409@kib.kiev.ua>
 <54B79B25.3070707@selasky.org> <20150115115123.GA42409@kib.kiev.ua>
 <54B7AF2F.3080802@selasky.org> <20150116080332.GE42409@kib.kiev.ua>
 <54B8D31B.9030805@selasky.org> <20150118105410.GN42409@kib.kiev.ua>
 <54BB942A.5070200@selasky.org> <20150118113150.GQ42409@kib.kiev.ua>
In-Reply-To: <20150118113150.GQ42409@kib.kiev.ua>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org,
 src-committers@freebsd.org
X-BeenThere: svn-src-head@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: SVN commit messages for the src tree for head/-current
 <svn-src-head.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/svn-src-head>,
 <mailto:svn-src-head-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-head/>
List-Post: <mailto:svn-src-head@freebsd.org>
List-Help: <mailto:svn-src-head-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-head>,
 <mailto:svn-src-head-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 18 Jan 2015 12:54:27 -0000

Hi Konstantin,

On 01/18/15 12:31, Konstantin Belousov wrote:
> On Sun, Jan 18, 2015 at 12:08:26PM +0100, Hans Petter Selasky wrote:
>> Hi Konstantin,
>>
>> On 01/18/15 11:54, Konstantin Belousov wrote:
>>> On Fri, Jan 16, 2015 at 10:00:11AM +0100, Hans Petter Selasky wrote:
>>>> Hi Konstantin,

>>
>> What about events for "devd" that a new character devices is present in
>> the system or being destroyed for user-space applications?
> It is up to the driver to decide.  If wanted, it could post the pair of
> destroy/create itself (e.g., if there is some state which require
> the userspace to reset), or it could decide not to.  The later is
> reasonable if the destroy/create is believed to be caused by a glitch
> rather than the actual replug.

I would really prefer if that path is chosen, that kern_conf.c exposes 
an API for this. Consider the following case:

You have a daemon that is started and stopped based upon "devd" DEVFS 
create/destroy events. The daemon is started when the create event is 
received. Then the kernel side calls destroy_dev(), because the 
application still keeps the file handle opened, the destroy_dev() never 
sends the destroy event, and the application then never receives the 
destroy event --- another example of a totally invisible deadlock.

>>>
>>> I find it very strange that you need sleepless operation which can
>>> _both_ destroy and create cdev. At least one operation of creation or

My simple argument against sleepable operation inside devfs functions 
which are used to register and unregister a character devices is simply 
that it leads to the fact we cannot make an atomic shutdown of a kernel 
application which involves devfs and other kernel subsystems. Let me 
explain.

It is not enough that devfs works alone, it needs to work with together 
with other kernel APIs aswell. Let's make up an example. A kernel module 
is using three different APIs, lets say USB, the callout subsystem and 
devfs.

Devfs is calling USB to start and stop USB transfers.
The callout subsystem is calling selwakeup() and waking up devfs.
The USB subsystem is calling selwakeup() and waking up devfs aswell.
The USB subsystem is also calling destroy_dev().

This is a very high-level architectural thought and I hope you get it:

How can you tear down a kernel application using three different 
subsystems in the kernel _atomically_ without having lingering callbacks?

To make the shutdown sequence atomic, we might want to use a mutex. Now 
iff destroy_dev() is sleepable we need to drop that mutex when 
destroying the character device, and then woops - the shutdown sequence 
is no longer atomic. That's the root of the problem, and please think 
beyond devfs alone. The problem happens when devfs is used together with 
other APIs and applications in the kernel.

>>> destruction must sleep, at least in the current devfs design. It could
>>> be made sleepless, by making VFS visible part even less connected to the
>>> cdev part, but this is not how it (can) currently work.
>>
>> I think it would be good practice that all resources needed for creating
>> a character device can be allocated prior to registration. That means we
>> first should allocate all resources needed, but not register as a single
>> function.
>>
>> For example:
>>
>> Once make_dev() has returned, we can get an "d_open" callback. But
>> "si_drv1/2" is always set by drivers _after_ that "make_dev()" has
>> returned! This causes a race we could simply avoid by splitting the
>> make_dev() like I suggest. Instead of putting more logic and code inside
>> the drivers to handle the race!
> I wanted to tunnel the si_drv values through the make_dev().
> I.e. there could appear yet another variant of make_dev() which takes
> the initialization parameters for the si_* driver vars.
>
> I just did not have time/motivation to do the pass.

I might be interested in giving you a hand there, but not right now.

>>
>>>>
>>>>>
>>>>> This would also close a window where /dev node is non-existent or operate
>>>>> erronously.
>>>>
>>>> I'm not saying I plan so, but I think "cdevs" at some point need to
>>>> understand mutexes and locks. That means like with other API's in the
>>>> kernel we can associate a lock with the "cdev", and this lock is then
>>>> used to ensure an atomic shutdown of the system in a non-blocking
>>>> fashion. In my past experience multithreaded APIs should be high level
>>>> implemented like this:
>>>>
>>>> NON-BLOCKING methods:
>>>> lock(); **
>>>> xxx_start();
>>>> xxx_stop();
>>>> unlock();
>>>>
>>>> BLOCKING methods:
>>>> setup(); // init
>>>> unsetup(); // drain
>>>>
>>>> Any callbacks should always be called locked **
>>>>
>>>> In devfs there was no non-blocking stop before I added the delist_dev()
>>>> function.
>>>>
>>>>>
>>>>> You do not understand my point.
>>>>>
>>>>> I object against imposing one additional global reference on all cdevs
>>>>> just to cope with the delist hack.  See the patch at the end of the message.
>>>>
>>>> It's fine by me.
>>>>>
>>>>> WRT destroy_dev_sched_cb() calling delist_dev(), even after calling
>>>>> delist_dev(), the node still exists in the /dev. It is only removed
>>>>> after populate loop is run sometime later. dev_sched() KPI is inheritly
>>>>> racy, drivers must handle the races for other reasons.
>>>>
>>>> The populate loop is all running under the dev_lock() from what I can
>>>> see and make_dev() is also keeping the same lock when inserting and
>>>> removing new cdevs. The populate loop should always give a consistent
>>> No, populate loop is not run under dev_mtx.
>>
>> Are you sure:
> Yes, I am sure.  I explained it in the next sentences which you break from
> the first one in the citation.

"devfs_populate()" is called exclusivly locked by "dm_lock" and doesn't 
return until an atomic TAILQ_FOREACH() of the "devp_list" under 
"dev_lock()" is consistent - right? So the VFS node tree will not be 
inconsistent under "dm_lock" either - right? Or is there something which 
I don't see?

void
devfs_populate(struct devfs_mount *dm)
{
         unsigned gen;

         sx_assert(&dm->dm_lock, SX_XLOCKED);
         gen = devfs_generation;
         if (dm->dm_generation == gen)
                 return;


>>> It would not be able to
>>> allocate memory, even in M_NOWAIT fashion.  The dev_mtx is after the
>>> vm and vnode locks.  Only iteration over the cdevp_list is protected by
>>> dev_mtx, which is dropped right after something which require an
>>> action, is found on the list. Populate() needs some way to avoid
>>> reading inconsistent data from the cdev layer, but there is not
>>> guarantee that we are always up to date.
> ^^^^^^^^^

Are you taking the sleepable "dm_lock" into account?

--HPS