Date: Sun, 16 Jun 2013 11:46:17 +0100
From: David Chisnall <theraven@FreeBSD.org>
To: Florent Peterschmitt <florent@peterschmitt.fr>
Cc: "freebsd-current@freebsd.org FreeBSD" <freebsd-current@FreeBSD.org>
Subject: Re: Handle kernel module crashes
Message-ID: <B4C001E3-86FC-409C-8B33-A52E7115E0C1@FreeBSD.org>
In-Reply-To: <51B5E575.1030006@peterschmitt.fr>
References: <51B5E040.2030709@peterschmitt.fr> <CAFMmRNxPcmx4gtwQfLjaFnMhAxBcBzYBd45vxJDcAU55ZFirQw@mail.gmail.com> <51B5E575.1030006@peterschmitt.fr>

On 10 Jun 2013, at 15:40, Florent Peterschmitt <florent@peterschmitt.fr> wrote:

> Ok and isn't it a "bad" thing ? I mean, even if the video driver
> crashes, I still want to have the ability to reboot the right way,
> avoiding corrupted files and WIP lose.
>
> Another thing is a non-critical module that can crash, but because not
> used by all apps on the machine, letting them ones that can continue run.
>
> But I don't know what is the approach of FreeBSD and devs about that.

Yes, it's a bad thing. If we had privilege domain crossing that was as
cheap as a function call (or, at least, almost as cheap) then we could
implement fine-grained separation within the kernel and not incur any
performance penalty. Unfortunately, this is not possible without some
fairly significant changes to current CPU instruction sets (which,
actually, several of us in FreeBSD land are working on, but that's
unlikely to be seen in any mainstream processor for at least 5-10
years).

In the current world, we have a fairly poor selection of choices for
isolation. On i386, we had 4 protection rings, but on the 486 and newer
the cost of transitions to and from rings 1 and 2 became increasingly
high, because most operating systems only used rings 0 and 3 (Netware
and OS/2 are the two exceptions that I know of). On other architectures
we just have privileged and unprivileged modes. Code in privileged mode
can't be isolated from other code in privileged mode; code in
unprivileged mode incurs some overhead for calls into privileged mode.

There are some tricks that you can do to enforce some weaker protection.
For example, on 64-bit platforms every driver could be written to use
32-bit pointers, be given a 4GB segment of privileged-mode virtual
memory for its own use, and have to go through special gates to do
anything with the rest of the kernel's address space. You'd then end up
with a lot more TLB churn, but you'd gain protection against a number of
kinds of pointer error (protection faults inside the 32-bit window would
just result in that module being killed and restarted).

Unfortunately, there are several problems with this. The most obvious
is that killing a module is not always trivial. For example, a module
may hold various locks, but it's not always clear which module owns a
lock. Locks are held by kernel threads, but a thread can have a call
stack spanning several modules. Working out exactly which driver holds
the lock can be hard, and there is also the question of what you do
about a thread that contains some call frames belonging to the module
that you've just killed. You'd need to provide some exception-like
mechanism for handling this case (and unwinding the stack when it is
potentially corrupt is also nontrivial).

An alternative is to run the driver entirely, or mostly, in userspace.
The 'mostly' option is often better. For example, certain categories of
USB devices are exposed by the FreeBSD kernel as USB generic devices
(ugen driver) and some userspace component sends USB commands to them.
This involves some extra copying, but means that most of the
(potentially buggy) driver logic is in the application. If it crashes,
you lose the application state (which, in a desktop setting, is only
slightly better than crashing the kernel), but not the whole kernel.

In the case of certain modern network interfaces (InfiniBand in
particular) and modern GPUs, the kernel handles even less.
The device has some hardware support for multiplexing and isolation, so
all that the kernel has to do is set up some memory that both the device
and the userspace code can access - including the device registers for
controlling a command queue - and then delegate most of the operation to
the userspace code. This requires an IOMMU to actually provide
isolation, otherwise an errant DMA request can still result in accessing
or modifying kernel memory.

Even with this kind of isolation, there are still potential problems.
Many devices react poorly to bad input and can be left in a state that
is hard to recover from, even if the driver itself is easy to restart.
A lot of OS instability (I saw a number as high as 20% of OS crashes
quoted at MSR recently) is caused by drivers reacting poorly to
intermittent hardware errors. Just restarting the driver (an approach
that they tried) solved some, but not all, of these cases.

Of course, there are a lot of things in the kernel that are not drivers.
For example, FUSE allows us to run filesystems in userspace instead of
in the kernel. This comes with a performance penalty as a result of
having to copy data from the kernel's buffer cache into the filesystem
process, then back into the kernel, and then into the destination
process (for a read - the same sequence in the opposite order on write).
Similarly, we have CUSE for character devices, which is used by a lot
of webcam drivers. These are a relatively good use-case for userspace
drivers, because they are typically a streaming interface (data flows
only from the device, and there isn't a lot of need for
latency-sensitive round trips from the app to the driver) and the
latency that users care about is on the order of 1/24th of a second,
which is a very long time on a modern computer.
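
As a rough illustration of the shared-memory arrangement mentioned
above, where the kernel sets up memory holding a command queue and then
steps aside, here is a minimal sketch in C of a single-producer,
single-consumer ring. All of the names (struct cmd, struct cmd_ring,
RING_SLOTS, ring_init, ring_push, ring_pop) are hypothetical and not
taken from any FreeBSD driver; a real device dictates its own queue
layout and doorbell registers, and the ring would live in memory mapped
into both sides rather than in an ordinary variable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring size; must be a power of two so indices can be masked. */
#define RING_SLOTS 8u
static_assert((RING_SLOTS & (RING_SLOTS - 1u)) == 0,
    "RING_SLOTS must be a power of two");

/* Hypothetical command layout; a real device defines its own. */
struct cmd {
    uint32_t opcode;
    uint32_t arg;
};

/*
 * Ring shared by exactly one producer (the userspace driver) and one
 * consumer (standing in for the device). head and tail count upwards
 * forever and are reduced to a slot index only when indexing.
 */
struct cmd_ring {
    _Atomic uint32_t head;          /* next slot the consumer reads */
    _Atomic uint32_t tail;          /* next slot the producer writes */
    struct cmd slots[RING_SLOTS];
};

void ring_init(struct cmd_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: returns false if the ring is full. */
bool ring_push(struct cmd_ring *r, struct cmd c)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_SLOTS)
        return false;
    r->slots[tail & (RING_SLOTS - 1u)] = c;
    /* Publish the slot only after its contents have been written. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
bool ring_pop(struct cmd_ring *r, struct cmd *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;
    *out = r->slots[head & (RING_SLOTS - 1u)];
    /* Hand the slot back to the producer only after copying it out. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}
```

The producer only ever writes tail and the consumer only ever writes
head, so neither side needs a lock. Nothing in this sketch protects
against a consumer that misbehaves.
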
There are other examples, such as Netmap for pushing network packets
directly into userspace, which can be combined with something like
Ilias Marinos' userspace network stack to run the entire TCP/IP stack
in userspace.

Moving drivers into userspace is not a panacea. It adds more
asynchronous behaviour, which makes reasoning about the code harder and
makes deadlocks far easier to introduce (for example, any userspace
process has a lot of implicit interactions with the VM subsystem, which
are more explicit in the kernel, and doesn't have a shared global
namespace for locks). Most of the code in the kernel is there because,
when the code was written, it was the most sensible place for it. In
most cases, that is still true, although as CPU and software
architectures evolve that may change.

David