Date: Mon, 3 Apr 2006 18:15:58 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
To: "Marc G. Fournier"
Cc: pjd@FreeBSD.org, freebsd-current@FreeBSD.org, freebsd-stable@FreeBSD.org
Subject: Re: new feature: private IPC for every jail

On Mon, 3 Apr 2006, Marc G. Fournier wrote:

> On Mon, 3 Apr 2006, Robert Watson wrote:
>
>> (1) The fact that System V IPC primitives are loadable, and unloadable,
>> which requires some careful handling relating to registration order, etc.
>
> For this one, I'm lost at the issue ... if not loaded, jail processes just
> couldn't attach ... if loaded and you try to unload while there are shared
> memory segments in play, don't unload ... or is there something I'm missing
> here? What happens now if I load IPC, start up PostgreSQL, and then try to
> unload IPC? I hardcode all the stuff I use in my kernel, so I don't use the
> load/unload mechanism, so I can't test this easily ...

The problem is the relationship between jails and loadable System V IPC, and
has to do with how you might implement the relationship between the two
subsystems. There are two general ways to approach adding virtualization to
the System V IPC name spaces:

(1) Add a general virtualization facility, which causes the current process
    and its children to see a new name space.

(2) Key virtualization to the identity of the jail.

When dealing with the file system, jail relies on the existing chroot()
subsetting facility to introduce virtualization. This is a nice piece of
behavior, as it means file system subsetting is a facility available to be
used regardless of the use of jail, and it avoids hard-coding jail
instrumentation throughout the file system code.

So the question is this: if you load System V IPC support after you start a
jail, how do we handle jails that have already started? Do we go out and
create new name spaces for jails already started (a problem for method (1),
because it implies System V IPC will have pretty intimate knowledge of jails,
and know how to walk lists, etc.), or do we deny access to System V IPC for
jails that were created before it was loaded?
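
To make the ordering problem concrete, here is a toy model of the shape
option (2) might take -- hypothetical names throughout, not actual FreeBSD
kernel code -- where a NULL per-jail pointer records "the sysvshm code
wasn't loaded when this jail was created":

    #include <errno.h>
    #include <stddef.h>

    /* Hypothetical per-jail shared memory name space. */
    struct shm_namespace {
            int     ns_nsegs;       /* segments allocated in this space */
    };

    /* Stand-in for the jail structure; pr_shmspace is invented. */
    struct prison_model {
            struct shm_namespace *pr_shmspace; /* NULL: jail predates module */
    };

    int
    model_shmget(struct prison_model *pr)
    {
            if (pr != NULL && pr->pr_shmspace == NULL)
                    return (ENOSYS); /* deny: no name space was created */
            /* ... otherwise look the key up in pr->pr_shmspace, or in
               the host's space for unjailed processes ... */
            return (0);
    }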
Likewise, although we tend to refer to the different IPC mechanisms as a
single category, System V IPC, there are actually three name spaces (shared
memory, semaphores, and message queues), and the functionality for each can
be loaded separately. It's not that these questions can't be answered, but
they do have to be answered. My leaning, btw, in implementing this would be
to:

- For each System V IPC mechanism, implement a mechanism to create a new
  name space, to be used by the current process and any children (until they
  replace it with a new one, similar to chroot()).

- In jail(), similar to the way in which it uses chroot() to subset the file
  system name space, cause the creation of new name spaces, if the IPC
  services are present.

- We'll need a way to flag jails as not permitting any System V IPC of a
  particular type, to be used when the IPC service isn't loaded at the time
  jail() is to create a new jail, and the System V IPC services will need to
  check those flags (or whatever).

- We'll need a way to name the new name spaces (unlike the file system, we
  can't rely on an existing facility), and we'll need to enhance the System V
  IPC monitoring and management tools. For example, ipcs, ipcrm, etc., will
  need to know about this, the kernel interfaces for management will need to
  know how to deal with name spaces, and they will have to make sure to use
  the right checks and decide how to represent the fact that processes in
  jails should not be able to see name spaces other than their own. Maybe
  this is a flag to name space creation, but something is needed here.

Note there are some other tricky dependency problems, such as the fact that
jails have to interact with code that may or may not be loaded, how to have
jail vs. IPC notions of privilege interact, etc.

>> (2) The name space model for System V IPC is flat, so while it's desirable
>> to allow the administrator in the host environment to monitor and control
>> resource use in the jail (for example, delete allocated but unused
>> segments), doing that requires developing an administrative model for it.
>
> Again, you've lost me here ... how is that different than not using a jail?
> From the root server, one does an 'ipcs -a' and ipcrm as required ... the
> only thing I could think of 'being a nice thing' here is to maybe add a
> 'jail' value, similar to what is in proc, so that you know what segments
> belong to a specific jail ...
>
> I'm free to admit that I may be missing something you are seeing as
> obvious, mind you ;)
>
> For instance, are you suggesting that 'root' in the jail himself could
> issue ipcs -a and ipcrm?

I'm referring to how ipcs, ipcrm, etc., in the host environment interact with
the IPC resources in the jail environments. In particular, I'm making the
assumption that it is useful and desirable for the administrator running in
the host to be able to directly monitor allocation in the jails, and manage
that allocation, without running the management commands in the jail.

I'm not sure if you've ever programmed to the System V IPC API, but if you
have, you'll know that the name space for IPC objects is "odd". It's
non-hierarchical, and hence highly subject to collisions between
applications. This means that we can't use neat tricks, such as chroot() in
the file system, to implement virtualization.
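
If you haven't bumped into the API, this small fragment (standard calls,
nothing FreeBSD-specific) shows what "flat" means in practice -- any two
processes that pick the same key name the same object, because there is no
path-like hierarchy keeping them apart:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int
    main(void)
    {
            key_t key = 54321;      /* flat, system-wide key */
            int shmid;

            /* Create (or find) the segment named by the key. */
            shmid = shmget(key, 4096, IPC_CREAT | 0600);
            if (shmid == -1) {
                    perror("shmget");
                    exit(1);
            }
            printf("key %ld -> shmid %d\n", (long)key, shmid);
            return (0);
    }

Run it twice: the second run finds the first run's segment rather than
getting a private one, which is exactly the collision problem.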
If you compare the behavior of MySQL in a jail with PostgreSQL, you'll see
how this plays out immediately: MySQL uses UNIX domain sockets by default,
and this means it "just works" with jail, as the UNIX domain socket name
space is, in fact, the file system name space. If MySQL uses /tmp/mysql.sock
in a jail, it's virtualized by virtue of the fact that
/jail/www.whatever.com/tmp doesn't, by definition, collide with
/jail/www.notanother.com/tmp.

Because the System V IPC name space is non-hierarchical, we have to deal
with the fact that names can and do collide. If each jail has its own name
space, for example, and each contains a PostgreSQL session with an ID of
54321 (made up), then a process in the host environment can't simply issue
the normal System V IPC system calls in order to delete them, because those
calls have no way to express which name space the operation applies to. In
the jail, this is OK, because applications will get whatever the jail-local
name space is. But outside the jail, these commands would see the name space
for the host, and none of the contents of the jails' name spaces.

In essence, this means that we need to add new interfaces to allow ipcs,
ipcrm, etc., to run outside the jails yet see and operate on objects in the
jails. Again, this can be done, but the details are non-trivial, since they
raise hard questions about generalization, interactions between dynamically
loaded components, access control, name spaces, etc. This is why no one has
done it yet. Several people, including myself, have sat down and done the
first 30% of the hack -- enough to get things working a bit, and to bump
into all the tricky parts (see above).
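
To put the interface gap in concrete terms, compare the prototype we have
today with the kind of extension (names entirely made up -- nothing like
this exists) that host-side ipcs/ipcrm would need in order to reach into a
particular jail's name space:

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    /* Today: the name space is implicit -- whichever one the calling
       process lives in -- so the host can't name a jail's segment. */
    int shmctl(int shmid, int cmd, struct shmid_ds *buf);

    /* Imagined extension: an explicit name space handle, plus some way
       to enumerate the name spaces themselves for ipcs(1). */
    int shmctl_ns(int nsid, int shmid, int cmd, struct shmid_ds *buf);

Robert N M Watson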