From: John Baldwin
To: Pawel Jakub Dawidek
Date: Wed, 16 Apr 2008 10:14:40 -0400
In-Reply-To: <20080412112019.GI45299@garage.freebsd.pl>
Message-Id: <200804161014.41025.jhb@freebsd.org>
Cc: Roman Divacky, kib@freebsd.org, rwatson@freebsd.garage.freebsd.pl, freebsd-arch@freebsd.org
Subject: Re: final decision about *at syscalls

On Saturday 12 April 2008 07:20:19 am Pawel Jakub Dawidek wrote:
> On Thu, Dec 20, 2007 at 11:38:55AM -0500, John Baldwin wrote:
> > On Tuesday 18 December 2007 04:22:22 am Roman Divacky wrote:
> > > Dear arch@
> > >
> > > Over this summer I was working (among other things) on the *at family
> > > of syscalls, kindly sponsored by Google (in their Summer of Code). The
> > > resulting patch is almost finished but I need to decide one design
> > > question.
> > > If you are not interested in *at/namei, feel free to skip this
> > > mail.
> > >
> > > The *at syscalls are a thread-oriented extension to the basic file
> > > syscalls (think of open(), fstat(), etc.), adding the possibility to
> > > specify where the search for a relative path should start.
> > >
> > > Imagine that we have /tmp/foo/bar
> > >
> > > and CWD is set to "/tmp/", and the process has opened "foo" as dirfd.
> > > With the ordinary open() syscall you have to either
> > >
> > > chdir("/tmp/foo");open("./bar");
> > >
> > > or
> > >
> > > open("/tmp/foo/bar");
> > >
> > > The first approach is problematic because it changes CWD for all
> > > threads in the process; the second is prone to race conditions, as
> > > some of the components of the path can change in parallel with the
> > > "open".
> > >
> > > So POSIX introduced a new API, called "Extended API set part 2, ISBN:
> > > 1-931624-67-4" (at least this was the latest when I looked last time),
> > > which solves that by introducing "*at" syscalls that take an fd of a
> > > previously opened directory, which is used instead of CWD for
> > > resolving relative paths, i.e. the previous example becomes
> > >
> > > dirfd = open("/tmp/foo"); openat(dirfd, "bar");
> > >
> > > I implemented the whole API as native FreeBSD syscalls + in the
> > > linuxulator emulation layer. Here's the problem:
> > >
> > > There are two approaches to the name translation from "filedescriptor"
> > > to the "vnode".
> > >
> > > 1) we can do it in the kern_fooat() syscall and pass namei() the
> > >    resulting vnode
> > > 2) we can pass namei() the filedescriptor and do the translation
> > >    there
> > >
> > > PROs of #1:
> > >
> > > o namei() does not need to know about the curthread, you can use this
> > >   *at ability for different purposes, it's cleaner (imho)
> > >
> > > PROs of #2:
> > >
> > > o raceless implementation
> > > o no code duplication
> > >
> > > CONs of #1:
> > >
> > > o some very small code duplication (the translation is done in every
> > >   kern_fooat() function)
> > > o there is a race between the name translation and the actual use of
> > >   the result of the translation that needs to be handled; the
> > >   "path_to_file" string is copied to the kernel space twice, hence a
> > >   race
> > >
> > > CONs of #2:
> > >
> > > o namei is made thread-dependent
> > >
> > > Please tell me what approach you like more. I personally favour #1
> > > because I don't like namei() being thread-dependent; Kostik Belousov
> > > prefers #2.
> >
> > Considering Robert's paper on security race problems in things like
> > systrace stemming from when you copy parameters out of userland and into
> > the kernel multiple times, I think #2 is definitely the better choice.
> > Also, namei() is already thread aware AFAICT since 'struct componentname'
> > already contains a 'cnp_thread' member (was 'cnp_proc' in 4.x).
>
> It looks like I'm a bit too late, but anyway...
>
> From what you write, John, #1 is a better choice than #2. If you want to
> avoid races, you can pass an already locked vnode. In the case of file
> descriptors, if p_fd is not locked another thread can close and open a
> different directory under the same descriptor number.

Did you read Robert's paper?  Do you not realize that the kernel copying data
in from userland multiple times and having it change in between is very
bug-prone?

-- 
John Baldwin
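
For readers unfamiliar with the *at calls discussed in this thread, the
following is a minimal userland sketch of the openat() pattern from Roman's
/tmp/foo/bar example. POSIX puts the directory descriptor first and the
relative path second. The names read_relative() and demo() are hypothetical
and introduced only for illustration; the sketch shows the user-visible API
and says nothing about how namei() should do the descriptor-to-vnode
translation inside the kernel.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): read up to len - 1 bytes of
 * "name", resolved relative to the directory "dirpath", without changing
 * the process-wide CWD.  Returns the byte count, or -1 on error.
 */
ssize_t
read_relative(const char *dirpath, const char *name, char *buf, size_t len)
{
	int dirfd, fd;
	ssize_t n;

	if (len == 0)
		return (-1);

	/* Pin the directory once; the lookup below starts from this fd. */
	dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
	if (dirfd == -1)
		return (-1);

	/* Directory descriptor first, then the relative path. */
	fd = openat(dirfd, name, O_RDONLY);
	if (fd == -1) {
		close(dirfd);
		return (-1);
	}

	n = read(fd, buf, len - 1);
	if (n >= 0)
		buf[n] = '\0';

	close(fd);
	close(dirfd);
	return (n);
}

/*
 * Hypothetical self-check: build a scratch directory containing "bar",
 * read it back through read_relative(), then clean up.  Returns 0 if the
 * contents round-trip, -1 otherwise.
 */
int
demo(void)
{
	char dir[] = "/tmp/atdemo.XXXXXX";
	char path[64], buf[16];
	int fd, ok;
	ssize_t n;

	if (mkdtemp(dir) == NULL)
		return (-1);
	snprintf(path, sizeof(path), "%s/bar", dir);
	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
	if (fd == -1) {
		rmdir(dir);
		return (-1);
	}
	ok = (write(fd, "hello", 5) == 5);
	close(fd);

	n = read_relative(dir, "bar", buf, sizeof(buf));
	/* read_relative() always leaves room for the terminating NUL. */
	assert(n < (ssize_t)sizeof(buf));
	ok = ok && n == 5 && strcmp(buf, "hello") == 0;

	unlink(path);
	rmdir(dir);
	return (ok ? 0 : -1);
}
```

A caller would use read_relative("/tmp/foo", "bar", buf, sizeof(buf)) where
it would otherwise chdir() or open the full path. Other threads keep their
CWD, and a rename of /tmp/foo after the directory has been opened does not
redirect the lookup of "bar".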