From: "Robert N. M. Watson"
Date: Sat, 3 Mar 2018 12:16:34 +0000
Subject: Re: [capsicum] unlinkfd
To: Alex Richardson
Cc: Justin Cormack, freebsd-hackers@freebsd.org, Mariusz Zaborski
Message-Id: <1ED213FC-D0CA-46D9-B6D1-EA261F2B80F5@cl.cam.ac.uk>
References: <20180302183514.GA99279@x-wing> <17DE0BFF-42A2-4CD7-B09C-ABA2606C4041@cl.cam.ac.uk>

In general, in UNIX, "unlink" is a namespace operation relative to a
directory, and not an operation on a file, so I wouldn't expect to have
a system call that searches a directory looking for a matching file, but
rather always a call that specifies the specific segment to remove (as
there may well be more than one of them).

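To make the "more than one of them" point concrete, here is a minimal
sketch in C using only POSIX calls. It gives one file two names in a
scratch directory and removes the names one at a time while a descriptor
stays open. The function name link_count_demo, the names "a" and "b",
and the /tmp location are all made up for illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical demo: one file, two names ("a" and "b").  Records the link
 * count seen through the open descriptor after creating the second name,
 * after removing "a", and after removing "b".
 * Returns 0 on success, -1 on any error.
 */
int
link_count_demo(long counts[3])
{
	char dir[] = "/tmp/unlink-demo.XXXXXX";
	struct stat sb;
	int dfd = -1, fd = -1, ret = -1;

	assert(counts != NULL);
	if (mkdtemp(dir) == NULL)
		return (-1);
	if ((dfd = open(dir, O_RDONLY | O_DIRECTORY)) < 0)
		goto out;
	if ((fd = openat(dfd, "a", O_CREAT | O_EXCL | O_RDWR, 0600)) < 0)
		goto out;

	/* Second name for the same file, in the same directory. */
	if (linkat(dfd, "a", dfd, "b", 0) != 0 || fstat(fd, &sb) != 0)
		goto out;
	counts[0] = (long)sb.st_nlink;

	/* Removing "a" removes a name; the file is still reachable as "b". */
	if (unlinkat(dfd, "a", 0) != 0 || fstat(fd, &sb) != 0)
		goto out;
	counts[1] = (long)sb.st_nlink;

	/* Removing "b" leaves the open descriptor as the only reference. */
	if (unlinkat(dfd, "b", 0) != 0 || fstat(fd, &sb) != 0)
		goto out;
	counts[2] = (long)sb.st_nlink;
	ret = 0;
out:
	if (dfd >= 0) {
		/* Best-effort cleanup; these fail harmlessly if already gone. */
		(void)unlinkat(dfd, "a", 0);
		(void)unlinkat(dfd, "b", 0);
		(void)close(dfd);
	}
	if (fd >= 0)
		(void)close(fd);
	(void)rmdir(dir);
	return (ret);
}
```

Each unlinkat(2) call names the directory entry to remove, and the
descriptor alone never says which entry is meant. On a filesystem with
ordinary hard-link support the recorded counts should drop by one per
removed name.
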
It seems to me like there are a few different use cases:

(1) Just want some temporary non-persistent file-like storage please.
    Here, swap-backed anonymous objects are probably generally
    preferable, although if they will be huge, perhaps a filesystem is a
    better place to back them.

(2) Want a temporary (non-persistent) hierarchical namespace full of
    file-like things. This need is not well served, as you need to not
    only create this within a current filesystem, but garbage collection
    of the results is not reliable in the presence of crashes/etc.

(3) Want capability-based access to a persistent hierarchical namespace
    full of files. This is well served by the current *at(2) system
    calls along with filesystems, although there are API gaps (e.g., a
    lack of unlinkat(2) in FreeBSD).

Because of the complexity of (2), a Casper service is likely the way to
go. We should fill the API gaps on (3) through new POSIX-like *at(2)
calls. For (1), the real issue is if the current swap-backed APIs are
insufficient, in which case a Casper service might be the way to go.

Robert

> On 3 Mar 2018, at 11:31, Alex Richardson wrote:
>
> Linux has an unlinkat() system call
> (https://linux.die.net/man/2/unlinkat) but it doesn't seem to have a
> flag that lets you unlink the fd itself.
> Possibly pathname == NULL and AT_EMPTY_PATH could mean unlink the fd
> but I haven't tried whether that works.
> It also has an AT_REMOVEDIR flag to make it function as rmdirat().
>
> Alex
>
> On 3 March 2018 at 10:41, Robert N. M. Watson wrote:
> FWIW, this is part of why we introduced anonymous POSIX shared memory
> objects with Capsicum in FreeBSD -- we allow shm_open(2) to be passed a
> SHM_ANON special name, which causes the creation of a swap-backed,
> mappable file-like object that can have I/O, memory mapping, etc,
> performed on it .. but never has any persistent state across reboots
> even in the event of a crash.
>
> With Capsicum you can then refine a file descriptor to the otherwise
> writable object to be read-only for the purposes of delegation. There
> is not, however, a mechanism to "freeze" the state of the object
> causing other outstanding writable descriptors to become read-only --
> certainly something could be added, but some care regarding VM
> semantics would be required -- in particular, so that faults could not
> be experienced as a result of a memory store performed before the
> "freeze" but issued to VFS only later.
>
> I certainly have no objection to an unlinkat(2) system call -- it's
> unfortunate that a full suite of the *at(2) APIs wasn't introduced in
> the first place. It would be worth checking whether anyone else (e.g.,
> Solaris, Mac OS X, Linux) has already added an unlinkat(2) that we can
> match API semantics for. I think I take the view that for truly
> anonymous objects, shm_open(2) without a name (or the Linux equiv) is
> the right thing -- and hence unlinkat(2) is for more conventional use
> cases where the final pathname element is known.
>
> On directories: There, I find myself falling back on a Casper-like
> service, since GC'ing a single anonymous memory object is
> straightforward, but GC'ing a directory hierarchy is a more messy
> business.
>
> Robert
>
> > On 3 Mar 2018, at 09:53, Justin Cormack wrote:
> >
> > I think it would make sense to have an unlinkfd() that unlinks the
> > file from everywhere, so it does not need a name to be specified.
> > This might be hard to implement.
> >
> > For temporary files, I really like Linux memfd_create(2) that opens
> > an anonymous file without a name. This semantics is really useful.
> > (Linux memfd also has additional options for sealing the file to
> > make it immutable which are very useful for safely passing files
> > between processes.)
> > Having a way to make unnamed temporary files solves a lot of
> > deletion issues as the file never needs to be unlinked.
> >
> >
> > On 2 March 2018 at 18:35, Mariusz Zaborski wrote:
> >> Hello,
> >>
> >> Today I would like to propose a new syscall called unlinkfd(2)
> >> which came up during a discussion with Ed Maste.
> >>
> >> Currently in UNIX we can’t remove files safely. If we try to do so
> >> we always end up in a race condition. For example when we open a
> >> file, and check it with fstat, etc. then we want to unlink(2) it…
> >> but the file we are trying to unlink could be a different one than
> >> the one we were fstating just a moment ago.
> >>
> >> Another reason for implementing unlinkfd(2) came to us when we were
> >> trying to sandbox some applications like: uudecode/b64decode or
> >> bspatch. It occurred to us that we don’t have a good way of
> >> removing single files. Of course we can try to determine which
> >> directory we are in, and then open this directory and remove a
> >> single file.
> >>
> >> It looks even more bizarre if we think about a program which
> >> operates on multiple files. If we analyze a situation with two
> >> totally different directories like `/tmp` and `/home/oshogbo` we
> >> would end up pre-opening a root directory or keeping open as many
> >> directories as we are working on. All of that effort only to remove
> >> two files. This makes it totally impractical!
> >>
> >> I think that opening directories also presents a wider attack
> >> vector because we are keeping a single descriptor to a directory
> >> only to remove one file. Unfortunately this means that an attacker
> >> can remove all files in that directory.
> >>
> >> I proposed this as well on the last Capsicum call. There was a
> >> suggestion that instead of doing a single syscall maybe we should
> >> have a Casper service that will allow us to remove files.
> >> Another idea was that we should perhaps redesign programs to create
> >> some subdirs, work on the subdirs, and then remove all files in
> >> this subdir. I don’t feel that creating a Casper service is a good
> >> idea because we still have exactly the same issue of a race
> >> condition. In my opinion creating subdirs is also a problem for us.
> >>
> >> First we would need to redesign some of our tools and I think we
> >> should simplify capsicumization of the process instead of making it
> >> harder.
> >>
> >> Secondly we can create a temporary subdirectory but what will
> >> remove it? We are going back to having an fd to the directory in
> >> which we just created a subdir. Another way would be to have a
> >> Casper service which would remove a directory but with the risk of
> >> a race condition.
> >>
> >> In conclusion, I think we need a syscall like unlinkfd(2), which
> >> turns out to be easy to implement. The only downside of this
> >> implementation is that we need to provide not only an fd but also a
> >> file path. This is because neither inodes nor vnodes contain
> >> filenames. We are comparing vnodes of the fd and the given path; if
> >> they are exactly the same we remove the file. In the syscall we are
> >> using an fd so there is no Ambient Authority because we are proving
> >> that we already have access to that file. Thanks to that the
> >> syscall can be safely used with Capsicum. I have already discussed
> >> this with some people and they said `Hey I already had that idea a
> >> while ago…` so let’s do something with that idea!
> >> If you are interested in the patch you can find it here:
> >> https://reviews.freebsd.org/D14567
> >>
> >> Thanks,
> >> --
> >> Mariusz Zaborski
> >> oshogbo//vx | http://oshogbo.vexillium.org
> >> FreeBSD committer | https://freebsd.org
> >> Software developer | http://wheelsystems.com
> >> If it's not broken, let's fix it till it is!!1
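
The check described in the original proposal (compare the object behind
the descriptor with the object the path names, and remove the name only
if they match) can be approximated in userspace. The sketch below is an
illustration only and is not the code from the review: unlink_if_same
and unlink_if_same_demo are hypothetical names, the comparison uses
st_dev/st_ino rather than vnodes, and the EINVAL on mismatch is an
arbitrary choice. Unlike an in-kernel check, it still leaves a window
between the comparison and the removal, which is the race the proposal
wants to close.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical userspace approximation: remove `name` from directory `dfd`
 * only if it currently names the same object as `fd`.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int
unlink_if_same(int fd, int dfd, const char *name)
{
	struct stat fsb, psb;

	assert(name != NULL);
	if (fstat(fd, &fsb) != 0)
		return (-1);
	/* Do not follow a symlink in the final component. */
	if (fstatat(dfd, name, &psb, AT_SYMLINK_NOFOLLOW) != 0)
		return (-1);
	if (fsb.st_dev != psb.st_dev || fsb.st_ino != psb.st_ino) {
		errno = EINVAL;	/* arbitrary errno for this sketch */
		return (-1);
	}
	/* Race window: `name` could be rebound before the next call. */
	return (unlinkat(dfd, name, 0));
}

/*
 * Exercise the helper in a scratch directory: a mismatched name must
 * survive, a matching name must be removed.  Returns 0 if both hold.
 */
int
unlink_if_same_demo(void)
{
	char dir[] = "/tmp/unlinkfd-demo.XXXXXX";
	int dfd = -1, fd = -1, other = -1, ret = -1;

	if (mkdtemp(dir) == NULL)
		return (-1);
	if ((dfd = open(dir, O_RDONLY | O_DIRECTORY)) < 0)
		goto out;
	fd = openat(dfd, "target", O_CREAT | O_EXCL | O_RDWR, 0600);
	other = openat(dfd, "other", O_CREAT | O_EXCL | O_RDWR, 0600);
	if (fd < 0 || other < 0)
		goto out;

	/* "other" is a different file from fd, so it must not be removed. */
	if (unlink_if_same(fd, dfd, "other") == 0 ||
	    faccessat(dfd, "other", F_OK, 0) != 0)
		goto out;
	/* "target" names the file behind fd, so it must be removed. */
	if (unlink_if_same(fd, dfd, "target") != 0 ||
	    faccessat(dfd, "target", F_OK, 0) == 0)
		goto out;
	ret = 0;
out:
	if (dfd >= 0) {
		/* Best-effort cleanup of whatever is left. */
		(void)unlinkat(dfd, "target", 0);
		(void)unlinkat(dfd, "other", 0);
		(void)close(dfd);
	}
	if (fd >= 0)
		(void)close(fd);
	if (other >= 0)
		(void)close(other);
	(void)rmdir(dir);
	return (ret);
}
```

Note that the helper still needs a directory descriptor and a name, which
matches the point made in the proposal that a path has to be supplied
because the file object itself does not record its names.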