From owner-freebsd-arch@FreeBSD.ORG Thu Mar  1 17:05:02 2012
Date: Thu, 1 Mar 2012 17:05:00 +0000
From: Attilio Rao
To: Konstantin Belousov
Cc: arch@freebsd.org, Gleb Kurtsou, Pawel Jakub Dawidek
Subject: Re: Prefaulting for i/o buffers
In-Reply-To: <20120301153541.GZ55074@deviant.kiev.zoral.com.ua>
References: <20120225194630.GI1344@garage.freebsd.pl>
 <20120301111624.GB30991@reks>
 <20120301141247.GE1336@garage.freebsd.pl>
 <20120301144708.GV55074@deviant.kiev.zoral.com.ua>
 <20120301150125.GX55074@deviant.kiev.zoral.com.ua>
 <20120301151642.GY55074@deviant.kiev.zoral.com.ua>
 <20120301153541.GZ55074@deviant.kiev.zoral.com.ua>

2012/3/1, Konstantin Belousov:
> On Thu, Mar 01, 2012 at 03:23:21PM +0000, Attilio Rao wrote:
>> 2012/3/1, Konstantin Belousov:
>> > On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote:
>> >> 2012/3/1, Konstantin Belousov:
>> >> > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote:
>> >> >> 2012/3/1, Konstantin Belousov:
>> >> >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
>> >> >> >> 2012/3/1, Pawel Jakub Dawidek:
>> >> >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
>> >> >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
>> >> >> >> >> > - "Every file system needs cache. Let's make it general, so
>> >> >> >> >> >   that all file systems can use it!" Well, for VFS each file
>> >> >> >> >> >   system is a separate entity, which is not the case for ZFS.
>> >> >> >> >> >   ZFS can cache a block only once even if it is used by one
>> >> >> >> >> >   file system, 10 clones and 100 snapshots, which are all
>> >> >> >> >> >   separate mount points from the VFS perspective. The same
>> >> >> >> >> >   block would be cached 111 times by the buffer cache.
>> >> >> >> >>
>> >> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
>> >> >> >> >> cache_enter() on your own), and add a number of cache_purge
>> >> >> >> >> calls. It's pretty much the library-like design you describe
>> >> >> >> >> below.
>> >> >> >> >
>> >> >> >> > Yes, the namecache is already library-like, but I was talking
>> >> >> >> > about the buffer cache. I managed to bypass it eventually with
>> >> >> >> > suggestions from ups@, but for a long time I was sure it wasn't
>> >> >> >> > possible at all.
>> >> >> >>
>> >> >> >> Can you please clarify this, as I really don't understand what
>> >> >> >> you mean?
>> >> >> >>
>> >> >> >> >
>> >> >> >> >> Everybody agrees that VFS needs more care. But there haven't
>> >> >> >> >> been many concrete suggestions, or at least there is no VFS
>> >> >> >> >> TODO list.
>> >> >> >> >
>> >> >> >> > Everybody agrees on that, true, but we disagree on the
>> >> >> >> > direction we should move our VFS, i.e. make it more
>> >> >> >> > light-weight vs. more heavy-weight.
>> >> >> >>
>> >> >> >> All I'm saying (and Gleb too) is that I don't see any benefit in
>> >> >> >> replicating the whole vnode lifecycle at the inode level, in the
>> >> >> >> filesystem-specific implementation.
>> >> >> >> I don't see a simplification in the work to do, and I don't
>> >> >> >> think this is going to be simpler for any single filesystem
>> >> >> >> (not to mention legacy support, which means re-implementing
>> >> >> >> inode handling for every filesystem we have now); we just lose
>> >> >> >> generality.
>> >> >> >>
>> >> >> >> If you want a good example of a VFS primitive that is really
>> >> >> >> UFS-centric and was mistakenly made generic, look at
>> >> >> >> vn_start_write() and its siblings. I guess it was introduced
>> >> >> >> just to cater to UFS snapshot creation and then it poisoned
>> >> >> >> other consumers.
>> >> >> >
>> >> >> > vn_start_write() has nothing to do with filesystem code at all.
>> >> >> > It is a purely VFS-layer operation, which shall not be called
>> >> >> > from fs code at all. vn_start_secondary_write() is sometimes
>> >> >> > useful for the filesystem itself.
>> >> >> >
>> >> >> > Suspension (not snapshotting) is very useful and allows avoiding
>> >> >> > some nasty issues with unmounts, remounts or guaranteed syncing
>> >> >> > of the filesystem. The fact that only UFS utilizes this
>> >> >> > functionality just shows that other filesystem implementors do
>> >> >> > not care about this correctness, or that other filesystems are
>> >> >> > not maintained.
>> >> >>
>> >> >> I'm sure that when I looked into it only UFS suspension was being
>> >> >> touched by it, and it was introduced back in the days when
>> >> >> snapshotting was sanitized.
>> >> >>
>> >> >> So what are the races it is supposed to fix that other filesystems
>> >> >> don't care about?
>> >> >
>> >> > You cannot reliably sync the filesystem when other writers are
>> >> > active.
>> >> > So, for instance, a loop over the vnodes, fsyncing them, in the
>> >> > unmount code might never terminate. The same is true for remounts
>> >> > rw->ro.
>> >> >
>> >> > One of the possible solutions there is to suspend writers. If the
>> >> > unmount is successful, the writer will get a failure from the
>> >> > vn_start_write() call, while it will proceed normally if the
>> >> > unmount is terminated or not started at all.
>> >>
>> >> I don't think we implement that right now, IIRC, but it is an
>> >> interesting idea.
>> >
>> > What don't we implement right now? Take a look at r183074 (Sep 2008).
>>
>> Ah sorry, I actually looked into it before 2008 (and that also reminds
>> me why I stopped working on removing that primitive from VFS and
>> making it a UFS-specific one) :)
>>
>> However, why can't we make a fix like that in domount()/dounmount()
>> directly for every R/W filesystem?
> At least, the filesystem needs to implement the VFS_SUSP_CLEAN VFS op.
> The purpose of the operation is to clean up after suspension; e.g. in
> the UFS case, VFS_SUSP_CLEAN removes unlinked files whose reference
> count went to 0 during suspension, as well as processing delayed atime
> updates.
>
> Another issue that I see is the handling of filesystems that offload
> i/o to several threads. The unmount thread is given special rights to
> perform i/o while the filesystem is suspended, but VFS cannot know
> about the other threads that should be permitted to perform writes.
>
> At least those are the two issues that I remember appearing while
> applying suspension to the UFS unmount path.
>
> With all these complications, suspension is provided in the form of a
> library for use by filesystem implementors, and not as a mandatory
> feature of VFS.

It makes sense, thanks for explaining the issues you found while
implementing this trick on UFS.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
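
For readers following along, a minimal sketch of the writer-side bracket
discussed in the thread, assuming the stock FreeBSD KPI from sys/vnode.h
(vn_start_write(), vn_finished_write(), V_WAIT); the wrapper function,
its name and the placeholder modification step are illustrative
assumptions, not code taken from the thread or from UFS.

#include <sys/param.h>
#include <sys/vnode.h>
#include <sys/mount.h>

/*
 * Hypothetical write path: code about to modify a vnode asks permission
 * first, so that a concurrent suspension (snapshot creation, a
 * suspension-based unmount, or an rw->ro remount) can wait for and then
 * exclude new writers.
 */
static int
example_modify_vnode(struct vnode *vp)
{
	struct mount *mp;
	int error;

	/*
	 * With V_WAIT the call sleeps while the filesystem is suspended.
	 * Per the explanation above, a writer that loses the race with a
	 * successful unmount gets an error back instead of being allowed
	 * to touch the vnode.
	 */
	error = vn_start_write(vp, &mp, V_WAIT);
	if (error != 0)
		return (error);

	/* ... lock vp and perform the actual modification here ... */

	/* Drop the write count so a pending suspension can proceed. */
	vn_finished_write(mp);
	return (0);
}

The suspending side is the counterpart pair vfs_write_suspend() and
vfs_write_resume(), which block new vn_start_write() callers and wait
for the outstanding write count to drain before the suspender performs
its own i/o; that is the machinery UFS snapshots and the
suspension-aware unmount rely on.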