From: Konstantin Belousov <kostikbel@gmail.com>
To: Attilio Rao
Date: Thu, 1 Mar 2012 17:35:41 +0200
Message-ID: <20120301153541.GZ55074@deviant.kiev.zoral.com.ua>
Cc: arch@freebsd.org, Gleb Kurtsou, Pawel Jakub Dawidek
Subject: Re: Prefaulting for i/o buffers
List-Id: Discussion related to FreeBSD architecture

On Thu, Mar 01, 2012 at 03:23:21PM +0000, Attilio Rao wrote:
> 2012/3/1, Konstantin Belousov:
> > On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote:
> >> 2012/3/1, Konstantin Belousov:
> >> > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote:
> >> >> 2012/3/1, Konstantin Belousov:
> >> >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
> >> >> >> 2012/3/1, Pawel Jakub Dawidek:
> >> >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
> >> >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
> >> >> >> >> > - "Every file system needs cache. Let's make it general,
> >> >> >> >> >   so that all file systems can use it!" Well, for VFS
> >> >> >> >> >   each file system is a separate entity, which is not
> >> >> >> >> >   the case for ZFS. ZFS can cache one block only once
> >> >> >> >> >   that is used by one file system, 10 clones and 100
> >> >> >> >> >   snapshots, which all are separate mount points from
> >> >> >> >> >   VFS perspective. The same block would be cached 111
> >> >> >> >> >   times by the buffer cache.
> >> >> >> >>
> >> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or
> >> >> >> >> call cache_entry() on your own), add a number of
> >> >> >> >> cache_prune calls. It's pretty much library-like design
> >> >> >> >> you describe below.
> >> >> >> >
> >> >> >> > Yes, namecache is already library-like, but I was talking
> >> >> >> > about the buffer cache. I managed to bypass it eventually
> >> >> >> > with suggestions from ups@, but for a long time I was sure
> >> >> >> > it isn't at all possible.
> >> >> >>
> >> >> >> Can you please clarify on this as I really don't understand
> >> >> >> what you mean?
> >> >> >>
> >> >> >> >
> >> >> >> >> Everybody agrees that VFS needs more care. But there
> >> >> >> >> haven't been much of concrete suggestions or at least
> >> >> >> >> there is no VFS TODO list.
> >> >> >> >
> >> >> >> > Everybody agrees on that, true, but we disagree on the
> >> >> >> > direction we should move our VFS, ie. make it more
> >> >> >> > light-weight vs. more heavy-weight.
> >> >> >>
> >> >> >> All I'm saying (and Gleb too) is that I don't see any benefit
> >> >> >> in replicating all the vnodes lifecycle at the inode level
> >> >> >> and in the filesystem specific implementation.
> >> >> >> I don't see a simplification in the work to do, I don't think
> >> >> >> this is going to be simpler for a single specific filesystem
> >> >> >> (without mentioning the legacy support, which means
> >> >> >> re-implement inode handling for every filesystem we have
> >> >> >> now), we just lose generality.
> >> >> >>
> >> >> >> if you want a good example of a VFS primitive that was really
> >> >> >> UFS-centric and it was mistakenly made generic is
> >> >> >> vn_start_write() and siblings. I guess it was introduced just
> >> >> >> to cater UFS snapshot creation and then it poisoned other
> >> >> >> consumers.
> >> >> >
> >> >> > vn_start_write() has nothing to do with filesystem code at all.
> >> >> > It is purely VFS layer operation, which shall not be called
> >> >> > from fs code at all. vn_start_secondary_write() is sometimes
> >> >> > useful for the filesystem itself.
> >> >> >
> >> >> > Suspension (not snapshotting) is very useful and allows to
> >> >> > avoid some nasty issues with unmounts, remounts or guaranteed
> >> >> > syncing of the filesystem. The fact that only UFS utilizes
> >> >> > this functionality just shows that other filesystem
> >> >> > implementors do not care about this correctness, or that other
> >> >> > filesystems are not maintained.
> >> >>
> >> >> I'm sure that when I looked into it only UFS suspension was being
> >> >> touched by it and it was introduced back in the days when
> >> >> snapshotting was sanitized.
> >> >>
> >> >> So what are the races it is supposed to fix and other filesystems
> >> >> don't care about?
> >> >
> >> > You cannot reliably sync the filesystem when other writers are
> >> > active. So, for instance, loop over vnodes fsyncing them in
> >> > unmount code can never terminate. The same is true for remounts
> >> > rw->ro.
> >> >
> >> > One of the possible solutions there is to suspend writers. If
> >> > unmount is successful, writer will get a failure from
> >> > vn_start_write() call, while it will proceed normal if unmount is
> >> > terminated or not started at all.
> >>
> >> I don't think we implement that right now, IIRC, but it is an
> >> interesting idea.
> >
> > What don't we implement right now ? Take a look at r183074 (Sep 2008).
>
> Ah sorry, I looked into it before 2008 effectively (and that also
> reminds me why I stopped working on removing that primitive from VFS
> and make it UFS specific one) :)
>
> However why we cannot make a fix like that in domount()/dounmount()
> directly for every R/W filesystem?

At least, the filesystem needs to implement the VFS_SUSP_CLEAN VFS op.
The purpose of the operation is to clean up after suspension. In the
UFS case, for example, VFS_SUSP_CLEAN removes unlinked files whose
reference count went to 0 during suspension, and also processes the
delayed atime updates.
Another issue that I see is the handling of filesystems that offload
i/o to several threads. The unmount thread is given special rights to
perform i/o while the filesystem is suspended, but VFS cannot know
about other threads that shall be permitted to perform writes.

At least, those are the two issues that I remember appearing while
applying suspension to UFS unmount. With all these complications,
suspension is provided in the form of a library for use by filesystem
implementors, and not as a mandatory feature of VFS.
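
For readers who want the suspension protocol from this thread in a
concrete form, here is a minimal userspace sketch in C with pthreads.
It is only a model under stated assumptions, not the kernel code: all
model_* names and struct model_mount are hypothetical, ENXIO is an
arbitrary choice of error, and the real vn_start_write() has a
different signature. It shows the three points discussed above:
writers are gated while a suspension is active, a writer gets a
failure once the unmount has succeeded, and only the suspending thread
may write during the suspension (which is exactly why helper i/o
threads are a problem).

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a mount point; not the kernel's struct mount. */
struct model_mount {
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
	int		writers;	/* writes currently in progress */
	bool		suspended;	/* new writers must wait */
	bool		gone;		/* unmount has completed */
	pthread_t	owner;		/* thread allowed to write while suspended */
};

static void
model_mount_init(struct model_mount *mp)
{
	pthread_mutex_init(&mp->lock, NULL);
	pthread_cond_init(&mp->cv, NULL);
	mp->writers = 0;
	mp->suspended = false;
	mp->gone = false;
	mp->owner = pthread_self();	/* only meaningful while suspended */
}

/* Writer entry gate: sleeps during suspension, fails after unmount. */
static int
model_start_write(struct model_mount *mp)
{
	int error = 0;

	pthread_mutex_lock(&mp->lock);
	while (mp->suspended && !mp->gone &&
	    !pthread_equal(mp->owner, pthread_self()))
		pthread_cond_wait(&mp->cv, &mp->lock);
	if (mp->gone)
		error = ENXIO;		/* assumed error code for the model */
	else
		mp->writers++;
	pthread_mutex_unlock(&mp->lock);
	return (error);
}

/* Writer exit: wakes a suspender waiting for the writers to drain. */
static void
model_finish_write(struct model_mount *mp)
{
	pthread_mutex_lock(&mp->lock);
	assert(mp->writers > 0);
	if (--mp->writers == 0)
		pthread_cond_broadcast(&mp->cv);
	pthread_mutex_unlock(&mp->lock);
}

/* Block new writers, then wait until the active ones have finished. */
static void
model_suspend(struct model_mount *mp)
{
	pthread_mutex_lock(&mp->lock);
	mp->suspended = true;
	mp->owner = pthread_self();
	while (mp->writers > 0)
		pthread_cond_wait(&mp->cv, &mp->lock);
	pthread_mutex_unlock(&mp->lock);
}

/* End the suspension; 'unmounted' tells sleeping writers to fail. */
static void
model_resume(struct model_mount *mp, bool unmounted)
{
	pthread_mutex_lock(&mp->lock);
	mp->suspended = false;
	mp->gone = unmounted;
	pthread_cond_broadcast(&mp->cv);
	pthread_mutex_unlock(&mp->lock);
}
```

In this model an unmount would call model_suspend(), sync and clean up
(the place where a VFS_SUSP_CLEAN-like step would run), and then call
model_resume(mp, true). Any thread other than the owner that calls
model_start_write() in between simply sleeps, which illustrates why a
filesystem with its own i/o threads cannot use such a gate unmodified.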