From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 15:16:51 2012
Date: Thu, 1 Mar 2012 17:16:42 +0200
From: Konstantin Belousov
To: Attilio Rao
Cc: arch@freebsd.org, Gleb Kurtsou, Pawel Jakub Dawidek
Subject: Re: Prefaulting for i/o buffers
Message-ID: <20120301151642.GY55074@deviant.kiev.zoral.com.ua>
List-Id: Discussion related to FreeBSD architecture

On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote:
> 2012/3/1, Konstantin Belousov:
> > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote:
> >> 2012/3/1, Konstantin Belousov:
> >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
> >> >> 2012/3/1, Pawel Jakub Dawidek:
> >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
> >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
> >> >> >> > - "Every file system needs cache. Let's make it general, so
> >> >> >> > that all file systems can use it!" Well, for VFS each file
> >> >> >> > system is a separate entity, which is not the case for ZFS.
> >> >> >> > ZFS can cache one block only once that is used by one file
> >> >> >> > system, 10 clones and 100 snapshots, which all are separate
> >> >> >> > mount points from VFS perspective.
> >> >> >> > The same block would be cached 111 times by the buffer cache.
> >> >> >>
> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
> >> >> >> cache_entry() on your own), add a number of cache_prune calls.
> >> >> >> It's pretty much library-like design you describe below.
> >> >> >
> >> >> > Yes, namecache is already library-like, but I was talking about the
> >> >> > buffer cache.
> >> >> > I managed to bypass it eventually with suggestions from
> >> >> > ups@, but for a long time I was sure it isn't at all possible.
> >> >>
> >> >> Can you please clarify this, as I really don't understand what you
> >> >> mean?
> >> >>
> >> >> >
> >> >> >> Everybody agrees that VFS needs more care. But there haven't been
> >> >> >> many concrete suggestions, or at least there is no VFS TODO list.
> >> >> >
> >> >> > Everybody agrees on that, true, but we disagree on the direction
> >> >> > we should move our VFS, i.e. make it more light-weight vs. more
> >> >> > heavy-weight.
> >> >>
> >> >> All I'm saying (and Gleb too) is that I don't see any benefit in
> >> >> replicating the whole vnode lifecycle at the inode level and in the
> >> >> filesystem-specific implementation.
> >> >> I don't see a simplification in the work to do, and I don't think
> >> >> this is going to be simpler for a single specific filesystem (not to
> >> >> mention the legacy support, which means re-implementing inode
> >> >> handling for every filesystem we have now); we just lose generality.
> >> >>
> >> >> If you want a good example of a VFS primitive that was really
> >> >> UFS-centric and was mistakenly made generic, it is vn_start_write()
> >> >> and siblings. I guess it was introduced just to cater to UFS snapshot
> >> >> creation and then it poisoned other consumers.
> >> >
> >> > vn_start_write() has nothing to do with filesystem code at all.
> >> > It is purely a VFS layer operation, which shall not be called from fs
> >> > code at all. vn_start_secondary_write() is sometimes useful for the
> >> > filesystem itself.
> >> >
> >> > Suspension (not snapshotting) is very useful and allows avoiding some
> >> > nasty issues with unmounts, remounts or guaranteed syncing of the
> >> > filesystem.
> >> > The fact that only UFS utilizes this functionality just
> >> > shows that other filesystem implementors do not care about this
> >> > correctness, or that other filesystems are not maintained.
> >>
> >> I'm sure that when I looked into it only UFS suspension was being
> >> touched by it and it was introduced back in the days when snapshotting
> >> was sanitized.
> >>
> >> So what are the races it is supposed to fix and other filesystems
> >> don't care about?
> >
> > You cannot reliably sync the filesystem when other writers are active.
> > So, for instance, a loop over vnodes fsyncing them in the unmount code
> > may never terminate. The same is true for remounts rw->ro.
> >
> > One possible solution there is to suspend writers. If the unmount is
> > successful, the writer will get a failure from the vn_start_write()
> > call, while it will proceed normally if the unmount is terminated or
> > not started at all.
>
> I don't think we implement that right now, IIRC, but it is an interesting
> idea.

What don't we implement right now?
Take a look at r183074 (Sep 2008).
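
The writer-suspension idea discussed in this message can be modelled in user space. What follows is only a minimal sketch under stated assumptions: it uses POSIX threads, every `wgate_*` name is hypothetical, and the ENXIO return value is an arbitrary choice for illustration. It is not the FreeBSD vn_start_write() implementation and does not reflect what r183074 changed. It shows a gate that makes new writers sleep during suspension, lets the suspender wait for in-flight writes to drain, and then either fails or releases the sleeping writers depending on whether the "unmount" succeeded.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of a per-mount write gate (not kernel code). */
struct wgate {
	pthread_mutex_t lock;
	pthread_cond_t  cv;
	int  writers;    /* writes currently in flight */
	bool suspended;  /* new writers must wait */
	bool gone;       /* the "unmount" completed */
};

static void
wgate_init(struct wgate *g)
{
	pthread_mutex_init(&g->lock, NULL);
	pthread_cond_init(&g->cv, NULL);
	g->writers = 0;
	g->suspended = false;
	g->gone = false;
}

/* Enter a write. Sleeps while suspended; fails once the mount is gone. */
static int
wgate_start_write(struct wgate *g)
{
	int error = 0;

	pthread_mutex_lock(&g->lock);
	while (g->suspended && !g->gone)
		pthread_cond_wait(&g->cv, &g->lock);
	if (g->gone)
		error = ENXIO;	/* arbitrary errno chosen for this sketch */
	else
		g->writers++;
	pthread_mutex_unlock(&g->lock);
	return (error);
}

/* Leave a write; the last writer out wakes a waiting suspender. */
static void
wgate_finish_write(struct wgate *g)
{
	pthread_mutex_lock(&g->lock);
	assert(g->writers > 0);
	if (--g->writers == 0)
		pthread_cond_broadcast(&g->cv);
	pthread_mutex_unlock(&g->lock);
}

/* Block new writers, then wait for in-flight writes to drain. */
static void
wgate_suspend(struct wgate *g)
{
	pthread_mutex_lock(&g->lock);
	g->suspended = true;
	while (g->writers > 0)
		pthread_cond_wait(&g->cv, &g->lock);
	pthread_mutex_unlock(&g->lock);
}

/* End suspension; "unmounted" tells sleepers whether the mount went away. */
static void
wgate_resume(struct wgate *g, bool unmounted)
{
	pthread_mutex_lock(&g->lock);
	g->suspended = false;
	g->gone = unmounted;
	pthread_cond_broadcast(&g->cv);
	pthread_mutex_unlock(&g->lock);
}

/* A writer: try to start a write and report the result to the joiner. */
static void *
writer_thread(void *arg)
{
	struct wgate *g = arg;
	int error = wgate_start_write(g);

	if (error == 0)
		wgate_finish_write(g);
	return ((void *)(intptr_t)error);
}

/* Run one scenario; return what the writer got from wgate_start_write(). */
static int
wgate_demo(bool unmount_succeeds)
{
	struct wgate g;
	pthread_t td;
	void *res;

	wgate_init(&g);
	wgate_suspend(&g);	/* no writers yet, so this returns at once */
	pthread_create(&td, NULL, writer_thread, &g);
	wgate_resume(&g, unmount_succeeds);
	pthread_join(td, &res);
	pthread_cond_destroy(&g.cv);
	pthread_mutex_destroy(&g.lock);
	return ((int)(intptr_t)res);
}
```

Calling `wgate_demo(true)` models a successful unmount, where the writer is refused, and `wgate_demo(false)` models an abandoned one, where the writer proceeds. The result does not depend on whether the writer thread reaches the gate before or after the resume, because the gate state is checked under the lock in both cases.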