From owner-freebsd-arch@FreeBSD.ORG Thu Mar  1 17:05:02 2012
Date: Thu, 1 Mar 2012 17:05:00 +0000
From: Attilio Rao
To: Konstantin Belousov
Cc: arch@freebsd.org, Gleb Kurtsou, Pawel Jakub Dawidek
Subject: Re: Prefaulting for i/o buffers
In-Reply-To: <20120301153541.GZ55074@deviant.kiev.zoral.com.ua>
References: <20120225194630.GI1344@garage.freebsd.pl>
 <20120301111624.GB30991@reks>
 <20120301141247.GE1336@garage.freebsd.pl>
 <20120301144708.GV55074@deviant.kiev.zoral.com.ua>
 <20120301150125.GX55074@deviant.kiev.zoral.com.ua>
 <20120301151642.GY55074@deviant.kiev.zoral.com.ua>
 <20120301153541.GZ55074@deviant.kiev.zoral.com.ua>

2012/3/1, Konstantin Belousov:
> On Thu, Mar 01, 2012 at 03:23:21PM +0000, Attilio Rao wrote:
>> 2012/3/1, Konstantin Belousov:
>> > On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote:
>> >> 2012/3/1, Konstantin Belousov:
>> >> > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote:
>> >> >> 2012/3/1, Konstantin Belousov:
>> >> >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
>> >> >> >> 2012/3/1, Pawel Jakub Dawidek:
>> >> >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
>> >> >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
>> >> >> >> >> > - "Every file system needs cache. Let's make it general, so
>> >> >> >> >> >   that all file systems can use it!" Well, for VFS each file
>> >> >> >> >> >   system is a separate entity, which is not the case for ZFS.
>> >> >> >> >> >   ZFS can cache a block only once even if it is used by one
>> >> >> >> >> >   file system, 10 clones and 100 snapshots, which are all
>> >> >> >> >> >   separate mount points from the VFS perspective. The same
>> >> >> >> >> >   block would be cached 111 times by the buffer cache.
>> >> >> >> >>
>> >> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
>> >> >> >> >> cache_enter() on your own), and add a number of cache_purge
>> >> >> >> >> calls. It's pretty much the library-like design you describe
>> >> >> >> >> below.
>> >> >> >> >
>> >> >> >> > Yes, the namecache is already library-like, but I was talking
>> >> >> >> > about the buffer cache. I managed to bypass it eventually with
>> >> >> >> > suggestions from ups@, but for a long time I was sure it wasn't
>> >> >> >> > possible at all.
>> >> >> >>
>> >> >> >> Can you please clarify this, as I really don't understand what
>> >> >> >> you mean?
>> >> >> >>
>> >> >> >> >
>> >> >> >> >> Everybody agrees that VFS needs more care. But there haven't
>> >> >> >> >> been many concrete suggestions, or at least there is no VFS
>> >> >> >> >> TODO list.
>> >> >> >> >
>> >> >> >> > Everybody agrees on that, true, but we disagree on the
>> >> >> >> > direction we should move our VFS, i.e. make it more
>> >> >> >> > light-weight vs. more heavy-weight.
>> >> >> >>
>> >> >> >> All I'm saying (and Gleb too) is that I don't see any benefit in
>> >> >> >> replicating the whole vnode lifecycle at the inode level, in the
>> >> >> >> filesystem-specific implementation.
>> >> >> >> I don't see a simplification in the work to do, and I don't
>> >> >> >> think this is going to be simpler for any single filesystem
>> >> >> >> (not to mention legacy support, which means re-implementing
>> >> >> >> inode handling for every filesystem we have now); we just lose
>> >> >> >> generality.
>> >> >> >>
>> >> >> >> If you want a good example of a VFS primitive that is really
>> >> >> >> UFS-centric and was mistakenly made generic, look at
>> >> >> >> vn_start_write() and its siblings. I guess it was introduced
>> >> >> >> just to cater to UFS snapshot creation and then it poisoned
>> >> >> >> other consumers.
>> >> >> >
>> >> >> > vn_start_write() has nothing to do with filesystem code at all.
>> >> >> > It is a purely VFS-layer operation, which shall not be called
>> >> >> > from fs code at all. vn_start_secondary_write() is sometimes
>> >> >> > useful for the filesystem itself.
>> >> >> >
>> >> >> > Suspension (not snapshotting) is very useful and allows avoiding
>> >> >> > some nasty issues with unmounts, remounts or guaranteed syncing
>> >> >> > of the filesystem. The fact that only UFS utilizes this
>> >> >> > functionality just shows that other filesystem implementors do
>> >> >> > not care about this correctness, or that other filesystems are
>> >> >> > not maintained.
>> >> >>
>> >> >> I'm sure that when I looked into it only UFS suspension was being
>> >> >> touched by it, and it was introduced back in the days when
>> >> >> snapshotting was sanitized.
>> >> >>
>> >> >> So what are the races it is supposed to fix that other filesystems
>> >> >> don't care about?
>> >> >
>> >> > You cannot reliably sync the filesystem when other writers are
>> >> > active.
>> >> > So, for instance, a loop over the vnodes, fsyncing them, in the
>> >> > unmount code might never terminate. The same is true for remounts
>> >> > rw->ro.
>> >> >
>> >> > One of the possible solutions there is to suspend writers. If the
>> >> > unmount is successful, the writer will get a failure from the
>> >> > vn_start_write() call, while it will proceed normally if the
>> >> > unmount is terminated or not started at all.
>> >>
>> >> I don't think we implement that right now, IIRC, but it is an
>> >> interesting idea.
>> >
>> > What don't we implement right now? Take a look at r183074 (Sep 2008).
>>
>> Ah sorry, I actually looked into it before 2008 (and that also reminds
>> me why I stopped working on removing that primitive from VFS and
>> making it a UFS-specific one) :)
>>
>> However, why can't we make a fix like that in domount()/dounmount()
>> directly for every R/W filesystem?
> At least, the filesystem needs to implement the VFS_SUSP_CLEAN VFS op.
> The purpose of the operation is to clean up after suspension; e.g. in
> the UFS case, VFS_SUSP_CLEAN removes unlinked files whose reference
> count went to 0 during suspension, as well as processing delayed atime
> updates.
>
> Another issue that I see is the handling of filesystems that offload
> i/o to several threads. The unmount thread is given special rights to
> perform i/o while the filesystem is suspended, but VFS cannot know
> about the other threads that should be permitted to perform writes.
>
> At least those are the two issues that I remember appearing while
> applying suspension to the UFS unmount path.
>
> With all these complications, suspension is provided in the form of a
> library for use by filesystem implementors, and not as a mandatory
> feature of VFS.

It makes sense, thanks for explaining the issues you found while
implementing this trick on UFS.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein
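
For readers following along, a minimal sketch of the writer-side bracket
discussed in the thread, assuming the stock FreeBSD KPI from sys/vnode.h
(vn_start_write(), vn_finished_write(), V_WAIT); the wrapper function,
its name and the placeholder modification step are illustrative
assumptions, not code taken from the thread or from UFS.

#include <sys/param.h>
#include <sys/vnode.h>
#include <sys/mount.h>

/*
 * Hypothetical write path: code about to modify a vnode asks permission
 * first, so that a concurrent suspension (snapshot creation, a
 * suspension-based unmount, or an rw->ro remount) can wait for and then
 * exclude new writers.
 */
static int
example_modify_vnode(struct vnode *vp)
{
	struct mount *mp;
	int error;

	/*
	 * With V_WAIT the call sleeps while the filesystem is suspended.
	 * Per the explanation above, a writer that loses the race with a
	 * successful unmount gets an error back instead of being allowed
	 * to touch the vnode.
	 */
	error = vn_start_write(vp, &mp, V_WAIT);
	if (error != 0)
		return (error);

	/* ... lock vp and perform the actual modification here ... */

	/* Drop the write count so a pending suspension can proceed. */
	vn_finished_write(mp);
	return (0);
}

The suspending side is the counterpart pair vfs_write_suspend() and
vfs_write_resume(), which block new vn_start_write() callers and wait
for the outstanding write count to drain before the suspender performs
its own i/o; that is the machinery UFS snapshots and the
suspension-aware unmount rely on.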