From owner-freebsd-arch@FreeBSD.ORG  Sun Feb 26 14:02:56 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 7ECC6106566C;
	Sun, 26 Feb 2012 14:02:56 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com
	[209.85.215.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 922998FC0C;
	Sun, 26 Feb 2012 14:02:55 +0000 (UTC)
Received: by lagz14 with SMTP id z14so6300630lag.13
	for <multiple recipients>; Sun, 26 Feb 2012 06:02:54 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type;
	bh=VkgoBzXDFoYv5pvvupxftiespIM+8ejQqbPGy3xiNOs=;
	b=A5cAa6FJS0ZJWWEB+OKz7ScLmtoY6HqbjVo3mF5q5iJJkZkhloUbUjAuW3C39T22BK
	E76eVQy2+tZBQFyckK3sxi2dsF9q19qFrX9daEDNPl9N0PtI7C4hOQ9m4T+Kk6lLH7NJ
	7/fW21rhdMBYY6dSc7J+AHo/UqLCYvk/SqpOo=
MIME-Version: 1.0
Received: by 10.152.130.234 with SMTP id oh10mr7407327lab.35.1330264974466;
	Sun, 26 Feb 2012 06:02:54 -0800 (PST)
Sender: asmrookie@gmail.com
Received: by 10.112.41.5 with HTTP; Sun, 26 Feb 2012 06:02:54 -0800 (PST)
In-Reply-To: <20120225210339.GM55074@deviant.kiev.zoral.com.ua>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225210339.GM55074@deviant.kiev.zoral.com.ua>
Date: Sun, 26 Feb 2012 15:02:54 +0100
X-Google-Sender-Auth: 1JO8JTL6BqDB7R3Hz9TRRKxu-uQ
Message-ID: <CAJ-FndDZpDXqDRR=kT_eQcHbeg3vdiUjnygy1=QLvVuumUsgBw@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Content-Type: text/plain; charset=UTF-8
Cc: arch@freebsd.org, Florian Smeets <flo@freebsd.org>,
	Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 26 Feb 2012 14:02:56 -0000

On 25 February 2012 22:03, Konstantin Belousov <kostikbel@gmail.com> wrote:
> On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
>> On 25 February 2012 16:13, Pawel Jakub Dawidek <pjd@freebsd.org> wrote:
>> > On Sat, Feb 25, 2012 at 01:01:32PM +0000, Attilio Rao wrote:
>> >> On 3 February 2012 19:37, Konstantin Belousov <kostikbel@gmail.com> wrote:
>> >> > FreeBSD I/O infrastructure has a well-known issue with deadlock
>> >> > caused by vnode lock order reversal when buffers supplied to read(2)
>> >> > or write(2) syscalls are backed by an mmaped file.
>> >> >
>> >> > I previously published the patches to convert the i/o path to use
>> >> > VMIO, based on the Jeff Roberson proposal, see
>> >> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
>> >> > deadlock. Since that work is very intrusive and did not get any
>> >> > follow-up, it got stalled.
>> >> >
>> >> > Below is a very lightweight patch whose only goal is to fix the
>> >> > deadlock in the least intrusive way. This is possible after FreeBSD
>> >> > got the vm_fault_quick_hold_pages(9) and
>> >> > vm_fault_disable_pagefaults(9) KPIs.
>> >> > http://people.freebsd.org/~kib/misc/vm1.3.patch
>> >>
>> >> Hi,
>> >> I was reviewing:
>> >> http://people.freebsd.org/~kib/misc/vm1.11.patch
>> >>
>> >> and I think it is great. It is simple enough and I don't have further
>> >> comments on it.
> Thank you.
>
> This spoiled an announcement I intended to send this weekend :)
>
>> >>
>> >> However, as a side note, I was wondering whether we could one day
>> >> get to the point of integrating rangelocks into the vnode lockmgr
>> >> directly. It would be a huge patch, likely rewriting the locking of
>> >> several members of vnodes, but I think it would be worth it in terms
>> >> of cleanness of the interface and less overhead. Also, it would be
>> >> interesting to consider merging the rangelock implementation with
>> >> the ZFS one, at some point.
>> >
>> > My personal opinion about rangelocks and many other VFS features we
>> > currently have is that they are a good idea in theory, but in
>> > practice they tend to overcomplicate VFS.
>> >
>> > I'm of the opinion that we should move as much stuff as we can to
>> > individual file systems. We try to implement everything in VFS
>> > itself in the hope that this will simplify the file systems we have.
>> > It then turns out only one file system is really using this stuff
>> > (most of the time it is UFS) and this is a PITA for all the other
>> > file systems as well as for maintaining VFS. VFS became so
>> > complicated over the years that there are maybe only a few people
>> > that can understand it, and every single change to VFS is a huge
>> > risk of potentially breaking some unrelated parts.
>>
>> I think this is questionable for the following reasons:
>> - If the problem is filesystem writers having trouble
>> understanding the necessary locking, we should really provide cleaner
>> and more complete documentation. One would think the same of our VM
>> subsystem, but at least in that case there are plenty of comments
>> that help in understanding how to deal with vm_object and vm_page
>> locking during their lifetimes.
>> - Our primitives may be more complicated than the
>> 'all-in-the-filesystem' ones, but at least they offer a complete and
>> centralized view over the resources we have allocated in the whole
>> system and they allow building better policies about how to manage
>> them. One problem I see here is that those policies are not fully
>> implemented or tuned, or have just become outdated, removing one of
>> the biggest benefits that we get from making vnodes so generic.
>>
>> About the thing I mentioned myself:
>> - As long as the same path now has both range-locking and vnode
>> locking, I don't see it as a good idea to keep both separated
>> forever. Merging them seems to me an important evolution (not only
>> helping to shrink the number of primitives themselves but also
>> introducing less overhead and likely revamped scalability for
>> vnodes, but I think this needs a deep investigation).
> The proper direction to move there is to designate the vnode lock for
> the vnode structure protection, and have the range lock protect the
> i/o atomicity. This is somewhat done in the proposed patch (since
> now the vnode lock does not protect the i/o operation, but only
> chunked i/o transactions inside the operation).
>
> Jeff's idea of using the page cache as the source of i/o data
> (implemented in the VM6 patchset) pushes the idea much further.
> E.g., the write typically does not obtain the write vnode lock (but
> sometimes it has to, in order to extend the vnode).
>
> Probably, I will revive VM6 after this change has landed.

About that, I guess we should be careful.
The first thing would be having a very scalable VM subsystem, and
recent benchmarks have shown that this is not yet the case. Florian,
CC'ed, can share some pmc/LOCK_PROFILE analysis on pgsql that, even
with the vmcontention patch, shows a lot of contention on vm_object,
the pmap lock and vm_page_queue_lock. We have some plans for each of
them; we can discuss them in a separate thread if you prefer. This is
just to say that we may need more work in the underlying areas to bring
VM6 to the point where it will really make a difference.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Sun Feb 26 14:04:21 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5A677106567A;
	Sun, 26 Feb 2012 14:04:21 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com
	[209.85.215.54])
	by mx1.freebsd.org (Postfix) with ESMTP id A44E08FC1C;
	Sun, 26 Feb 2012 14:04:20 +0000 (UTC)
Received: by lagz14 with SMTP id z14so6301703lag.13
	for <multiple recipients>; Sun, 26 Feb 2012 06:04:19 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	bh=wyVdTHiwTYnwxOG2gKRHeGl992/uLImcXX664FXO5x4=;
	b=UYV+J5uilhPOr3P2+S2mNtz/rJfDemGRqhOFIUTY2fryxNgVe26zIwHsNSsSNPa01J
	JlZO0FOKi6618Z23bpVTlFllkrEY6PdQfGyyJY1Hxefsd+tOQoMKeC59fWWX6T17MQrE
	b/QBTzTrtmhP7Lf2B1hWxCzFe0LZSAKQLnJco=
MIME-Version: 1.0
Received: by 10.112.27.199 with SMTP id v7mr3412896lbg.36.1330265059463; Sun,
	26 Feb 2012 06:04:19 -0800 (PST)
Sender: asmrookie@gmail.com
Received: by 10.112.41.5 with HTTP; Sun, 26 Feb 2012 06:04:19 -0800 (PST)
In-Reply-To: <20120225194630.GI1344@garage.freebsd.pl>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
Date: Sun, 26 Feb 2012 15:04:19 +0100
X-Google-Sender-Auth: o5D1MLltHuoq3NS-UbW0zmBkRC0
Message-ID: <CAJ-FndBp9Eb5vVibXoLTLYCOELxJtDKY56MwpA9Kyk=OhiuaQw@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: Pawel Jakub Dawidek <pjd@freebsd.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Cc: Konstantin Belousov <kostikbel@gmail.com>, arch@freebsd.org
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 26 Feb 2012 14:04:21 -0000

On 25 February 2012 20:46, Pawel Jakub Dawidek <pjd@freebsd.org> wrote:
> On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
>> On 25 February 2012 16:13, Pawel Jakub Dawidek <pjd@freebsd.org> wrote:
>> > My personal opinion about rangelocks and many other VFS features we
>> > currently have is that they are a good idea in theory, but in
>> > practice they tend to overcomplicate VFS.
>> >
>> > I'm of the opinion that we should move as much stuff as we can to
>> > individual file systems. We try to implement everything in VFS
>> > itself in the hope that this will simplify the file systems we have.
>> > It then turns out only one file system is really using this stuff
>> > (most of the time it is UFS) and this is a PITA for all the other
>> > file systems as well as for maintaining VFS. VFS became so
>> > complicated over the years that there are maybe only a few people
>> > that can understand it, and every single change to VFS is a huge
>> > risk of potentially breaking some unrelated parts.
>>
>> I think this is questionable for the following reasons:
>> - If the problem is filesystem writers having trouble
>> understanding the necessary locking, we should really provide cleaner
>> and more complete documentation. One would think the same of our VM
>> subsystem, but at least in that case there are plenty of comments
>> that help in understanding how to deal with vm_object and vm_page
>> locking during their lifetimes.
>
> Documentation is not the answer here. If the code is so complex that it
> is harder to learn, then no matter how good the documentation is, it makes
> fewer people willing to learn it in the first place and it makes the code
> more buggy, because there are more edge/special cases you can forget about.
>
>> - Our primitives may be more complicated than the
>> 'all-in-the-filesystem' ones, but at least they offer a complete and
>> centralized view over the resources we have allocated in the whole
>> system and they allow building better policies about how to manage
>> them. One problem I see here is that those policies are not fully
>> implemented or tuned, or have just become outdated, removing one of
>> the biggest benefits that we get from making vnodes so generic.
>
> Again, this is only a nice theory that is far from being the reality.
> You will never be able to have control over all the resources allocated by
> file systems.
>
>> About the thing I mentioned myself:
>> - As long as the same path now has both range-locking and vnode
>> locking, I don't see it as a good idea to keep both separated
>> forever. Merging them seems to me an important evolution (not only
>> helping to shrink the number of primitives themselves but also
>> introducing less overhead and likely revamped scalability for
>> vnodes, but I think this needs a deep investigation).
>> - About ZFS rangelocks absorbing the VFS ones, I think this is a minor
>> point, but still, if you think it can be done efficiently and without
>> losing performance I don't see why not do that. You already wrote
>> rangelocks for ZFS, so you have earned a lot of experience in this
>> area and can comment on fallouts, etc., but I don't see a good reason
>> not to do that, unless it is just too difficult. This is not about
>> generalizing a new mechanism; it is using a general mechanism in a
>> specific implementation, if possible.
>
> I did not implement rangelocking for ZFS. It came with ZFS when I ported
> it. As long as we want to merge changes from upstream (which is now IllumOS)
> we don't want to make huge changes just for the sake of proving that
> this is a general-purpose mechanism used by more than one file system.
>
> Attilio, don't get me wrong. In 99% of cases it is good to make code more
> general and more universal and reusable, but we can't ignore reality.
>
> There are reasons why file systems like XFS, ReiserFS and others were
> never fully ported. I'm not saying VFS complexity was the only reason,
> but I'm sure it was one of them.
>
> Our VFS is very UFS-centric. We make so many assumptions that sound
> fine only for UFS. I saw plenty of those while working on ZFS, like:
>
> - "Every file system needs a cache. Let's make it general, so that all
>   file systems can use it!" Well, for VFS each file system is a separate
>   entity, which is not the case for ZFS. ZFS can cache only once a block
>   that is used by one file system, 10 clones and 100 snapshots,
>   which are all separate mount points from the VFS perspective.
>   The same block would be cached 111 times by the buffer cache.
>
> - "rmdir(2) on a mountpoint is a bad idea, let's deny it at VFS level."
>   It is a bad idea, indeed, but in ZFS it is a nice way to remove a
>   snapshot by rmdiring the .zfs/snapshot/<name> directory.
>
> - No one implemented rangelocking in VFS, so no file system can use it,
>   even if the given file system has all the code to do it.
>
> etc.
>
> I'm also sure it would be way easier for Jeff to make VFS MP-safe if it
> were less complex.
>
> When looking at the big picture, it would be nice to have all this
> general stuff like rangelocking, quota, buffer cache, etc. as some kind
> of libraries for file systems to use and not something that is
> mandatory. If I develop a file system for FreeBSD only and I don't want
> to reinvent the wheel, I can use those libraries. If I port a file system
> to FreeBSD or develop a file system that doesn't really need those
> libraries, I'm not forced to use them.
>
> All this might make a good working group subject at the BSDCan devsummit.
> We could cross swords there:)

Do you think you will be able to chair such a group?
I'm not sure I will be able to make it to BSDCan, but it would be
valuable if you or someone else interested could get the ball rolling on
these topics.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Sun Feb 26 14:13:39 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 68652106564A;
	Sun, 26 Feb 2012 14:13:39 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200])
	by mx1.freebsd.org (Postfix) with ESMTP id ECF8D8FC0A;
	Sun, 26 Feb 2012 14:13:38 +0000 (UTC)
Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q1QEDYW4068821;
	Sun, 26 Feb 2012 16:13:34 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id
	q1QEDYZe027147; Sun, 26 Feb 2012 16:13:34 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q1QEDYC5027146; 
	Sun, 26 Feb 2012 16:13:34 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to
	kostikbel@gmail.com using -f
Date: Sun, 26 Feb 2012 16:13:34 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Attilio Rao <attilio@freebsd.org>
Message-ID: <20120226141334.GU55074@deviant.kiev.zoral.com.ua>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225210339.GM55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndDZpDXqDRR=kT_eQcHbeg3vdiUjnygy1=QLvVuumUsgBw@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="hZWqkIq97iJ4fJXE"
Content-Disposition: inline
In-Reply-To: <CAJ-FndDZpDXqDRR=kT_eQcHbeg3vdiUjnygy1=QLvVuumUsgBw@mail.gmail.com>
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	skuns.kiev.zoral.com.ua
Cc: arch@freebsd.org, Florian Smeets <flo@freebsd.org>,
	Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 26 Feb 2012 14:13:39 -0000


--hZWqkIq97iJ4fJXE
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sun, Feb 26, 2012 at 03:02:54PM +0100, Attilio Rao wrote:
> On 25 February 2012 22:03, Konstantin Belousov <kostikbel@gmail.com> wrote:
> > On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
> >> On 25 February 2012 16:13, Pawel Jakub Dawidek <pjd@freebsd.org> wrote:
> >> > On Sat, Feb 25, 2012 at 01:01:32PM +0000, Attilio Rao wrote:
> >> >> On 3 February 2012 19:37, Konstantin Belousov <kostikbel@gmail.com> wrote:
> >> >> > FreeBSD I/O infrastructure has a well-known issue with deadlock
> >> >> > caused by vnode lock order reversal when buffers supplied to read(2)
> >> >> > or write(2) syscalls are backed by an mmaped file.
> >> >> >
> >> >> > I previously published the patches to convert the i/o path to use
> >> >> > VMIO, based on the Jeff Roberson proposal, see
> >> >> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
> >> >> > deadlock. Since that work is very intrusive and did not get any
> >> >> > follow-up, it got stalled.
> >> >> >
> >> >> > Below is a very lightweight patch whose only goal is to fix the
> >> >> > deadlock in the least intrusive way. This is possible after FreeBSD
> >> >> > got the vm_fault_quick_hold_pages(9) and
> >> >> > vm_fault_disable_pagefaults(9) KPIs.
> >> >> > http://people.freebsd.org/~kib/misc/vm1.3.patch
> >> >>
> >> >> Hi,
> >> >> I was reviewing:
> >> >> http://people.freebsd.org/~kib/misc/vm1.11.patch
> >> >>
> >> >> and I think it is great. It is simple enough and I don't have further
> >> >> comments on it.
> > Thank you.
> >
> > This spoiled an announcement I intended to send this weekend :)
> >
> >> >>
> >> >> However, as a side note, I was wondering whether we could one day
> >> >> get to the point of integrating rangelocks into the vnode lockmgr
> >> >> directly. It would be a huge patch, likely rewriting the locking of
> >> >> several members of vnodes, but I think it would be worth it in terms
> >> >> of cleanness of the interface and less overhead. Also, it would be
> >> >> interesting to consider merging the rangelock implementation with
> >> >> the ZFS one, at some point.
> >> >
> >> > My personal opinion about rangelocks and many other VFS features we
> >> > currently have is that they are a good idea in theory, but in
> >> > practice they tend to overcomplicate VFS.
> >> >
> >> > I'm of the opinion that we should move as much stuff as we can to
> >> > individual file systems. We try to implement everything in VFS
> >> > itself in the hope that this will simplify the file systems we have.
> >> > It then turns out only one file system is really using this stuff
> >> > (most of the time it is UFS) and this is a PITA for all the other
> >> > file systems as well as for maintaining VFS. VFS became so
> >> > complicated over the years that there are maybe only a few people
> >> > that can understand it, and every single change to VFS is a huge
> >> > risk of potentially breaking some unrelated parts.
> >>
> >> I think this is questionable for the following reasons:
> >> - If the problem is filesystem writers having trouble
> >> understanding the necessary locking, we should really provide cleaner
> >> and more complete documentation. One would think the same of our VM
> >> subsystem, but at least in that case there are plenty of comments
> >> that help in understanding how to deal with vm_object and vm_page
> >> locking during their lifetimes.
> >> - Our primitives may be more complicated than the
> >> 'all-in-the-filesystem' ones, but at least they offer a complete and
> >> centralized view over the resources we have allocated in the whole
> >> system and they allow building better policies about how to manage
> >> them. One problem I see here is that those policies are not fully
> >> implemented or tuned, or have just become outdated, removing one of
> >> the biggest benefits that we get from making vnodes so generic.
> >>
> >> About the thing I mentioned myself:
> >> - As long as the same path now has both range-locking and vnode
> >> locking, I don't see it as a good idea to keep both separated
> >> forever. Merging them seems to me an important evolution (not only
> >> helping to shrink the number of primitives themselves but also
> >> introducing less overhead and likely revamped scalability for
> >> vnodes, but I think this needs a deep investigation).
> > The proper direction to move there is to designate the vnode lock for
> > the vnode structure protection, and have the range lock protect the
> > i/o atomicity. This is somewhat done in the proposed patch (since
> > now the vnode lock does not protect the i/o operation, but only
> > chunked i/o transactions inside the operation).
> >
> > Jeff's idea of using the page cache as the source of i/o data
> > (implemented in the VM6 patchset) pushes the idea much further.
> > E.g., the write typically does not obtain the write vnode lock (but
> > sometimes it has to, in order to extend the vnode).
> >
> > Probably, I will revive VM6 after this change has landed.
>
> About that, I guess we should be careful.
> The first thing would be having a very scalable VM subsystem, and
> recent benchmarks have shown that this is not yet the case. Florian,
> CC'ed, can share some pmc/LOCK_PROFILE analysis on pgsql that, even
> with the vmcontention patch, shows a lot of contention on vm_object,
> the pmap lock and vm_page_queue_lock. We have some plans for each of
> them; we can discuss them in a separate thread if you prefer. This is
> just to say that we may need more work in the underlying areas to bring
> VM6 to the point where it will really make a difference.

The benchmarks that were done at that time demonstrated that VM6 does not
cause regressions for e.g. buildworld time, and gives a marginal improvement,
around 10%, for some postgresql loads.

The main benefit of VM6 on UFS is that writers no longer block readers
for separate i/o ranges. Also, due to the vm_page flags locking improvements,
I suspect the VM6 backpressure code might be simplified and give an even
larger benefit right now.
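
As a rough illustration of what "separate i/o ranges" means for blocking,
here is a minimal sketch of a range conflict check. The type and function
names (rl_mode, rl_req, rl_overlaps, rl_conflicts) are hypothetical and are
not taken from the VM6 or vm1 patches; the sketch only assumes half-open
byte ranges and the usual rule that readers share while writers exclude.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical illustration types; not the kernel's rangelock structures. */
enum rl_mode { RL_READ, RL_WRITE };

struct rl_req {
	off_t		start;	/* first byte covered by the request */
	off_t		end;	/* one past the last byte covered */
	enum rl_mode	mode;
};

/* Two half-open ranges overlap if each begins before the other ends. */
bool
rl_overlaps(const struct rl_req *a, const struct rl_req *b)
{
	return (a->start < b->end && b->start < a->end);
}

/* Readers share; any overlap that involves a writer is a conflict. */
bool
rl_conflicts(const struct rl_req *a, const struct rl_req *b)
{
	/* Assumed precondition: both requests cover at least one byte. */
	assert(a->start < a->end && b->start < b->end);
	if (a->mode == RL_READ && b->mode == RL_READ)
		return (false);
	return (rl_overlaps(a, b));
}
```

Under this rule a writer on bytes [0, 4096) would not block a reader on
[8192, 12288), whereas a single exclusive lock over the whole file
serializes the two.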

Anyway, I do not think that VM6 can be put into HEAD quickly, and I want
to finish with VM1/prefaulting right now.
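
For readers joining the thread late, the setup behind the deadlock is a
buffer passed to read(2) or write(2) that is itself backed by an mmaped
file. Below is a minimal userland sketch of that setup. The helper name
read_into_mapped_file and the choice of paths are assumptions made for
illustration only. Run single-threaded, it is harmless; as far as I
understand the problem, the lock order reversal needs a second thread doing
the same thing with the two files swapped.

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Read up to len bytes from src_path into a buffer that is a shared
 * mapping of dst_path.  Returns the result of read(2), or -1 if the
 * setup fails.  Hypothetical helper, for illustration only.
 */
ssize_t
read_into_mapped_file(const char *src_path, const char *dst_path, size_t len)
{
	ssize_t n = -1;
	void *buf;
	int src, dst;

	assert(src_path != NULL && dst_path != NULL);
	src = open(src_path, O_RDONLY);
	if (src == -1)
		return (-1);
	dst = open(dst_path, O_RDWR | O_CREAT, 0600);
	if (dst == -1) {
		close(src);
		return (-1);
	}
	/* Make sure the mapped region is backed by file blocks. */
	if (ftruncate(dst, (off_t)len) == -1)
		goto out;
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, dst, 0);
	if (buf == MAP_FAILED)
		goto out;
	/*
	 * The kernel copies data from src into buf while servicing the
	 * read.  Touching buf may page-fault, and handling that fault
	 * involves the file behind the mapping (dst).
	 */
	n = read(src, buf, len);
	munmap(buf, len);
out:
	close(dst);
	close(src);
	return (n);
}
```

Prefaulting the user buffer before the vnode lock is taken is meant to keep
that fault from having to be handled while the lock is held.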

--hZWqkIq97iJ4fJXE
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAk9KPg4ACgkQC3+MBN1Mb4jgeQCgmjogiXqR8U7bZcOJ50tiEfb1
vi4An0XaOgTsNFD0GGIGbVqPw0kOUB+I
=ykEh
-----END PGP SIGNATURE-----

--hZWqkIq97iJ4fJXE--

From owner-freebsd-arch@FreeBSD.ORG  Sun Feb 26 14:16:15 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 50903106564A;
	Sun, 26 Feb 2012 14:16:15 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com
	[209.85.215.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 59C518FC13;
	Sun, 26 Feb 2012 14:16:13 +0000 (UTC)
Received: by lagz14 with SMTP id z14so6311048lag.13
	for <multiple recipients>; Sun, 26 Feb 2012 06:16:13 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type;
	bh=oxsmmjsYVnKNZW19L/RK7s/wIqD4CQWUvJtpJUxc9Ro=;
	b=eTuD+09O3JgWRhyGXnOxHroAwGZXR1PbZbW5qEp23SYyYDt0PF+wGILGCgeX5j6+IL
	9a8WTNzCHy2HyRLCgJZ6holL2beygXqv/aedCFdB5px1L9BUPkcI9PGnCX8MO7O4Xnhs
	/Jy+GEcDu/0QNzMUtUtbLA0irmSZYkPlsoXxU=
MIME-Version: 1.0
Received: by 10.112.10.41 with SMTP id f9mr3456684lbb.8.1330265772956; Sun, 26
	Feb 2012 06:16:12 -0800 (PST)
Sender: asmrookie@gmail.com
Received: by 10.112.41.5 with HTTP; Sun, 26 Feb 2012 06:16:12 -0800 (PST)
In-Reply-To: <20120226141334.GU55074@deviant.kiev.zoral.com.ua>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225210339.GM55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndDZpDXqDRR=kT_eQcHbeg3vdiUjnygy1=QLvVuumUsgBw@mail.gmail.com>
	<20120226141334.GU55074@deviant.kiev.zoral.com.ua>
Date: Sun, 26 Feb 2012 14:16:12 +0000
X-Google-Sender-Auth: JiD3cWG7nSA1ai8Zuaj3MJFHco4
Message-ID: <CAJ-FndAme7Joe1hd05VbvmA4C7_9q26ZQncBKtdBBjGawHqrHQ@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Content-Type: text/plain; charset=UTF-8
Cc: arch@freebsd.org, Florian Smeets <flo@freebsd.org>,
	Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 26 Feb 2012 14:16:15 -0000

On 26 February 2012 14:13, Konstantin Belousov <kostikbel@gmail.com> wrote:
> On Sun, Feb 26, 2012 at 03:02:54PM +0100, Attilio Rao wrote:
>> On 25 February 2012 22:03, Konstantin Belousov <kostikbel@gmail.com> wrote:
>> > On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
>> >> On 25 February 2012 16:13, Pawel Jakub Dawidek <pjd@freebsd.org> wrote:
>> >> > On Sat, Feb 25, 2012 at 01:01:32PM +0000, Attilio Rao wrote:
>> >> >> On 3 February 2012 19:37, Konstantin Belousov <kostikbel@gmail.com> wrote:
>> >> >> > FreeBSD I/O infrastructure has a well-known issue with deadlock
>> >> >> > caused by vnode lock order reversal when buffers supplied to read(2)
>> >> >> > or write(2) syscalls are backed by an mmaped file.
>> >> >> >
>> >> >> > I previously published the patches to convert the i/o path to use
>> >> >> > VMIO, based on the Jeff Roberson proposal, see
>> >> >> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
>> >> >> > deadlock. Since that work is very intrusive and did not get any
>> >> >> > follow-up, it got stalled.
>> >> >> >
>> >> >> > Below is a very lightweight patch whose only goal is to fix the
>> >> >> > deadlock in the least intrusive way. This is possible after FreeBSD
>> >> >> > got the vm_fault_quick_hold_pages(9) and
>> >> >> > vm_fault_disable_pagefaults(9) KPIs.
>> >> >> > http://people.freebsd.org/~kib/misc/vm1.3.patch
>> >> >>
>> >> >> Hi,
>> >> >> I was reviewing:
>> >> >> http://people.freebsd.org/~kib/misc/vm1.11.patch
>> >> >>
>> >> >> and I think it is great. It is simple enough and I don't have further
>> >> >> comments on it.
>> > Thank you.
>> >
>> > This spoiled an announcement I intended to send this weekend :)
>> >
>> >> >>
>> >> >> However, as a side note, I was wondering whether we could one day
>> >> >> get to the point of integrating rangelocks into the vnode lockmgr
>> >> >> directly. It would be a huge patch, likely rewriting the locking of
>> >> >> several members of vnodes, but I think it would be worth it in terms
>> >> >> of cleanness of the interface and less overhead. Also, it would be
>> >> >> interesting to consider merging the rangelock implementation with
>> >> >> the ZFS one, at some point.
>> >> >
>> >> > My personal opinion about rangelocks and many other VFS features we
>> >> > currently have is that they are a good idea in theory, but in
>> >> > practice they tend to overcomplicate VFS.
>> >> >
>> >> > I'm of the opinion that we should move as much stuff as we can to
>> >> > individual file systems. We try to implement everything in VFS
>> >> > itself in the hope that this will simplify the file systems we have.
>> >> > It then turns out only one file system is really using this stuff
>> >> > (most of the time it is UFS) and this is a PITA for all the other
>> >> > file systems as well as for maintaining VFS. VFS became so
>> >> > complicated over the years that there are maybe only a few people
>> >> > that can understand it, and every single change to VFS is a huge
>> >> > risk of potentially breaking some unrelated parts.
>> >>
>> >> I think this is questionable due to the following assets:
>> >> - If the problem is filesystems writers having trouble in
>> >> understanding the necessary locking we should really provide cleaner
>> >> and more complete documentation. One would think the same with our VM
>> >> subsystem, but at least in that case there is plenty of comments that
>> >> help understanding how to deal with vm_object, vm_pages locking during
>> >> their lifelines.
>> >> - Our primitives may be more complicated than the
>> >> 'all-in-the-filesystem' one, but at least they offer a complete and
>> >> centralized view over the resources we have allocated in the whole
>> >> system and they allow building better policies about how to manage
>> >> them. One problem I see here, is that those policies are not fully
>> >> implemented, tuned or just got outdated, removing one of the highest
>> >> beneficial that we have by making vnodes so generic
>> >>
>> >> About the thing I mentioned myself:
>> >> - As long as the same path now has both range-locking and vnode
>> >> locking I don't see as a good idea to keep both separated forever.
>> >> Merging them seems to me an important evolution (not only helping
>> >> shrinking the number of primitives themselves but also introducing
>> >> less overhead and likely rewamped scalability for vnodes (but I think
>> >> this needs a deep investigation).
>> > The proper direction to move there is to designate the vnode lock for
>> > the vnode structure protection, and have the range lock protect the
>> > i/o atomicity. This is somewhat done in the proposed patch (since
>> > now vnode lock does not protect the i/o operation, but only chunked
>> > i/o transactions inside the operation).
>> >
>> > The Jeff idea of using page cache as the source of i/o data (implemented
>> > in the VM6 patchset) pushes the idea much further. E.g., the write
>> > does not obtain the write vnode lock typically (but sometimes it had,
>> > to extend the vnode).
>> >
>> > Probably, I will revive VM6 after this change is landed.
>>
>> About that I guess we might be careful.
>> The first thing would be having a very scalable VM subsystem and
>> recent benchmarks have shown that this is not yet the case (Florian,
>> CC'ed, can share some pmc/LOCK_PROFILE analysis on pgsql that, also
>> with the vmcontention patch, shows a lot on contention on vm_object,
>> pmap lock and vm_page_queue_lock. We have some plans for every of
>> them, we will discuss on a separate thread if you prefer). This is
>> just to say, that we may need more work in underground areas to bring
>> VM6 to the point it will really make a difference.
>
> The benchmarks that were done at that time demonstrated that VM6 do not
> cause regressions for e.g. buildworld time, and have a margin improvements,
> around 10%, for some postgresql loads.
>
> Main benefit of the VM6 on UFS is that writers no longer block readers
> for separate i/o ranges. Also, due to vm_page flags locking improvements,
> I suspect the VM6 backpressure code might be simplified and give even
> larger benefit right now.
>
> Anyway, I do not think that VM6 can be put into HEAD quickly, and I want
> to finish with VM1/prefaulting right now.

I was speaking about a different benchmark.
Florian made a lock_profile/hwpmc analysis on stock + vmcontention
patch for verifying where the biggest bottlenecks are.
Of course, it turns out that the most contended locks are all the ones
involved in VM, which is not surprising at all.

He can share numbers and insight I guess.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Sun Feb 26 14:22:07 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 941C81065670;
	Sun, 26 Feb 2012 14:22:07 +0000 (UTC) (envelope-from flo@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 0D1138FC0C;
	Sun, 26 Feb 2012 14:22:07 +0000 (UTC)
Received: from nibbler-osx.fritz.box (localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.5/8.14.5) with ESMTP id q1QEM4dd009659;
	Sun, 26 Feb 2012 14:22:05 GMT (envelope-from flo@FreeBSD.org)
Message-ID: <4F4A400C.1030606@FreeBSD.org>
Date: Sun, 26 Feb 2012 15:22:04 +0100
From: Florian Smeets <flo@FreeBSD.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7;
	rv:11.0) Gecko/20120216 Thunderbird/11.0
MIME-Version: 1.0
To: Attilio Rao <attilio@FreeBSD.org>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225210339.GM55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndDZpDXqDRR=kT_eQcHbeg3vdiUjnygy1=QLvVuumUsgBw@mail.gmail.com>
	<20120226141334.GU55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndAme7Joe1hd05VbvmA4C7_9q26ZQncBKtdBBjGawHqrHQ@mail.gmail.com>
In-Reply-To: <CAJ-FndAme7Joe1hd05VbvmA4C7_9q26ZQncBKtdBBjGawHqrHQ@mail.gmail.com>
X-Enigmail-Version: 1.4a1pre
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature";
	boundary="------------enigBB70FF77484EE4C06FD5CE12"
Cc: Konstantin Belousov <kostikbel@gmail.com>, arch@FreeBSD.org,
	Pawel Jakub Dawidek <pjd@FreeBSD.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 26 Feb 2012 14:22:07 -0000

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enigBB70FF77484EE4C06FD5CE12
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 26.02.12 15:16, Attilio Rao wrote:
> On 26 February 2012 14:13, Konstantin Belousov <kostikbel@gmail.com> wrote:
>> On Sun, Feb 26, 2012 at 03:02:54PM +0100, Attilio Rao wrote:
>>> On 25 February 2012 22:03, Konstantin Belousov <kostikbel@gmail.com> wrote:
>>>> On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
>>>>> On 25 February 2012 16:13, Pawel Jakub Dawidek <pjd@freebsd.org> wrote:
>>>>>> On Sat, Feb 25, 2012 at 01:01:32PM +0000, Attilio Rao wrote:
>>>>>>> On 03 February 2012 19:37, Konstantin Belousov <kostikbel@gmail.com> wrote:
>>>>>>>> FreeBSD I/O infrastructure has well known issue with deadlock caused
>>>>>>>> by vnode lock order reversal when buffers supplied to read(2) or
>>>>>>>> write(2) syscalls are backed by mmaped file.
>>>>>>>>
>>>>>>>> I previously published the patches to convert i/o path to use VMIO,
>>>>>>>> based on the Jeff Roberson proposal, see
>>>>>>>> http://wiki.freebsd.org/VM6. As a side effect, the VM6 fixed the
>>>>>>>> deadlock. Since that work is very intrusive and did not got any
>>>>>>>> follow-up, it get stalled.
>>>>>>>>
>>>>>>>> Below is very lightweight patch which only goal is to fix deadlock in
>>>>>>>> the least intrusive way. This is possible after FreeBSD got the
>>>>>>>> vm_fault_quick_hold_pages(9) and vm_fault_disable_pagefaults(9) KPIs.
>>>>>>>> http://people.freebsd.org/~kib/misc/vm1.3.patch
>>>>>>>
>>>>>>> Hi,
>>>>>>> I was reviewing:
>>>>>>> http://people.freebsd.org/~kib/misc/vm1.11.patch
>>>>>>>
>>>>>>> and I think it is great. It is simple enough and I don't have further
>>>>>>> comments on it.
>>>> Thank you.
>>>>
>>>> This spoiled an announce I intended to send this weekend :)
>>>>
>>>>>>>
>>>>>>> However, as a side note, I was thinking if we could get one day at the
>>>>>>> point to integrate rangelocks into vnodes lockmgr directly.
>>>>>>> It would be a huge patch, rewrtiting the locking of several members of
>>>>>>> vnodes likely, but I think it would be worth it in terms of cleaness
>>>>>>> of the interface and less overhead. Also, it would be interesting to
>>>>>>> consider merging rangelock implementation in ZFS' one, at some point.
>>>>>>
>>>>>> I personal opinion about rangelocks and many other VFS features we
>>>>>> currently have is that it is good idea in theory, but in practise it
>>>>>> tends to overcomplicate VFS.
>>>>>>
>>>>>> I'm in opinion that we should move as much stuff as we can to individual
>>>>>> file systems. We try to implement everything in VFS itself in hope that
>>>>>> this will simplify file systems we have. It then turns out only one file
>>>>>> system is really using this stuff (most of the time it is UFS) and this
>>>>>> is PITA for all the other file systems as well as maintaining VFS. VFS
>>>>>> became so complicated over the years that there are maybe few people
>>>>>> that can understand it, and every single change to VFS is a huge risk of
>>>>>> potentially breaking some unrelated parts.
>>>>>
>>>>> I think this is questionable due to the following assets:
>>>>> - If the problem is filesystems writers having trouble in
>>>>> understanding the necessary locking we should really provide cleaner
>>>>> and more complete documentation. One would think the same with our VM
>>>>> subsystem, but at least in that case there is plenty of comments that
>>>>> help understanding how to deal with vm_object, vm_pages locking during
>>>>> their lifelines.
>>>>> - Our primitives may be more complicated than the
>>>>> 'all-in-the-filesystem' one, but at least they offer a complete and
>>>>> centralized view over the resources we have allocated in the whole
>>>>> system and they allow building better policies about how to manage
>>>>> them. One problem I see here, is that those policies are not fully
>>>>> implemented, tuned or just got outdated, removing one of the highest
>>>>> beneficial that we have by making vnodes so generic
>>>>>
>>>>> About the thing I mentioned myself:
>>>>> - As long as the same path now has both range-locking and vnode
>>>>> locking I don't see as a good idea to keep both separated forever.
>>>>> Merging them seems to me an important evolution (not only helping
>>>>> shrinking the number of primitives themselves but also introducing
>>>>> less overhead and likely rewamped scalability for vnodes (but I think
>>>>> this needs a deep investigation).
>>>> The proper direction to move there is to designate the vnode lock for
>>>> the vnode structure protection, and have the range lock protect the
>>>> i/o atomicity. This is somewhat done in the proposed patch (since
>>>> now vnode lock does not protect the i/o operation, but only chunked
>>>> i/o transactions inside the operation).
>>>>
>>>> The Jeff idea of using page cache as the source of i/o data (implemented
>>>> in the VM6 patchset) pushes the idea much further. E.g., the write
>>>> does not obtain the write vnode lock typically (but sometimes it had,
>>>> to extend the vnode).
>>>>
>>>> Probably, I will revive VM6 after this change is landed.
>>>
>>> About that I guess we might be careful.
>>> The first thing would be having a very scalable VM subsystem and
>>> recent benchmarks have shown that this is not yet the case (Florian,
>>> CC'ed, can share some pmc/LOCK_PROFILE analysis on pgsql that, also
>>> with the vmcontention patch, shows a lot on contention on vm_object,
>>> pmap lock and vm_page_queue_lock. We have some plans for every of
>>> them, we will discuss on a separate thread if you prefer). This is
>>> just to say, that we may need more work in underground areas to bring
>>> VM6 to the point it will really make a difference.
>>
>> The benchmarks that were done at that time demonstrated that VM6 do not
>> cause regressions for e.g. buildworld time, and have a margin improvements,
>> around 10%, for some postgresql loads.
>>
>> Main benefit of the VM6 on UFS is that writers no longer block readers
>> for separate i/o ranges. Also, due to vm_page flags locking improvements,
>> I suspect the VM6 backpressure code might be simplified and give even
>> larger benefit right now.
>>
>> Anyway, I do not think that VM6 can be put into HEAD quickly, and I want
>> to finish with VM1/prefaulting right now.
>
> I was speaking about a different benchmark.
> Florian made a lock_profile/hwpmc analysis on stock + vmcontention
> patch for verifying where the biggest bottlenecks are.
> Of course, it turns out that the most contended locks are all the ones
> involved in VM, which is not surprising at all.
>
> He can share numbers and insight I guess.

All I have done so far is run PostgreSQL with 128 client threads with
lock_profiling [1] and hwpmc [2]. I haven't spent any time analyzing
this yet.

[1] http://people.freebsd.org/~flo/vmc-lock-profiling-postgres-128-20120208.txt
[2] http://people.freebsd.org/~flo/vmc-hwpmc-gprof-postgres-128-20120208.txt


--------------enigBB70FF77484EE4C06FD5CE12
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----

iEYEARECAAYFAk9KQAwACgkQapo8P8lCvwl0mgCg2+4H30fWR7qt3g6iIxlYN28W
iNIAn2b6unvHqHukMX+Tdp8rtgn/4TP2
=jfVO
-----END PGP SIGNATURE-----

--------------enigBB70FF77484EE4C06FD5CE12--

From owner-freebsd-arch@FreeBSD.ORG  Sun Feb 26 14:52:22 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8926C1065672;
	Sun, 26 Feb 2012 14:52:22 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200])
	by mx1.freebsd.org (Postfix) with ESMTP id 01CA58FC08;
	Sun, 26 Feb 2012 14:52:21 +0000 (UTC)
Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q1QEqHsn071070;
	Sun, 26 Feb 2012 16:52:17 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id
	q1QEqHNc027413; Sun, 26 Feb 2012 16:52:17 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q1QEqH7W027412; 
	Sun, 26 Feb 2012 16:52:17 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to
	kostikbel@gmail.com using -f
Date: Sun, 26 Feb 2012 16:52:17 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: arch@freebsd.org
Message-ID: <20120226145217.GV55074@deviant.kiev.zoral.com.ua>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="UjTPyfxZRWWsQMwo"
Content-Disposition: inline
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	skuns.kiev.zoral.com.ua
Cc: 
Subject: Prefaulting for i/o buffers: v2.0
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 26 Feb 2012 14:52:22 -0000


--UjTPyfxZRWWsQMwo
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

I started a new thread since I do not want this message to interfere
with some side discussions caused by my initial letter.

I continued development and refinement of the original patch. The
latest version is available at
http://people.freebsd.org/~kib/misc/vm1.11.patch.

Since the first announcement, intermediate versions of the patch were
reviewed by attilio, mdf and pjd.  A version of the patch was tested
and benchmarked by flo, which apparently showed no difference for a
postgresql benchmark.

I fixed several shameful bugs; in particular, buildworld now
successfully finishes on the patched kernel with the working directory
on both UFS and newnfs mounts. Apparently, bsdar(1) provides an
excellent functional test for the patch.

The most significant difference from previous variants is that the use
of prefaulting is now opt-in. I discovered that a typical filesystem
does not handle uiomove() errors gracefully. Only UFS and newnfs are
switched to use prefaulting.

The newnfs client was changed to properly handle uiomove() failures
and to not cause user data loss on EFAULT (this also applies to
the stock svn sources). The corresponding changes were reviewed by
rmacklem.

My own feeling is that vm1.11.patch is ready to be committed. This is a
notification to allow more people to take a look and provide
pre-commit opinions. Thanks.

--UjTPyfxZRWWsQMwo
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAk9KRyEACgkQC3+MBN1Mb4gGIQCgpj2DAyn+TCL0qAeENcoSahWs
wG0Anj/7RrFNMnQsngTAokRw27yduMk3
=sknu
-----END PGP SIGNATURE-----

--UjTPyfxZRWWsQMwo--

From owner-freebsd-arch@FreeBSD.ORG  Wed Feb 29 06:41:08 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C5D2C1065679
	for <freebsd-arch@freebsd.org>; Wed, 29 Feb 2012 06:41:08 +0000 (UTC)
	(envelope-from delphij@delphij.net)
Received: from anubis.delphij.net (anubis.delphij.net
	[IPv6:2001:470:1:117::25])
	by mx1.freebsd.org (Postfix) with ESMTP id AB8A28FC17
	for <freebsd-arch@freebsd.org>; Wed, 29 Feb 2012 06:41:08 +0000 (UTC)
Received: from delta.delphij.net (unknown
	[IPv6:2001:470:83bf:0:221:5cff:fe6a:37bb])
	(using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
	(No client certificate requested)
	by anubis.delphij.net (Postfix) with ESMTPSA id 14F3AF9A5;
	Tue, 28 Feb 2012 22:41:08 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=delphij.net; s=anubis;
	t=1330497668; bh=tKYlgOcAgDQ6XcYBEJ6T/VLDqQIabAuSUz1m1UAFerA=;
	h=Message-ID:Date:From:Reply-To:MIME-Version:To:CC:Subject:
	Content-Type:Content-Transfer-Encoding;
	b=nCRw5QIXTTzPt2NWSKibm7+SPUlmzzyOUex9v3e8tI2FJ7PsIw1gmL6Kzc8mkOuzY
	ZVPykFN4skpC4lzjxW0JWZMl1kqwgfIUCRHyPqje4KITGrTGyWbAoii/9LxduKoK60
	91cV8PELmMHEYMCugqRG3FK8L26yN6pexPTOkiGw=
Message-ID: <4F4DC876.3010809@delphij.net>
Date: Tue, 28 Feb 2012 22:40:54 -0800
From: Xin Li <delphij@delphij.net>
Organization: The FreeBSD Project
MIME-Version: 1.0
To: freebsd-arch@freebsd.org
X-Enigmail-Version: 1.3.5
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: d@delphij.net
Subject: RFC: futimens(2) and utimensat(2)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: d@delphij.net
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Feb 2012 06:41:08 -0000

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi,

These are required by IEEE Std 1003.1-2008.  Patchset at:

http://people.freebsd.org/~delphij/for_review/utimens.diff

Cheers,
- -- 
Xin LI <delphij@delphij.net>	https://www.delphij.net/
FreeBSD - The Power to Serve!		Live free or die
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (FreeBSD)

iQEcBAEBCAAGBQJPTch1AAoJEG80Jeu8UPuz75sH/3nyv6Lgdfa5MoF335u6H9zS
kjyBroOlV3pJ/2V2d+77fw/qt5PmMG+jPwVwCrQ55+ZuntG9wvrT+UNnY67lzV55
/otFzF8a6onvpe8HSX7JJOh6neeN8njQzfJDClbDPFJKKm778Qfebjes1s0zk1tp
JOvCf8bstXy02s0833sRW3HsfOh19f2KEPmKo2PXwgSrTGsLOWQqS7heFhszY5Hi
woRkxs9RYRzs1i3MzkBSDYB+KTOV6H+SUBln6w/HudHMBjvdvlUxpEpHjOzqbhax
bDE4QDljWY+3WK71Y48zEoEWO1P+jrbyciceIAWNF4RKmjSMeHMbnnTCFZFe+ZE=
=FUtH
-----END PGP SIGNATURE-----

From owner-freebsd-arch@FreeBSD.ORG  Wed Feb 29 11:51:23 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B33111065673;
	Wed, 29 Feb 2012 11:51:23 +0000 (UTC)
	(envelope-from pluknet@gmail.com)
Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com
	[209.85.215.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 02C0D8FC0A;
	Wed, 29 Feb 2012 11:51:22 +0000 (UTC)
Received: by lagv3 with SMTP id v3so490337lag.13
	for <multiple recipients>; Wed, 29 Feb 2012 03:51:21 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type:content-transfer-encoding;
	bh=bntZeHk/+b/QEeAFwvT25WYIoly9nbR/TtiQT3Zr+Ek=;
	b=ZnJsDcD4bH2/5eBEGDtCIdAb+Ey33uUeZ5NSRjMOK1x8/+YzUeOFx9RQFdBzoqpHEC
	bieC7ZFA1wGu+xtT+MR8/khf+pvQ5sh99wWcfKbv+Tl2IsMBHrLL0TmKrLbIwycifH13
	S0cyg8K+IAEdr84siHX4kC/veIEhF0QCRw/Q8=
MIME-Version: 1.0
Received: by 10.152.135.148 with SMTP id ps20mr14686919lab.20.1330514483732;
	Wed, 29 Feb 2012 03:21:23 -0800 (PST)
Received: by 10.152.108.204 with HTTP; Wed, 29 Feb 2012 03:21:23 -0800 (PST)
In-Reply-To: <4F4DC876.3010809@delphij.net>
References: <4F4DC876.3010809@delphij.net>
Date: Wed, 29 Feb 2012 14:21:23 +0300
Message-ID: <CAE-mSOJU=hm8+-AC_oQmx+h2grv7PGaH7kNYKoT3GMePDPXsYg@mail.gmail.com>
From: Sergey Kandaurov <pluknet@gmail.com>
To: d@delphij.net
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: Jilles Tjoelker <jilles@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: RFC: futimens(2) and utimensat(2)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Feb 2012 11:51:23 -0000

On 29 February 2012 10:40, Xin Li <delphij@delphij.net> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Hi,
>
> These are required by IEEE Std 1003.1-2008.  Patchset at:
>
> http://people.freebsd.org/~delphij/for_review/utimens.diff
>

First, thank you very much for doing this.

The ERRORS section for utimes(2) is still not updated (the update is
missing). Funnily enough, that was the most difficult part of
implementing these syscalls a year ago, with great help from jilles@.
He could comment further on your patchset.

Otherwise it looks good and pretty similar to my work, though
for some reason I didn't use the "const" modifier in my version, either
for the functions or for the syscall definitions in syscalls.master.

I also wrote a test to see how well the implementation detects
EACCES/EPERM with different UTIME_OMIT/UTIME_NOW values passed. It
should pass all tests as shown in the table (taken from somewhere in
austingroup):

  [a]    [b]      [c]
 times  file     file
 arg.    UID      is
 NULL   owner   writable        Result
 !NULL  !owner  !writable

 N      o          w            success
 N      o          !w           success
 N      !o         w            success
 N      !o         !w           EACCES [1]
 !N     o          w            success
 !N     o          !w           success
 !N     !o         w            EPERM [2]
 !N     !o         !w           EPERM [3]

Here NULL also covers cases when:
- both fields are UTIME_NOW
- both fields are UTIME_OMIT.
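
The table can be restated as a small decision function. This is only a
sketch of the expected results, not code from either patch: the name
utimens_expected() and its three boolean parameters are made up for
illustration, and privileged callers are ignored.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: expected result for an unprivileged caller,
 * following the table above.  null_like is true when the times argument
 * is NULL or is treated like NULL.
 */
static int
utimens_expected(bool null_like, bool owner, bool writable)
{

	if (owner)
		return (0);		/* the owner may always set the times */
	if (!null_like)
		return (EPERM);		/* explicit times require ownership */
	return (writable ? 0 : EACCES);	/* "now" needs write access */
}
```

Each row of the table corresponds to one call, e.g. the fourth row is
utimens_expected(true, false, false).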

OK, let's see how it does:

1) Given: UTIME_NOW UTIME_NOW o w
gives: success
expected: success

2) Given: UTIME_NOW UTIME_NOW o !w
gives: success
expected: success

3) Given: UTIME_NOW UTIME_NOW !o w
gives: EPERM
expected: success

4) Given: UTIME_NOW UTIME_NOW !o !w
gives: EPERM
expected: EACCES

5) Given: (NULL) (NULL) o w
gives: success
expected: success

6) Given: (NULL) (NULL) o !w
gives: success
expected: success

7) Given: (NULL) (NULL) !o w
gives: success
expected: success

8) Given: (NULL) (NULL) !o !w
gives: EACCES
expected: EACCES

9) Given: (number) (number) o w
gives: success
expected: success

10) Given: (number) (number) o !w
gives: success
expected: success

11) Given: (number) (number) !o w
gives: EPERM
expected: EPERM

12) Given: (number) (number) !o !w
gives: EPERM
expected: EPERM

So your version doesn't treat the case where both fields are
UTIME_NOW as a special case in which the caller must be
granted the same privileges as when a NULL times pointer
is passed. My version handles this.

Your version calls vfs_timestamp() twice, in different
condition branches. It could be done just once.

My version of getutimens() is more complicated but it handles
the case with both UTIME_NOW.
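
For readers without the diff at hand, the classification can be sketched
in userland roughly as below. This is a simplified illustration, not the
kernel code: classify_times() and the TS_* flags are hypothetical names,
and the copyin, timestamp and range checks are left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/stat.h>	/* UTIME_NOW, UTIME_OMIT */
#include <time.h>

#define	TS_NULL	0x1	/* treat as if times == NULL (hypothetical flag) */
#define	TS_EXIT	0x2	/* nothing to change (hypothetical flag) */

/* Classify a times[2] argument; returns a mask of TS_* flags. */
static int
classify_times(const struct timespec *ts)
{
	int flags = 0;

	if (ts == NULL)
		return (TS_NULL);
	if (ts[0].tv_nsec == UTIME_OMIT && ts[1].tv_nsec == UTIME_OMIT)
		flags |= TS_EXIT;
	if (ts[0].tv_nsec == UTIME_NOW && ts[1].tv_nsec == UTIME_NOW)
		flags |= TS_NULL;
	return (flags);
}
```

A mixed pair, such as one explicit time plus one UTIME_NOW, sets neither
flag and so gets the stricter owner-only check.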

This is the older version, last discussed with jilles.
It lacks the man page update and the compat32 parts (both have been
done since then, except for the missing ERRORS section in utimes(2);
e.g. my compat32 version is just like yours :)).
I started to commit my version (see r227447) but
stopped because of the missing ERRORS section, my lack of English to
rewrite the utimes(2) man page, and the overly complicated and wrong
ERRORS section in the existing utimes(2).

http://plukky.net/~pluknet/patches/utimes.2008.3.diff

It is pretty similar to yours, except that I did getutimens() a bit
differently. I had to introduce this complication to pass all the tests.
Take note of the private flags UTIMENS_NULL and UTIMENS_EXIT.


Index: sys/kern/vfs_syscalls.c
===================================================================
--- sys/kern/vfs_syscalls.c	(revision 220831)
+++ sys/kern/vfs_syscalls.c	(working copy)
@@ -94,6 +94,8 @@

 static int chroot_refuse_vdir_fds(struct filedesc *fdp);
 static int getutimes(const struct timeval *, enum uio_seg, struct timespec *);
+static int getutimens(const struct timespec *, enum uio_seg,
+    struct timespec *, int *);
 static int setfown(struct thread *td, struct vnode *, uid_t, gid_t);
 static int setfmode(struct thread *td, struct vnode *, int);
 static int setfflags(struct thread *td, struct vnode *, int);
@@ -3162,9 +3164,61 @@
 }

 /*
- * Common implementation code for utimes(), lutimes(), and futimes().
+ * Common implementation code for futimens(), utimensat().
  */
+#define	UTIMENS_NULL	0x1
+#define	UTIMENS_EXIT	0x2
 static int
+getutimens(usrtsp, tspseg, tsp, retflags)
+	const struct timespec *usrtsp;
+	enum uio_seg tspseg;
+	struct timespec *tsp;
+	int *retflags;
+{
+	int error;
+	struct timespec tsnow;
+
+	vfs_timestamp(&tsnow);
+	*retflags = 0;
+	if (usrtsp == NULL) {
+		tsp[0] = tsnow;
+		tsp[1] = tsnow;
+		*retflags |= UTIMENS_NULL;
+		return (0);
+	}
+	if (tspseg == UIO_SYSSPACE) {
+		tsp[0] = usrtsp[0];
+		tsp[1] = usrtsp[1];
+	} else if ((error = copyin(usrtsp, tsp, sizeof(*tsp) * 2)) != 0)
+			return (error);
+
+	if (tsp[0].tv_nsec == UTIME_OMIT && tsp[1].tv_nsec == UTIME_OMIT)
+		*retflags |= UTIMENS_EXIT;
+	if (tsp[0].tv_nsec == UTIME_NOW && tsp[1].tv_nsec == UTIME_NOW)
+		*retflags |= UTIMENS_NULL;
+
+	if (tsp[0].tv_nsec == UTIME_OMIT)
+		tsp[0].tv_sec = VNOVAL;
+	else if (tsp[0].tv_nsec == UTIME_NOW)
+		tsp[0] = tsnow;
+	else if (tsp[0].tv_nsec < 0 || tsp[0].tv_nsec >= 1000000000L)
+		return (EINVAL);
+
+	if (tsp[1].tv_nsec == UTIME_OMIT)
+		tsp[1].tv_sec = VNOVAL;
+	else if (tsp[1].tv_nsec == UTIME_NOW)
+		tsp[1] = tsnow;
+	else if (tsp[1].tv_nsec < 0 || tsp[1].tv_nsec >= 1000000000L)
+		return (EINVAL);
+
+	return (0);
+}
+
+/*
+ * Common implementation code for utimes(), lutimes(), futimes(), futimens(),
+ * and utimensat().
+ */
+static int
 setutimes(td, vp, ts, numtimes, nullflag)
 	struct thread *td;
 	struct vnode *vp;
@@ -3362,6 +3416,94 @@
 	return (error);
 }

+#ifndef _SYS_SYSPROTO_H_
+struct futimens_args {
+	int fd;
+	struct timespec *times;
+};
+#endif
+int
+futimens(struct thread *td, struct futimens_args *uap)
+{
+
+	return (kern_futimens(td, uap->fd, uap->times, UIO_USERSPACE));
+}
+
+int
+kern_futimens(struct thread *td, int fd, struct timespec *tptr,
+    enum uio_seg tptrseg)
+{
+	struct timespec ts[2];
+	struct file *fp;
+	int error, flags, vfslocked;
+
+	AUDIT_ARG_FD(fd);
+	if ((error = getutimens(tptr, tptrseg, ts, &flags)) != 0)
+		return (error);
+	if (flags & UTIMENS_EXIT)
+		return (0);
+	if ((error = getvnode(td->td_proc->p_fd, fd, &fp)) != 0)
+		return (error);
+	vfslocked = VFS_LOCK_GIANT(fp->f_vnode->v_mount);
+#ifdef AUDIT
+	vn_lock(fp->f_vnode, LK_SHARED | LK_RETRY);
+	AUDIT_ARG_VNODE1(fp->f_vnode);
+	VOP_UNLOCK(fp->f_vnode, 0);
+#endif
+	error = setutimes(td, fp->f_vnode, ts, 2, flags & UTIMENS_NULL);
+	VFS_UNLOCK_GIANT(vfslocked);
+	fdrop(fp, td);
+	return (error);
+}
+
+#ifndef _SYS_SYSPROTO_H_
+struct utimensat_args {
+	int fd;
+	const char *path;
+	const struct timespec *times;
+	int flag;
+};
+#endif
+int
+utimensat(struct thread *td, struct utimensat_args *uap)
+{
+
+	return (kern_utimensat(td, uap->fd, uap->path, UIO_USERSPACE,
+	    uap->times, UIO_USERSPACE, uap->flag));
+}
+
+int
+kern_utimensat(struct thread *td, int fd, char *path, enum uio_seg pathseg,
+    struct timespec *tptr, enum uio_seg tptrseg, int flag)
+{
+	struct nameidata nd;
+	struct timespec ts[2];
+	int error, flags, vfslocked;
+
+	if (flag & ~AT_SYMLINK_NOFOLLOW)
+		return (EINVAL);
+
+	if ((error = getutimens(tptr, tptrseg, ts, &flags)) != 0)
+		return (error);
+	NDINIT_AT(&nd, LOOKUP, ((flag & AT_SYMLINK_NOFOLLOW) ? NOFOLLOW :
+	    FOLLOW) | MPSAFE | AUDITVNODE1, pathseg, path, fd, td);
+	if ((error = namei(&nd)) != 0)
+		return (error);
+	/*
+	 * We are allowed to call namei() regardless of 2xUTIME_OMIT.
+	 * POSIX states:
+	 * "If both tv_nsec fields are UTIME_OMIT... EACCESS may be detected."
+	 * "Search permission is denied by a component of the path prefix."
+	 */
+	vfslocked = NDHASGIANT(&nd);
+	NDFREE(&nd, NDF_ONLY_PNBUF);
+	if ((flags & UTIMENS_EXIT) == 0)
+		error = setutimes(td, nd.ni_vp, ts, 2, flags & UTIMENS_NULL);
+	vrele(nd.ni_vp);
+	VFS_UNLOCK_GIANT(vfslocked);
+	return (error);
+}
+
 /*
  * Truncate a file given its path name.
  */


-- 
wbr,
pluknet

From owner-freebsd-arch@FreeBSD.ORG  Wed Feb 29 12:04:47 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 32A4C106566B
	for <freebsd-arch@freebsd.org>; Wed, 29 Feb 2012 12:04:47 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200])
	by mx1.freebsd.org (Postfix) with ESMTP id 9521A8FC1B
	for <freebsd-arch@freebsd.org>; Wed, 29 Feb 2012 12:04:46 +0000 (UTC)
Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q1TC4VLB084001;
	Wed, 29 Feb 2012 14:04:31 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id
	q1TC4Vrd094904; Wed, 29 Feb 2012 14:04:31 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q1TC4V88094903; 
	Wed, 29 Feb 2012 14:04:31 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to
	kostikbel@gmail.com using -f
Date: Wed, 29 Feb 2012 14:04:31 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: d@delphij.net
Message-ID: <20120229120431.GX55074@deviant.kiev.zoral.com.ua>
References: <4F4DC876.3010809@delphij.net>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="3BXhlsXkTW/kybY4"
Content-Disposition: inline
In-Reply-To: <4F4DC876.3010809@delphij.net>
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	skuns.kiev.zoral.com.ua
Cc: freebsd-arch@freebsd.org
Subject: Re: RFC: futimens(2) and utimensat(2)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Feb 2012 12:04:47 -0000


--3BXhlsXkTW/kybY4
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Tue, Feb 28, 2012 at 10:40:54PM -0800, Xin Li wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>=20
> Hi,
>=20
> These are required by IEEE Std 1003.1-2008.  Patchset at:
>=20
> http://people.freebsd.org/~delphij/for_review/utimens.diff
The patch looks fine; I have only some stylistic comments.

You misordered the functions both in Symbol.map and in the man page.
The kern_utimensat() definition would benefit from making the second
line of the function shorter than 80 columns.
I suggest using a local struct vnode *vp variable instead of dereferencing
fp->f_vnode on each line.
Put the error and vfslocked declarations in kern_futimens() on the same line.

I do not see a need for the _SYS_SYSPROTO_H_ guard for new syscalls.
We always have sysproto.h.

Also, omitting the generated files from the patch would make it easier to
read.

--3BXhlsXkTW/kybY4
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAk9OFE8ACgkQC3+MBN1Mb4hykQCg27zu75p6Z/Uj4K7YlW6E6V0C
NNoAn3RtY6vmZj1K61oVfOQuM6c5trM3
=O1m5
-----END PGP SIGNATURE-----

--3BXhlsXkTW/kybY4--

From owner-freebsd-arch@FreeBSD.ORG  Wed Feb 29 13:08:43 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 26661106566C;
	Wed, 29 Feb 2012 13:08:43 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail09.syd.optusnet.com.au (mail09.syd.optusnet.com.au
	[211.29.132.190])
	by mx1.freebsd.org (Postfix) with ESMTP id AFCF98FC18;
	Wed, 29 Feb 2012 13:08:41 +0000 (UTC)
Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au
	(c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136])
	by mail09.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q1TD8SGV017994
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 1 Mar 2012 00:08:29 +1100
Date: Thu, 1 Mar 2012 00:08:28 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Sergey Kandaurov <pluknet@gmail.com>
In-Reply-To: <CAE-mSOJU=hm8+-AC_oQmx+h2grv7PGaH7kNYKoT3GMePDPXsYg@mail.gmail.com>
Message-ID: <20120229232250.G3812@besplex.bde.org>
References: <4F4DC876.3010809@delphij.net>
	<CAE-mSOJU=hm8+-AC_oQmx+h2grv7PGaH7kNYKoT3GMePDPXsYg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED; BOUNDARY="0-1303333959-1330520908=:3812"
Cc: Jilles Tjoelker <jilles@FreeBSD.org>, d@delphij.net,
	freebsd-arch@FreeBSD.org
Subject: Re: RFC: futimens(2) and utimensat(2)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Feb 2012 13:08:43 -0000

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--0-1303333959-1330520908=:3812
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Wed, 29 Feb 2012, Sergey Kandaurov wrote:

> On 29 February 2012 10:40, Xin Li <delphij@delphij.net> wrote:
>>
>> These are required by IEEE Std 1003.1-2008.  Patchset at:
>>
>> http://people.freebsd.org/~delphij/for_review/utimens.diff

I didn't look at this because it wasn't in the mail :-).

> This is the older version, last discussed with jilles.
> It lacks the man page update and the compat32 parts (both have been
> done since then, except for the missing ERRORS section in utimes(2);
> e.g. my compat32 version is just like yours :)).
> I started to commit my version (you can see r227447) but
> stopped because of the missing ERRORS section, my lack of English to
> rewrite the utimes(2) man page, and the overly complicated and wrong
> ERRORS section in the existing utimes(2).
>
> http://plukky.net/~pluknet/patches/utimes.2008.3.diff
>
> It is pretty similar to yours, except that I did getutimens() a bit
> differently.
> I had to introduce this complication to pass all tests.
> Take note of the private flags UTIMENS_NULL and UTIMENS_EXIT.
>
> Index: sys/kern/vfs_syscalls.c
> ===================================================================
> --- sys/kern/vfs_syscalls.c	(revision 220831)
> +++ sys/kern/vfs_syscalls.c	(working copy)
> ...
> static int
> +getutimens(usrtsp, tspseg, tsp, retflags)
> +	const struct timespec *usrtsp;
> +	enum uio_seg tspseg;
> +	struct timespec *tsp;
> +	int *retflags;

Should probably not use K&R function definitions in new code.

> +{
> +	int error;
> +	struct timespec tsnow;

Structs should be sorted before scalars (and pointers).

> +
> +	vfs_timestamp(&tsnow);

Not used in all paths.

> +	*retflags = 0;

Not used in all paths?

> +	if (usrtsp == NULL) {
> +		tsp[0] = tsnow;
> +		tsp[1] = tsnow;
> +		*retflags |= UTIMENS_NULL;
> +		return (0);
> +	}
> +	if (tspseg == UIO_SYSSPACE) {
> +		tsp[0] = usrtsp[0];
> +		tsp[1] = usrtsp[1];
> +	} else if ((error = copyin(usrtsp, tsp, sizeof(*tsp) * 2)) != 0)
> +			return (error);

Indentation.

> +

Extra blank line.  Many more of these below.

> +	if (tsp[0].tv_nsec == UTIME_OMIT && tsp[1].tv_nsec == UTIME_OMIT)
> +		*retflags |= UTIMENS_EXIT;
> +	if (tsp[0].tv_nsec == UTIME_NOW && tsp[1].tv_nsec == UTIME_NOW)
> +		*retflags |= UTIMENS_NULL;
> +
> +	if (tsp[0].tv_nsec == UTIME_OMIT)
> +		tsp[0].tv_sec = VNOVAL;

tsp[0].tv_nsec is not initialized (except it is UTIME_OMIT, which might
be the same as VNOVAL).  (The patch seems to be missing the header part
that defines UTIME_OMIT.)  Most setattr vnops are sloppy about checking
both tv_sec and tv_nsec, but VATTR_NULL() sets both to VNOVAL for
setattrs that don't request a time change.  More care is actually
required in the opposite direction -- getattr defaults va_birthtime's
tv_sec.tv_nsec to -1.0, so that when a getattr doesn't understand
birthtime it comes back unchanged as -1.0, which gives the error
value (time_t)-1.  All attributes for getattr should be defaulted like
this so that all file systems don't have to know about them, but only
va_birthtime, va_fsid and va_rdev are (all the others default to stack
garbage).

> +	else if (tsp[0].tv_nsec == UTIME_NOW)
> +		tsp[0] = tsnow;
> +	else if (tsp[0].tv_nsec < 0 || tsp[0].tv_nsec >= 1000000000L)
> +		return (EINVAL);
> +
> +	if (tsp[1].tv_nsec == UTIME_OMIT)
> +		tsp[1].tv_sec = VNOVAL;
> +	else if (tsp[1].tv_nsec == UTIME_NOW)
> +		tsp[1] = tsnow;
> +	else if (tsp[1].tv_nsec < 0 || tsp[1].tv_nsec >= 1000000000L)
> +		return (EINVAL);

Is it possible to extend this API to support birthtimes (and with more
security control, ctimes)?  Encoding more in tv_nsec should do it.
Certain magic values in tsp[1].tv_nsec  would indicate that there are
more than 2 entries in tsp[].  An extra copyin is needed to read the
extra entries (after reading tsp[1] to see if there are more).  Better
add this before the ABI solidifies.

This would have worked for utimes() too, with magic in tsp[1].tv_usec,
but this seems unnecessary now.

Bruce
--0-1303333959-1330520908=:3812--

From owner-freebsd-arch@FreeBSD.ORG  Wed Feb 29 19:41:20 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8B9301065675
	for <arch@freebsd.org>; Wed, 29 Feb 2012 19:41:20 +0000 (UTC)
	(envelope-from luigi@onelab2.iet.unipi.it)
Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238])
	by mx1.freebsd.org (Postfix) with ESMTP id 323EB8FC08
	for <arch@freebsd.org>; Wed, 29 Feb 2012 19:41:19 +0000 (UTC)
Received: by onelab2.iet.unipi.it (Postfix, from userid 275)
	id 6FCC17300B; Wed, 29 Feb 2012 20:40:42 +0100 (CET)
Date: Wed, 29 Feb 2012 20:40:42 +0100
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: arch@freebsd.org
Message-ID: <20120229194042.GA10921@onelab2.iet.unipi.it>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="EVF5PPMfhYS0aIcm"
Content-Disposition: inline
User-Agent: Mutt/1.4.2.3i
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: 
Subject: select/poll/usleep precision on FreeBSD vs Linux vs OSX
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Feb 2012 19:41:20 -0000


--EVF5PPMfhYS0aIcm
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

I have always been annoyed by the fact that FreeBSD rounds timeouts
in select/usleep/poll in very conservative ways, so I decided to
test how other systems behave in this respect. Attached is a simple
program that you should be able to compile and run on various OSes
to see what happens.

Here are the results (HZ=1000 on the system under test, and FreeBSD
has had the same behaviour since at least 4.11):

	        |            Actual timeout (usec)
	        |         select          | poll  | usleep |
	timeout | FBSD  | Linux | OSX     | FBSD  | FBSD   |
	(usec)  | 9.0   | Vbox  | 10.6    | 9.0   | 9.0    |
	--------+-------+-------+---------+-------+--------+
	      1 |  2000 |    99 |       6 |     0 |   2000 |
	     10 |  2000 |   109 |      15 |     0 |   2000 |
	     50 |  2000 |   149 |      66 |     0 |   2000 |
	    100 |  2000 |   196 |     133 |     0 |   2000 |
	    500 |  2000 |   597 |     617 |     0 |   2000 |
	   1000 |  2000 |  1103 |    1136 |  2000 |   2000 |
	   1001 |  3000 |  1103 |    1136 |  2000 |   3000 | <---
	   1500 |  3000 |  1608 |    1631 |  2000 |   3000 | <---
	   2000 |  3000 |  2096 |    2127 |  3000 |   3000 |
	   2001 |  4000 |       |         |  3000 |   4000 | <---
	   3001 |  5000 |       |         |  4000 |   5000 | <---


Note how the rounding (poll has the timeout in milliseconds) affects
the actual timeouts when you are past multiples of 1/HZ.

I know that until we have some hi-res interrupt source there is no
hope to have better than 1/HZ granularity. However we are doing
much worse by adding up to 2 extra ticks. This makes apps less
responsive than they could be, and gives us no way to
"yield until the next tick".

So what I would like to do is add a sysctl (disabled by
default) that enables a better approximation of the desired delay.

I see in the kernel that all three syscalls loop around a blocking
function (tsleep or seltdwait), and check the "actual" elapsed
time by calling getmicrouptime() or getnanouptime() around the
sleeping function.  So the actual timeout passed to tsleep does
not really matter (as long as it is greater than 0).

The only concern is that getmicrouptime()/getnanouptime() are documented
as "less precise, but faster to obtain". The question is how precise is
"less precise": do we have some way to get an upper bound for the
precision of the timers used in get*time(), so we can use that value
in the equation instead of the extra 1/HZ that tvtohz() puts in
after computing floor(timeout*HZ)?


For reference, below is the core of usleep and select/poll
(from kern_time.c and sys_generic.c)

    usleep:
	getnanouptime(now)
	end = now + timeout;
	for (;;) {
		getnanouptime(now);
		delta = end - now;
		if (delta <= 0)
			break;
		tsleep(..., tvtohz(delta) )
	}

    select/poll:
	itimerfix(timeout) // force at least 1/HZ
	getmicrouptime(now)
	end = now + timeout;
	for (;;) {
		delta = end - now;
		seltdwait(..., tvtohz(delta) )
		getmicrouptime(now);
		if (some_fd_is_ready() || now >= end)
			break;
	}
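
The FreeBSD select and usleep columns in the table are consistent with a
simple model: round the request up to whole ticks, then add one more tick.
Below is a minimal sketch of that model in C.  It is not the kernel's
tvtohz() itself; model_ticks(), model_sleep_us() and print_model() are
hypothetical names used only for illustration, and hz is assumed to divide
1000000 evenly.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Model of the tick rounding (an assumption fitted to the measurements):
 * round the request up to whole ticks, then add one more tick because the
 * sleep starts somewhere inside the current tick.
 */
long
model_ticks(long timeout_us, long hz)
{
	long tick_us;

	assert(timeout_us > 0 && hz > 0 && hz <= 1000000);
	tick_us = 1000000 / hz;
	return ((timeout_us + tick_us - 1) / tick_us + 1);
}

/* Longest sleep, in microseconds, that the model predicts. */
long
model_sleep_us(long timeout_us, long hz)
{

	return (model_ticks(timeout_us, hz) * (1000000 / hz));
}

/* Print the predicted sleep for each requested timeout in the table. */
void
print_model(long hz)
{
	static const long req[] = {
		1, 10, 50, 100, 500, 1000, 1001, 1500, 2000, 2001, 3001
	};
	size_t i;

	for (i = 0; i < sizeof(req) / sizeof(req[0]); i++)
		printf("%6ld us -> %6ld us\n", req[i],
		    model_sleep_us(req[i], hz));
}
```

With hz = 1000, model_sleep_us(1000, 1000) gives 2000 and
model_sleep_us(1001, 1000) gives 3000, which matches the 1000 and 1001 rows
of the table.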

---

cheers
luigi

--EVF5PPMfhYS0aIcm--

From owner-freebsd-arch@FreeBSD.ORG  Wed Feb 29 20:55:20 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0FBDC106566C
	for <freebsd-arch@freebsd.org>; Wed, 29 Feb 2012 20:55:20 +0000 (UTC)
	(envelope-from mavbsd@gmail.com)
Received: from mail-bk0-f54.google.com (mail-bk0-f54.google.com
	[209.85.214.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 886938FC08
	for <freebsd-arch@freebsd.org>; Wed, 29 Feb 2012 20:55:19 +0000 (UTC)
Received: by bkcjc3 with SMTP id jc3so4715377bkc.13
	for <freebsd-arch@freebsd.org>; Wed, 29 Feb 2012 12:55:03 -0800 (PST)
Received-SPF: pass (google.com: domain of mavbsd@gmail.com designates
	10.205.135.132 as permitted sender) client-ip=10.205.135.132; 
Authentication-Results: mr.google.com;
	spf=pass (google.com: domain of mavbsd@gmail.com
	designates 10.205.135.132 as permitted sender)
	smtp.mail=mavbsd@gmail.com; dkim=pass header.i=mavbsd@gmail.com
Received: from mr.google.com ([10.205.135.132])
	by 10.205.135.132 with SMTP id ig4mr1154425bkc.20.1330548903811
	(num_hops = 1); Wed, 29 Feb 2012 12:55:03 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject
	:references:in-reply-to:content-type:content-transfer-encoding;
	bh=T5Im6v+xo0fI0dI+03awtElalqqQwepSlZObycirdlo=;
	b=fa3o4zqZJzdDRaznhAmZn4eXOICsphMEFcWxZMGqDQiPZjdVsE3JcRtsP/Z1NL8j3q
	0rOMiACY/L+vDOs0zWIqpoFVRaiU9iBhj798yIZtFtdwyRyZ+F908mLA7w+Omwo+ffkp
	qToTeKhUS41YPFFvAyKCMOjKaYGXBCcQ2Ke4g=
Received: by 10.205.135.132 with SMTP id ig4mr901293bkc.20.1330547439306;
	Wed, 29 Feb 2012 12:30:39 -0800 (PST)
Received: from mavbook.mavhome.dp.ua (pc.mavhome.dp.ua. [212.86.226.226])
	by mx.google.com with ESMTPS id x22sm39515997bkw.11.2012.02.29.12.30.37
	(version=SSLv3 cipher=OTHER); Wed, 29 Feb 2012 12:30:38 -0800 (PST)
Sender: Alexander Motin <mavbsd@gmail.com>
Message-ID: <4F4E8AE4.6080705@FreeBSD.org>
Date: Wed, 29 Feb 2012 22:30:28 +0200
From: Alexander Motin <mav@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
	rv:9.0) Gecko/20120116 Thunderbird/9.0
MIME-Version: 1.0
To: Luigi Rizzo <rizzo@iet.unipi.it>
References: <mailpost.1330544498.4118939.37162.mailing.freebsd.arch@FreeBSD.cs.nctu.edu.tw>
In-Reply-To: <mailpost.1330544498.4118939.37162.mailing.freebsd.arch@FreeBSD.cs.nctu.edu.tw>
Content-Type: text/plain; charset=KOI8-R; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-arch@FreeBSD.org
Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Feb 2012 20:55:20 -0000

On 29.02.2012 21:40, Luigi Rizzo wrote:
> I have always been annoyed by the fact that FreeBSD rounds timeouts
> in select/usleep/poll in very conservative ways, so i decided to
> try how other systems behave in this respect. Attached is a simple
> program that you should be able to compile and run on various OS
> and see what happens.
>
> Here are the results (HZ=1000 on the system under test, and FreeBSD
> has the same behaviour since at least 4.11):
>
> 	        |    Actual timeout
>                  |      select            | poll  | usleep|
> 	timeout | FBSD  | Linux | OSX    | FBSD  | FBSD  |
> 	usec    | 9.0   | Vbox  | 10.6   |  9.0  |  9.0  |
> 	--------+-------+-------+--------+-------+-------+
> 	    1      2000      99       6     0      2000
> 	   10      2000     109      15     0      2000
> 	   50      2000     149      66     0      2000
> 	  100      2000     196     133     0      2000
> 	  500      2000     597     617     0      2000
> 	 1000      2000    1103    1136    2000    2000
> 	 1001      3000    1103    1136    2000    3000<---
> 	 1500      3000    1608    1631    2000    3000<---
>           2000	   3000    2096    2127    3000    3000
> 	 2001	   4000                    3000    4000<---
> 	 3001	   5000                    4000    5000<---
>
>
> Note how the rounding (poll has the timeout in milliseconds) affects
> the actual timeouts when you are past multiples of 1/HZ.
>
> I know that until we have some hi-res interrupt source there is no
> hope to have better than 1/HZ granularity. However we are doing
> much worse by adding up to 2 extra ticks. This makes apps less
> responsive than they could be, and gives us no way to
> "yield until the next tick".
>
> So what I would like to do is add a sysctl (disabled by
> default) that enables a better approximation of the desired delay.
>
> I see in the kernel that all three syscalls loop around a blocking
> function (tsleep or seltdwait), and do check the "actual" elapsed
> time by calling getmicrouptime() or getnanouptime() around the
> sleeping function .  So the actual timeout passed to tsleep does
> not really matter (as long as it is greater than 0 ).
>
> The only concern is that getmicrouptime()/getnanouptime() are documented
> as "less precise, but faster to obtain". The question is how precise is
> "less precise": do we have some way to get an upper bound for the
> precision of the timers used in get*time(), so we can use that value
> in the equation instead of the extra 1/HZ that tvtohz() puts in
> after computing floor(timeout*HZ) ?

"less precise" there means they are updated on hardclock() invocation 
every 1/HZ.

> For reference, below is the core of usleep and select/poll
> (from kern_time.c and sys_generic.c)
>
>      usleep:
> 	getnanouptime(now)
> 	end = now + timeout;
> 	for (;;) {
> 		getnanouptime(now);
> 		delta = end - now;
> 		if (delta<= 0)
> 			break;
> 		tsleep(..., tvtohz(delta) )
> 	}
>
>      select/poll:
> 	itimerfix(timeout) // force at least 1/HZ
> 	getmicrouptime(now)
> 	end = now + timeout;
> 	for (;;) {
> 		delta = end - now;
> 		seltdwait(..., tvtohz(delta) )
> 		getmicrouptime(now);
> 		if (some_fd_is_ready() || now>= end)
> 			break;
> 	}
>


-- 
Alexander Motin

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 00:33:52 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 06A10106564A
	for <arch@FreeBSD.org>; Thu,  1 Mar 2012 00:33:52 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail06.syd.optusnet.com.au (mail06.syd.optusnet.com.au
	[211.29.132.187])
	by mx1.freebsd.org (Postfix) with ESMTP id 36A208FC17
	for <arch@FreeBSD.org>; Thu,  1 Mar 2012 00:33:50 +0000 (UTC)
Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au
	(c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136])
	by mail06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q210Xkbe009834
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 1 Mar 2012 11:33:48 +1100
Date: Thu, 1 Mar 2012 11:33:46 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Luigi Rizzo <rizzo@iet.unipi.it>
In-Reply-To: <20120229194042.GA10921@onelab2.iet.unipi.it>
Message-ID: <20120301071145.O879@besplex.bde.org>
References: <20120229194042.GA10921@onelab2.iet.unipi.it>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org
Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 00:33:52 -0000

On Wed, 29 Feb 2012, Luigi Rizzo wrote:

> I have always been annoyed by the fact that FreeBSD rounds timeouts
> in select/usleep/poll in very conservative ways, so i decided to
> try how other systems behave in this respect. Attached is a simple
> program that you should be able to compile and run on various OS
> and see what happens.

Many are broken, indeed.

The simple program isn't attached.

> Here are the results (HZ=1000 on the system under test, and FreeBSD
> has the same behaviour since at least 4.11):
>
> 	        |    Actual timeout
>                |      select            | poll  | usleep|
> 	timeout | FBSD  | Linux | OSX    | FBSD  | FBSD  |
> 	usec    | 9.0   | Vbox  | 10.6   |  9.0  |  9.0  |
> 	--------+-------+-------+--------+-------+-------+
> 	    1      2000      99       6     0      2000

Try HZ = 20 (possible, at the user's option, even with an i8254 timer)
or lower (possible, at the user's option, with better timers).  FreeBSD
should then get timeouts of up to 2/HZ = 100000 us.  Applications must
deal with this range somehow (maybe by telling the user to configure
HZ better).  In ttcp, I found the timeouts unusable and resorted to
an option to busy-wait.  The rate-limiting timeouts in tools/netrate 
don't work at all for small HZ, and barely work for large HZ, since
timeouts are similarly unusable.

It is possible to easily improve this to a maximum of only 1/HZ = 50000
us, at some cost to efficiency, by waking up 1 tick early, checking
if the timeout has expired, and sleeping for another tick if it hasn't.

Waking up early is needed anyway for long timeouts, in case the timeout
interrupts are running a little slower than their estimated frequency
-- being off by just 1 part per million will accumulate to an error
of 86400 us after 1 day, and FreeBSD cluster machines used to be off
by about 10%, giving an error of 2.4 hours for your appointment next
day.  The error will often be 100's of parts per million, giving an
error of 10's of seconds per day.  To handle the 10% error, timeouts
must wake up 10% early.
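
The drift figures above are plain arithmetic and can be checked with a few
lines of C.  This is only an illustrative sketch; drift_us() is a
hypothetical helper name, and the error is expressed in parts per million
so that 10% becomes 100000 ppm.

```c
#include <assert.h>

/*
 * Accumulated timing error, in microseconds, when the timeout interrupt
 * frequency is off by err_ppm parts per million for elapsed_s seconds.
 * One ppm of one second is exactly one microsecond, so the product of
 * the two arguments is already in microseconds.
 */
long long
drift_us(long long elapsed_s, long long err_ppm)
{

	assert(elapsed_s >= 0 && err_ppm >= 0);
	return (elapsed_s * err_ppm);
}
```

For one day (86400 s), drift_us(86400, 1) is 86400 us, and
drift_us(86400, 100000) is 8640000000 us, i.e. 8640 s or 2.4 hours.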

select(), poll(), and nanosleep() all check whether the timeout has
expired when they wake up, but they don't arrange to wake up early, and
their check for whether it has expired is even fuzzier than the timeout
granularity.  It uses the broken-as-designed getnanouptime() API to get
the current time; timecounters are only updated every
$(sysctl kern.timecounter.tick) ticks, by default to limit their
update frequency to about 1000 Hz when HZ is configured to be much
larger than 1000.  This ensures an extra, unnecessary inaccuracy of
up to 1000 us whenever the timeout wakes up a little early according
to the lagging clock used to measure the current time.  I think
the error is fail-safe -- it may extend the timeout by as much as
1000 usec.

> 	   10      2000     109      15     0      2000
> 	   50      2000     149      66     0      2000
> 	  100      2000     196     133     0      2000
> 	  500      2000     597     617     0      2000

You must have synced with timer interrupts to get the above.  Timeouts
in the current FreeBSD implementation should average the actual timeout
rounded up to a multiple of 1/HZ seconds, plus 0.5/HZ seconds, and thus
average 1.5/HZ = 1500 us for short timeouts.
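
The 1.5/HZ average can be reproduced with a small model.  The sketch below
assumes that the sleep is armed for ceil(timeout/tick) + 1 ticks and ends on
a tick boundary, and that the caller starts at a uniformly distributed
offset inside a tick; avg_sleep_us() is a hypothetical name, and none of
this is taken from the kernel source.

```c
#include <assert.h>

/*
 * Average sleep length, in microseconds, for a request of timeout_us when
 * the call starts `phase` microseconds into the current tick, with phase
 * sampled at every 1 us step in [0, tick_us).  Under the model, such a
 * call sleeps nticks * tick_us - phase microseconds.
 */
double
avg_sleep_us(long timeout_us, long hz)
{
	long tick_us, nticks, phase;
	double sum = 0.0;

	assert(timeout_us > 0 && hz > 0 && hz <= 1000000);
	tick_us = 1000000 / hz;
	nticks = (timeout_us + tick_us - 1) / tick_us + 1;
	for (phase = 0; phase < tick_us; phase++)
		sum += (double)(nticks * tick_us - phase);
	return (sum / (double)tick_us);
}
```

For HZ = 1000 and any request up to 1000 us this comes out at about 1500 us.
A test loop that calls the sleep again right after waking up always starts
with phase close to 0, which is why it sees the 2000 us worst case instead.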

Someone apparently broke poll() on FreeBSD :-(.

Linux and OSX must be using busy-waiting or expensive timer
reprogramming for short timeouts to work.  Linux-2.6.10 has 3491
references to udelay().  This seems to correspond to FreeBSD's DELAY().
The Linux nanosleep() code is too complicated for me to easily see
what it is doing for short timeouts, but I noticed that it isn't
missing clock_nanosleep() like FreeBSD is, and presumably has
a collaterally non-broken nanosleep().  (POSIX requires nanosleep()
to sleep in real time, to be bug for bug compatible with old sleep(),
but this is often not what is wanted.  So POSIX invented
clock_nanosleep() so as to be able to sleep on the monotonic clock
and on any other clock of interest.  FreeBSD doesn't know anything
about this, and only has nanosleep(), which sleeps on a wrong clock
(the monotonic one).)

> 	 1000      2000    1103    1136    2000    2000
> 	 1001      3000    1103    1136    2000    3000 <---
> 	 1500      3000    1608    1631    2000    3000 <---
>         2000	   3000    2096    2127    3000    3000
> 	 2001	   4000                    3000    4000 <---
> 	 3001	   5000                    4000    5000 <---
>
> Note how the rounding (poll has the timeout in milliseconds) affects
> the actual timeouts when you are past multiples of 1/HZ.

Also, timeouts that are just before a multiple of 1/HZ may be turned
into 1/HZ + 1000 usec by the inaccuracy of getnanouptime().  E.g., if
the requested timeout is 999, HZ is 1000, and tc_tick is 1, 999 should
be turned into 2 ticks (average 1500 usec).  Then, if the timeout for
the second tick is a little early according to the retarded clock,
the total timeout will be extended by another tick, to 3 ticks.  So
all timeouts may be extended by up to 2 ticks, instead of only ones
just larger than a multiple of 1/HZ being extended by the full 2 ticks.
This can be made much worse by setting tc_tick to a large value.

> I know that until we have some hi-res interrupt source there is no
> hope to have better than 1/HZ granularity. However we are doing
> much worse by adding up to 2 extra ticks. This makes apps less
> responsive than they could be, and gives us no way to
> "yield until the next tick".
>
> So what I would like to do is add a sysctl (disabled by
> default) that enables a better approximation of the desired delay.

It is possible to get a timeout every tick in userland using periodic
itimer's.  Maybe not for the initial timeout, but after that the timeouts
repeat with the specified period (rounded up to the next tick boundary,
but not up to the next + 1).
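
A minimal userland sketch of the periodic itimer approach, assuming
nothing else in the process uses SIGALRM.  wait_periodic_ticks() and
alarm_count are names invented for this example.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t alarm_count;

static void
on_alarm(int sig)
{
	(void)sig;
	alarm_count++;
}

/*
 * Arm a periodic ITIMER_REAL and wait for `n' expirations.
 * Returns the number of expirations seen, or -1 on error.
 */
static int
wait_periodic_ticks(long period_us, int n)
{
	struct sigaction sa, ign, old_sa;
	struct itimerval it, off;
	sigset_t block, old_mask, wait_mask;
	int rc;

	if (period_us <= 0)
		return (-1);
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;
	sigemptyset(&sa.sa_mask);

	/* Block SIGALRM so that it is only delivered inside sigsuspend(). */
	sigemptyset(&block);
	sigaddset(&block, SIGALRM);
	sigprocmask(SIG_BLOCK, &block, &old_mask);
	wait_mask = old_mask;
	sigdelset(&wait_mask, SIGALRM);
	sigaction(SIGALRM, &sa, &old_sa);

	alarm_count = 0;
	it.it_value.tv_sec = it.it_interval.tv_sec = period_us / 1000000;
	it.it_value.tv_usec = it.it_interval.tv_usec = period_us % 1000000;
	rc = setitimer(ITIMER_REAL, &it, NULL);
	while (rc == 0 && alarm_count < n)
		sigsuspend(&wait_mask);	/* atomically unblock and wait */

	/* Disarm, discard any pending SIGALRM, then restore the old state. */
	memset(&off, 0, sizeof(off));
	setitimer(ITIMER_REAL, &off, NULL);
	memset(&ign, 0, sizeof(ign));
	ign.sa_handler = SIG_IGN;
	sigemptyset(&ign.sa_mask);
	sigaction(SIGALRM, &ign, NULL);
	sigaction(SIGALRM, &old_sa, NULL);
	sigprocmask(SIG_SETMASK, &old_mask, NULL);
	return (rc == 0 ? (int)alarm_count : -1);
}
```

For example, wait_periodic_ticks(1000, 5) waits for five expirations of
a 1 ms periodic timer.  Only the first expiration pays for the initial
rounding.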

> I see in the kernel that all three syscalls loop around a blocking
> function (tsleep or seltdwait), and do check the "actual" elapsed
> time by calling getmicrouptime() or getnanouptime() around the
> sleeping function .  So the actual timeout passed to tsleep does
> not really matter (as long as it is greater than 0 ).
>
> The only concern is that getmicrouptime()/getnanouptime() are documented
> as "less precise, but faster to obtain". The question is how precise is
> "less precise": do we have some way to get an upper bound for the

It is tc_tick/HZ seconds (usually 1/HZ, but if you set HZ to be > 1000
to reduce this problem, then tc_tick will bite you unless you lower it
too.  Both may be acceptable on a real-timeish system that wants short
timeouts at the cost of efficiency.  But I don't like timeouts.  When
they are used a lot, they are a form of busy waiting.  This is only
acceptable if you have CPU to burn).

> precision of the timers used in get*time(), so we can use that value
> in the equation instead of the extra 1/HZ that tvtohz() puts in
> after computing floor(timeout*HZ) ?

It's always worse than 1/HZ if you use these broken-as-designed APIs.

> For reference, below is the core of usleep and select/poll
> (from kern_time.c and sys_generic.c)
>
>    usleep:
> 	getnanouptime(now)

Should use nanotime().  This also fixes the clock id.  But check whether
this is really required by POSIX.  It means that if someone steps the
clock, then the sleep may be extended or truncated significantly.  You
also need to fix the timeout used here to sleep in real time, so that
it doesn't wake up late by an amount of (negative of the step).  Timeouts
also sleep in monotonic time.

Sleeping in monotonic time seems to be wrong in more cases than it is
correct.  Another broken area is suspend/resume.  Suspending for an
hour stops all timeouts for an hour.  Then on resume, they aren't
adjusted, so they occur an hour later.  There is still code in
kern_timeout.c under APM_FIXUP_CALLTODO which was supposed to fix this
problem for apm, but it is so poorly maintained that it never even
compiled in any committed version, and there is no option for it.
Problems in this area were supposed to have been fixed by using
monotonic time more, but they seem to have actually been increased.

> 	end = now + timeout;

This is quite broken too:
- it doesn't arrange to wake up early.  Subtract 1 tick here for a quick
   fix for short timeouts.  Maybe 10% for timeouts longer than 10 ticks.
- it can overflow, giving undefined behaviour (in practice, just bad
   results later)
- itimerfix() is supposed to be used to prevent such overflows, but this
   is quite broken:
   - note that itimerfix() is quite different from timevalfix().  Although
     its name is spelled with an 'i' and a 'fix', itimerfix() is useful
     generally and doesn't fix anything except for bogusly adjusting
     fractional ticks.  It is also confusing because its name is spelled
     without a `val', since it was intended to be used only for itimers
     which used to use generic times (timevals) back before better
     representations of times existed.  What it mostly does is validity
     checking for timevals.  OTOH, timevalfix() does pure fixing (to
     handle carry after (possibly multiple) additions and subtractions).
   - itimerfix() used to limit tv_sec to 100 million seconds (else
     EINVAL), but someone broke it by removing this check
   - someone didn't update the man pages which document this limit.  It
     is still documented in at least:
     - setitimer(2).  This is its primary API
     - alarm(3)
     - ualarm(3).  This also bogusly documents what a microsecond is

     Grepping for 100000000 and 100.*million also shows many bad
     descriptions of the limit on tv_nsec for APIs that take a
     timespec arg.  This limit is described verbosely as "1000 million".
     Of course, "1 billion" cannot be used since it is ambiguous, and
     1000000000 should not be used because it is hard to see the number
     of zeros in it, but millions aren't naturally associated with the
     nano prefix.  I like to write such numbers in minimal floating
     point scientific notation (e.g., 1e9) or as powers of 10 (e.g.,
     10**9).  1e9 is better because it is shorter and doesn't need
     an ambiguous '^' operator or a less common Fortran '**' operator
     for exponentiation.  Not sure if this is best for man pages.

     select() and kqueue() used to have the same limit, but it was
     never documented for them.  Not sure about kqueue.

   - this limit isn't permitted by POSIX.
   - here we have timespecs.  itimespecfix() exists too (although
     timespecfix() doesn't).  itimespecfix() has the same semantics
     as itimerfix() (except of course it obviously acts on timespecs
     while itimerfix() unobviously acts on timevals).  itimespecfix()
     never had the limit on tv_sec.

     But itimespecfix() has the same bogus rounding up of fractional
     ticks as itimerfix().  This is only done for fractional ticks below 1.
     This should be unnecessary provided everywhere else is careful to
     round up and usually to add 1.  These *fix() functions never did the
     adding 1 part.  I think they are just defending against sloppy
     conversions that produce 0 ticks from small but nonzero timeouts.
     A timeout of 0 ticks means to sleep forever.  But relevant higher
     levels also defend against this, by silently changing 0 ticks to
     1 tick.
- back to bugs at the level of nanosleep().  It can't use itimerfix()
   since it deals with timespecs.  It should call something like
   itimespecfix() (except that should be named timespeccheck()...).
   But IIRC, itimespecfix() didn't exist when nanosleep() was implemented.
   Also, itimespecfix() has wrong semantics for use here.  Its bogus
   rounding up is exactly what you don't want.  I think it has other
   slightly mismatched semantics (perhaps a difference in error numbers),
   but the others are easy to fix up.  So nanosleep() rolled its own
   checking, and got it wrong.
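
A pure checking function of the kind suggested above might look like the
sketch below.  The name timespeccheck() is the hypothetical one proposed
in the text.  The sketch checks only the tv_nsec range, never modifies
its argument, and leaves the policy for negative tv_sec to the caller.

```c
#include <errno.h>
#include <time.h>

/*
 * Validity check for a timeout given as a timespec.  Returns 0 if
 * tv_nsec is in range, EINVAL otherwise.  Unlike itimerfix(), nothing
 * is rounded up, so small nonzero timeouts pass through unchanged.
 */
static int
timespeccheck(const struct timespec *ts)
{
	if (ts->tv_nsec < 0 || ts->tv_nsec >= 1000000000L)
		return (EINVAL);
	return (0);
}
```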

The overflow seems easy to fix as a side effect of waking up early:
- for preposterously long timeouts, wake up after 100 million
   seconds or similar, instead of 10% early.  This delays the problem.
   100 million seconds is a little over 3 years, so it won't expire
   in practice, and no one would care if it did.  But this method
   (and the old limit) breaks down about 3 years before the time_t's
   roll over.  So in 2036, you need to limit the timeout to only 2
   years instead of 3, if you are still using 31-bit (sic) time_t's
   with a useless sign bit then.  And in 2105, you need to limit the
   timeout to only 2 years instead of 3, if you are still using 32-bit
   unsigned time_t's then.

   Be careful with overflow even with this fix.  Applications probing
   for kernel bugs will try using maximal tv_sec.  Since POSIX doesn't
   allow rejecting these like the old 100 million second limit did,
   we must start with a long sleep and retain the original preposterous
   timeout so that we can return it as the unslept time.  We have to
   be careful about overflow when adding the preposterous time to the
   current time.  Large time_t's don't do anything to limit this overflow,
   since the application can ask for the maximum (2**31-1 or 2**63-1
   in practice), and since the current time is surely >= 1, adding the
   current time to the preposterous time surely gives overflow.

   Example of a not-unreasonable POSIX application to probe for bugs in
   this area:

       set tv.tv_sec and tv.tv_nsec to the maximum possible ("infinity")
       arrange for a signal after 1 second
       nanosleep(&tv, &tv2);
       check that tv2 is about 1 second below the maximum possible
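
One way to keep `end = now + timeout' from overflowing is a saturating
addition, sketched below.  This is an illustration, not existing kernel
code: timespec_add_sat() and TIME_T_MAX are names made up here, time_t
is assumed to be a signed integer type, and both arguments are assumed
to be non-negative and normalized.

```c
#include <limits.h>
#include <stdint.h>
#include <time.h>

/* Largest time_t, ASSUMING time_t is a signed integer type. */
#define TIME_T_MAX \
	((time_t)(UINTMAX_MAX >> \
	    ((sizeof(uintmax_t) - sizeof(time_t)) * CHAR_BIT + 1)))

/* Return now + timeout, clamped to the largest representable time. */
static struct timespec
timespec_add_sat(struct timespec now, struct timespec timeout)
{
	struct timespec end;
	long nsec = now.tv_nsec + timeout.tv_nsec;	/* < 2e9, fits */
	int carry = 0;

	if (nsec >= 1000000000L) {
		nsec -= 1000000000L;
		carry = 1;
	}
	/* The second test only runs when the plain sum cannot overflow. */
	if (timeout.tv_sec > TIME_T_MAX - now.tv_sec ||
	    (carry && now.tv_sec + timeout.tv_sec == TIME_T_MAX)) {
		end.tv_sec = TIME_T_MAX;
		end.tv_nsec = 999999999L;
		return (end);
	}
	end.tv_sec = now.tv_sec + timeout.tv_sec + carry;
	end.tv_nsec = nsec;
	return (end);
}
```

The caller still has to keep the original timeout around in order to
report the unslept time correctly.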

> 	for (;;) {
> 		getnanouptime(now);

Our original `now' and thus `end' are retarded by up to tc_tick/HZ.
This `now' is retarded too.  This complicates the analysis and
changes its results.

> 		delta = end - now;

So `delta' might not even be retarded.  If `end' is normal but `now'
is retarded, then `delta' is advanced.  This case is fail-safe but
not what you want (sleep again).  If `end' is retarded but `now' is
normal, then the retardation in `delta' is maximal (still less than
tc_tick/HZ).  This case is fail-unsafe (return up to tc_tick/HZ
early).  Other cases are in between these, with the retardations
partially or completely cancelling.  By making tc_tick large, the
fail-unsafe case can be made to more than overcome the safety margin
of about 1/HZ given by always adding 1.  I only just noticed this
detail.
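
The bounds in this analysis can be put into numbers with the sketch
below.  Both helpers are hypothetical, and the safety margin is treated
as exactly one tick although it is only "about" that.

```c
#include <stdint.h>

/* Upper bound (us) on how stale getnanouptime() and friends can be. */
static int64_t
uptime_max_lag_us(int tc_tick, int hz)
{
	return ((int64_t)tc_tick * 1000000 / hz);
}

/*
 * Upper bound (us) on how early the sleep loop can return: the maximal
 * retardation of `delta' minus the one-tick margin from adding 1.
 */
static int64_t
worst_case_early_us(int tc_tick, int hz)
{
	int64_t early = uptime_max_lag_us(tc_tick, hz) - 1000000 / hz;

	return (early > 0 ? early : 0);
}
```

With hz = 1000 and tc_tick = 1 the margin just covers the lag.  With
tc_tick = 10 the loop could return up to 9000 us early.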

> 		if (delta <= 0)
> 			break;
> 		tsleep(..., tvtohz(delta) )
> 	}
>
>    select/poll:
> 	itimerfix(timeout) // force at least 1/HZ

That's "bogusly force".  It doesn't add 1 or do anything if the timeout
is above 1/HZ, but both are done later.

> 	getmicrouptime(now)
> 	end = now + timeout;

Same retardation and overflow bugs.

> 	for (;;) {

Missing showing getmicrouptime(now) here?

> 		delta = end - now;
> 		seltdwait(..., tvtohz(delta) )

tvtohz() does round up and add 1.  Its interaction with the above is
unclear.  I think there is double rounding up in some cases.  For
example, even without the retardation, a delta that wants to be
precisely 1 tick (much more than it should be due to the first rounding
up) may be 1 us over 1 tick due to minor inaccuracies.  Then tvtohz()
will round up again.  nanosleep() has the same problem.  To handle this,
we may need to subtract 1 more from the result of tvtohz():
- always subtract 10%
- subtract 1 to compensate for tvtohz() always adding 1.  See the periodic
   itimer code for this fixup.  Periodic itimers need to be more careful
   for long timeouts too.
- subtract 1 more in case delta is a little too large.  Better look at
   delta and not always do this.
- if the resulting timeout is <= 0, change it to 1.
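
A sketch of the adjustments listed above.  tvtohz_sketch() is a
simplified stand-in for tvtohz() without its overflow handling, and
sleep_ticks_sketch() is a made-up name.  The 10% reduction is applied
only above 10 ticks, as suggested earlier for nanosleep(), and the
optional "subtract 1 more" step is left out.  It only makes sense when
the caller checks the clock again after every wakeup.

```c
#include <stdint.h>

/* Simplified stand-in for tvtohz(): round up to whole ticks, add 1. */
static int64_t
tvtohz_sketch(int64_t delta_us, int hz)
{
	int64_t tick_us = 1000000 / hz;

	return ((delta_us + tick_us - 1) / tick_us + 1);
}

/* Tick count to sleep for before re-checking the clock. */
static int64_t
sleep_ticks_sketch(int64_t delta_us, int hz)
{
	int64_t ticks = tvtohz_sketch(delta_us, hz);

	ticks -= 1;			/* undo the 1 that tvtohz() adds */
	if (ticks > 10)
		ticks -= ticks / 10;	/* long sleeps: wake about 10% early */
	if (ticks <= 0)
		ticks = 1;		/* 0 ticks would mean "forever" */
	return (ticks);
}
```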

> 		getmicrouptime(now);
> 		if (some_fd_is_ready() || now >= end)
> 			break;
> 	}

Activity on the fd's is likely to give more wakeups than nanosleep()
gets, since the latter only gets woken up for its own timeout and signals.
For this and other reasons, large timeouts are probably smaller for
select() than for nanosleep().  And for poll(), you just can't ask for
a large timeout (the limit is (2**31-1) milliseconds ~= 24.8 days with
32-bit ints, as is the case on all supported arches).

It remains to explain why the above results show that poll() but not
select() is broken for small timeouts (they are turned into 0 us for
poll() and 2000 us for select()).  Well, the granularity for poll is 1
ms, so this looks like just an application bug, with timeouts of < 1 ms
being rounded down to 0 before the kernel sees them.
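
An application that wants "at least this long" from poll() has to round
up when converting to milliseconds.  A minimal sketch, with a helper
name invented for the example:

```c
#include <limits.h>

/* Convert microseconds to a poll() timeout in ms, rounding UP. */
static int
usec_to_poll_ms(long usec)
{
	long ms;

	if (usec <= 0)
		return (0);
	ms = usec / 1000 + (usec % 1000 != 0);
	return (ms > INT_MAX ? INT_MAX : (int)ms);
}
```

This turns a request of 1 us into 1 ms instead of 0, so the kernel at
least sees a nonzero timeout.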

But how do the other OS's see it?  This might be due to them taking a
long time to handle null timeouts, and their times actually being
reported correctly.
I don't believe the times of 0 and 2000 us reported for FreeBSD.  You
can't do anything in 0 us, and 2000 us is too round a number.  These
round numbers might be due to using the broken as designed
CLOCK_MONOTONIC_FAST_N_BROKEN clock ids.  These are collateral with
getnanouptime() etc.  They are even more broken as designed, since
provided you have non-slow timecounter hardware, the time for
clock_gettime() is dominated by syscall overhead, so
CLOCK_MONOTONIC_FAST_N_BROKEN is only a few percent faster than
CLOCK_MONOTONIC.  Typical numbers are:
- 12 (9?) cycles for the hardware part of a TSC timecounter on an old
   Athlon64.  ~250 cycles for the total syscall overhead for an old version
   of FreeBSD UP on almost any x86.  Possible savings from the "fast"
   method: about 5%.  More like 10% due to extra software overhead for the
   timecounter.
- 42 (?) cycles for the hardware part of a TSC timecounter on Phenom+
   (synchronization across CPUs makes it much slower).  Similarly for
   most modern multi-core CPUs.  Intel CPUs were much slower than 12
   cycles even for old single-core ones. ~350 cycles for the total
   syscall overhead for a current version of FreeBSD SMP on almost any
   x86.  Possible savings from the "fast" method: about 15%.  IIRC,
   SMP only costs 20-30 cycles, with the extra 100 being mainly from
   extra layers.
- 1000-2000 nsec (up to ~8000 cycles) for an ACPI-FAST timecounter.
   These are actually ACPI-SLOW.  Now the hardware overhead dominates.
   HPET is better, but has become common at the same time as the TSC
   became usable for SMP, so it is rarely useful as a timecounter.
- up to 5000 nsec for an i8254 timecounter on a modern CPU.  It gets
   slower with more bridges between the CPU and the ISA bus.
- up to 30000 nsec for an i8254 timecounter on a 486.
OTOH, getnanouptime() takes about 12 cycles (very fuzzy estimate), so
it is 2-3 times faster than nanouptime() using the fastest TSC hardware,
and about 5-6 times faster than nanouptime() using slower TSC hardware.

IIRC, itimer code doesn't do these checks of the time after wakeups at
all.  Not sure what kqueue does.

I haven't really touched nanosleep(), but have some small fixes near
the tvtohz() call for select() and poll().

% Index: sys_generic.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/kern/sys_generic.c,v
% retrieving revision 1.131
% diff -u -2 -r1.131 sys_generic.c
% --- sys_generic.c	5 Apr 2004 21:03:35 -0000	1.131
% +++ sys_generic.c	13 Aug 2009 11:21:29 -0000
% @@ -806,9 +797,5 @@
%  		getmicrouptime(&rtv);
%  		timevaladd(&atv, &rtv);
% -	} else {
% -		atv.tv_sec = 0;
% -		atv.tv_usec = 0;
%  	}
% -	timo = 0;
%  	TAILQ_INIT(&td->td_selq);
%  	mtx_lock(&sellock);
% @@ -824,5 +811,7 @@
%  	if (error || td->td_retval[0])
%  		goto done;
% -	if (atv.tv_sec || atv.tv_usec) {
% +	if (tvp == NULL)
% +		timo = 0;
% +	else {
%  		getmicrouptime(&rtv);
%  		if (timevalcmp(&rtv, &atv, >=))

Unrelated cleanups of initialization.

% @@ -830,13 +819,10 @@
%  		ttv = atv;
%  		timevalsub(&ttv, &rtv);
% -		timo = ttv.tv_sec > 24 * 60 * 60 ?
% -		    24 * 60 * 60 * hz : tvtohz(&ttv);
% +		timo = tvtohz(&ttv);

The special case for timeouts of > 1 day defeats the careful overflow
handling in tvtohz().  It is supposed to be for avoiding overflow, but
tvtohz() avoids it already, while the above causes it whenever hz
is large but not preposterously so, so that 24 * 60 * 60 * hz overflows.
hz only needs to be 24856 for overflow.  25000 is almost reasonable
for excessive polling, and I have tested lapic timer interrupts though
not hz at 1MHz.
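
The overflow threshold can be checked with a few lines of C.  The
helpers are made up for the example and assume a 32-bit int, as on all
supported arches.

```c
#include <limits.h>

/* Nonzero if 24 * 60 * 60 * hz does not fit in a signed int. */
static int
one_day_ticks_overflows(int hz)
{
	return ((long long)24 * 60 * 60 * hz > INT_MAX);
}

/* Smallest hz for which the product overflows int. */
static int
min_overflowing_hz(void)
{
	return (INT_MAX / (24 * 60 * 60) + 1);
}
```

The product is evaluated in long long here so that the check itself
cannot overflow.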

However, reduction of the timeout to a value that will wake up 10% early 
is the first step of fixing the bugs discussed above.  Reduction to
1 day accomplishes this for timeouts of >= 1.1 days provided it doesn't
overflow.

%  	}
% 
%  	/*
% -	 * An event of interest may occur while we do not hold
% -	 * sellock, so check TDF_SELECT and the number of
% -	 * collisions and rescan the file descriptors if
% -	 * necessary.
% +	 * An event of interest may have occurred while we did not hold
% +	 * sellock.  Check for this and rescan if necessary.
%  	 */
%  	mtx_lock_spin(&sched_lock);

Unrelated.

% @@ -985,4 +978,5 @@
%  	if (uap->timeout != INFTIM) {
%  		atv.tv_sec = uap->timeout / 1000;
% +		/* XXX wrong if timeout < 0. */
%  		atv.tv_usec = (uap->timeout % 1000) * 1000;

Since the '%' operator is broken for negative values in C, this gives
a negative tv_usec when the timeout is negative.

%  		if (itimerfix(&atv)) {

itimerfix() then returns EINVAL, and the syscall fails.  But a timeout
of < 0 should be equivalent to a timeout of 0, as it is for select()
and nanosleep().  This can be implemented either by fixing C, or
by fixing the expression, or by just changing negative timeouts to 0.
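
The last option can be sketched as below.  poll_timeout_to_timeval() is
a hypothetical helper, and the caller is assumed to have handled INFTIM
already, as the code above does.

```c
#include <sys/time.h>

/*
 * Convert a poll() timeout in milliseconds to a timeval, treating
 * negative values like 0.  Without the clamp, C's '%' would give a
 * negative tv_usec for a negative timeout (e.g. -5 % 1000 == -5).
 */
static struct timeval
poll_timeout_to_timeval(int timeout_ms)
{
	struct timeval tv;

	if (timeout_ms < 0)
		timeout_ms = 0;
	tv.tv_sec = timeout_ms / 1000;
	tv.tv_usec = (timeout_ms % 1000) * 1000;
	return (tv);
}
```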

% @@ -992,9 +986,5 @@
%  		getmicrouptime(&rtv);
%  		timevaladd(&atv, &rtv);
% -	} else {
% -		atv.tv_sec = 0;
% -		atv.tv_usec = 0;
%  	}
% -	timo = 0;
%  	TAILQ_INIT(&td->td_selq);
%  	mtx_lock(&sellock);
% @@ -1006,9 +996,11 @@
%  	mtx_unlock(&sellock);
% 
% -	error = pollscan(td, (struct pollfd *)bits, nfds);
% +	error = pollscan(td, bits, nfds);
%  	mtx_lock(&sellock);
%  	if (error || td->td_retval[0])
%  		goto done;
% -	if (atv.tv_sec || atv.tv_usec) {
% +	if (uap->timeout == INFTIM)
% +		timo = 0;
% +	else {
%  		getmicrouptime(&rtv);
%  		if (timevalcmp(&rtv, &atv, >=))

Unrelated cleanups.

% @@ -1016,12 +1008,8 @@
%  		ttv = atv;
%  		timevalsub(&ttv, &rtv);
% -		timo = ttv.tv_sec > 24 * 60 * 60 ?
% -		    24 * 60 * 60 * hz : tvtohz(&ttv);
% +		timo = tvtohz(&ttv);

As for select().

%  	}
% -	/*
% -	 * An event of interest may occur while we do not hold
% -	 * sellock, so check TDF_SELECT and the number of collisions
% -	 * and rescan the file descriptors if necessary.
% -	 */
% +
% +	/* Rescan if necessary, as above. */

Don't repeat comments ad nauseam.  There used to be large grammar errors
in these comments.  -current may have cleaned them up differently.

%  	mtx_lock_spin(&sched_lock);
%  	if ((td->td_flags & TDF_SELECT) == 0 || nselcoll != ncoll) {

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 00:44:04 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 61B8F106564A
	for <arch@FreeBSD.org>; Thu,  1 Mar 2012 00:44:04 +0000 (UTC)
	(envelope-from luigi@onelab2.iet.unipi.it)
Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238])
	by mx1.freebsd.org (Postfix) with ESMTP id 228918FC0A
	for <arch@FreeBSD.org>; Thu,  1 Mar 2012 00:44:03 +0000 (UTC)
Received: by onelab2.iet.unipi.it (Postfix, from userid 275)
	id 0B3937300A; Thu,  1 Mar 2012 02:02:19 +0100 (CET)
Date: Thu, 1 Mar 2012 02:02:19 +0100
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Bruce Evans <brde@optusnet.com.au>
Message-ID: <20120301010219.GA14508@onelab2.iet.unipi.it>
References: <20120229194042.GA10921@onelab2.iet.unipi.it>
	<20120301071145.O879@besplex.bde.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120301071145.O879@besplex.bde.org>
User-Agent: Mutt/1.4.2.3i
Cc: arch@FreeBSD.org
Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 00:44:04 -0000

On Thu, Mar 01, 2012 at 11:33:46AM +1100, Bruce Evans wrote:
> On Wed, 29 Feb 2012, Luigi Rizzo wrote:
> 
> >I have always been annoyed by the fact that FreeBSD rounds timeouts
> >in select/usleep/poll in very conservative ways, so i decided to
> >try how other systems behave in this respect. Attached is a simple
> >program that you should be able to compile and run on various OS
> >and see what happens.
> 
> Many are broken, indeed.
> 
> The simple program isn't attached.

attachment stripped by the mailing list, retrying to put it inline
(and comments on a followup email)


----



/*
 * test minimum select time
 *
 *	./prog usec [method [duration]]
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/select.h>	/* select() */
#include <sys/time.h>
#include <poll.h>

enum { M_SELECT = 0, M_POLL, M_USLEEP };
static const char *names[] = { "select", "poll", "usleep" };

int
main(int argc, char *argv[])
{
	struct timeval ta, tb;
	int usec = 1, total = 0, method = M_SELECT, count = 0;

	if (argc > 1)
		usec = atoi(argv[1]);
	if (usec <= 0)
		usec = 1;
	else if (usec > 500000)
		usec = 500000;
	if (argc > 2) {
		if (!strcmp(argv[2], "poll"))
			method = M_POLL;
		else if (!strcmp(argv[2], "usleep"))
			method = M_USLEEP;
	}
	if (argc > 3)
		total = atoi(argv[3]);
	if (total < 1)
		total = 1;
	else if (total > 10)
		total = 10;
	fprintf(stderr, "testing %s for %dus over %ds\n",
		names[method], usec, total);

	gettimeofday(&ta, NULL);
	for (;;) {
		if (method == M_SELECT) {
			struct timeval to = { 0, usec };
			select(0, NULL, NULL, NULL, &to);
		} else if (method == M_POLL) {
			poll(NULL, 0, usec/1000);
		} else {
			usleep(usec);
		}
		count++;
		gettimeofday(&tb, NULL);
		timersub(&tb, &ta, &tb);
		if (tb.tv_sec >= total)	/* stop after `total' whole seconds */
			break;
	}
	fprintf(stderr, "%dus actually took %dus\n",
		usec, (int)(tb.tv_sec * 1000000 + tb.tv_usec) / count );
	return 0;
}
-----

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 01:05:01 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4ACD91065673
	for <arch@FreeBSD.org>; Thu,  1 Mar 2012 01:05:01 +0000 (UTC)
	(envelope-from luigi@onelab2.iet.unipi.it)
Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238])
	by mx1.freebsd.org (Postfix) with ESMTP id 066B38FC19
	for <arch@FreeBSD.org>; Thu,  1 Mar 2012 01:05:00 +0000 (UTC)
Received: by onelab2.iet.unipi.it (Postfix, from userid 275)
	id CC6957300A; Thu,  1 Mar 2012 02:23:15 +0100 (CET)
Date: Thu, 1 Mar 2012 02:23:15 +0100
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Bruce Evans <brde@optusnet.com.au>
Message-ID: <20120301012315.GB14508@onelab2.iet.unipi.it>
References: <20120229194042.GA10921@onelab2.iet.unipi.it>
	<20120301071145.O879@besplex.bde.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120301071145.O879@besplex.bde.org>
User-Agent: Mutt/1.4.2.3i
Cc: arch@FreeBSD.org
Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX

On Thu, Mar 01, 2012 at 11:33:46AM +1100, Bruce Evans wrote:
> On Wed, 29 Feb 2012, Luigi Rizzo wrote:
> 
> >I have always been annoyed by the fact that FreeBSD rounds timeouts
> >in select/usleep/poll in very conservative ways, so i decided to
> >try how other systems behave in this respect. Attached is a simple
> >program that you should be able to compile and run on various OS
> >and see what happens.
> 
> Many are broken, indeed.
> 
> The simple program isn't attached.
...

> >
> >	        |    Actual timeout
> >               |      select            | poll  | usleep|
> >	timeout | FBSD  | Linux | OSX    | FBSD  | FBSD  |
> >	usec    | 9.0   | Vbox  | 10.6   |  9.0  |  9.0  |
> >	--------+-------+-------+--------+-------+-------+
> >	    1      2000      99       6     0      2000
> >	   10      2000     109      15     0      2000
> >	   50      2000     149      66     0      2000
> >	  100      2000     196     133     0      2000
> >	  500      2000     597     617     0      2000
> >	 1000      2000    1103    1136    2000    2000
> >	 1001      3000    1103    1136    2000    3000 <---
> >	 1500      3000    1608    1631    2000    3000 <---
> >      2000	   3000    2096    2127    3000    3000
> >	 2001	   4000                    3000    4000 <---
> >	 3001	   5000                    4000    5000 <---
> >
> >Note how the rounding (poll has the timeout in milliseconds) affects
> 
> You must have synced with timer interrupts to get the above.  Timeouts

yes i have -- the test code does almost nothing after returning from
a select, on a system that does some amount of work times could be
up to 1000us shorter. Still a huge error on short timeouts.

I should also comment that these are average values on an otherwise
idle system -- i will try to post a histogram of the actual values,
it might well be that osx and linux have quantized values very
different from the average (though this would violate the specs,
so i suspect instead that they have some cheap one-shot timers).

For FreeBSD I have also rounded the bsd values (actual averages are -1/+3us
over 1sec experiments).

> timeouts at the cost of efficiency.  But I don't like timeouts.  When
> they are used a lot, they are a form of busy waiting.  This is only
> acceptable if you have CPU to burn).

sometimes you have no other way to get a notification.

> It remains to explain why the above results show that poll() but not
> select() is broken for small timeouts (they are turned into 0 us for

no, it is just my application that does the rounding down, as the
API only accepts milliseconds.

Thanks for the extensive comments.

cheers
luigi

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 01:16:53 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 576021065670
	for <arch@FreeBSD.org>; Thu,  1 Mar 2012 01:16:53 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (ns1.bitblocks.com [173.228.5.8])
	by mx1.freebsd.org (Postfix) with ESMTP id 3EC078FC16
	for <arch@FreeBSD.org>; Thu,  1 Mar 2012 01:16:53 +0000 (UTC)
Received: from bitblocks.com (localhost [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id CB4AC1CC32;
	Wed, 29 Feb 2012 16:58:54 -0800 (PST)
To: Bruce Evans <brde@optusnet.com.au>
In-reply-to: Your message of "Thu, 01 Mar 2012 11:33:46 +1100."
	<20120301071145.O879@besplex.bde.org> 
References: <20120229194042.GA10921@onelab2.iet.unipi.it>
	<20120301071145.O879@besplex.bde.org>
Comments: In-reply-to Bruce Evans <brde@optusnet.com.au>
	message dated "Thu, 01 Mar 2012 11:33:46 +1100."
Date: Wed, 29 Feb 2012 16:58:54 -0800
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20120301005854.CB4AC1CC32@mail.bitblocks.com>
Cc: arch@FreeBSD.org
Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX 

On Thu, 01 Mar 2012 11:33:46 +1100 Bruce Evans <brde@optusnet.com.au>  wrote:
> Linux and OSX must be using busy-waiting or expensive timer
> reprogramming for short timeouts to work.

Linux-2.6.17 or later have two options: CONFIG_NO_HZ for on
demand timer interrupts (to reduce power use on idle systems)
and CONFIG_HIGH_RES_TIMERS for as accurate timers as h/w would
allow. And yes, timers are reprogrammed (as per a June 23,
2006  kerneltrap.org article).

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 01:47:02 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 41DE3106564A
	for <arch@FreeBSD.org>; Thu,  1 Mar 2012 01:47:02 +0000 (UTC)
	(envelope-from luigi@onelab2.iet.unipi.it)
Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238])
	by mx1.freebsd.org (Postfix) with ESMTP id 068428FC08
	for <arch@FreeBSD.org>; Thu,  1 Mar 2012 01:47:00 +0000 (UTC)
Received: by onelab2.iet.unipi.it (Postfix, from userid 275)
	id E4EF37300A; Thu,  1 Mar 2012 03:05:15 +0100 (CET)
Date: Thu, 1 Mar 2012 03:05:15 +0100
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Bruce Evans <brde@optusnet.com.au>
Message-ID: <20120301020515.GA14996@onelab2.iet.unipi.it>
References: <20120229194042.GA10921@onelab2.iet.unipi.it>
	<20120301071145.O879@besplex.bde.org>
	<20120301012315.GB14508@onelab2.iet.unipi.it>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120301012315.GB14508@onelab2.iet.unipi.it>
User-Agent: Mutt/1.4.2.3i
Cc: arch@FreeBSD.org
Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX

On Thu, Mar 01, 2012 at 02:23:15AM +0100, Luigi Rizzo wrote:
> On Thu, Mar 01, 2012 at 11:33:46AM +1100, Bruce Evans wrote:
> > On Wed, 29 Feb 2012, Luigi Rizzo wrote:
> > 
> > >I have always been annoyed by the fact that FreeBSD rounds timeouts
> > >in select/usleep/poll in very conservative ways, so i decided to
> > >try how other systems behave in this respect. Attached is a simple
> > >program that you should be able to compile and run on various OS
> > >and see what happens.
> > 
> > Many are broken, indeed.
> > 
> > The simple program isn't attached.
> ...
> 
> > >
> > >           |        Actual timeout (usec)
> > >           |        select         | poll  | usleep
> > >   timeout | FBSD  | Linux | OSX   | FBSD  | FBSD
> > >   (usec)  | 9.0   | Vbox  | 10.6  | 9.0   | 9.0
> > >  ---------+-------+-------+-------+-------+--------
> > >         1 |  2000 |    99 |     6 |     0 |  2000
> > >        10 |  2000 |   109 |    15 |     0 |  2000
> > >        50 |  2000 |   149 |    66 |     0 |  2000
> > >       100 |  2000 |   196 |   133 |     0 |  2000
> > >       500 |  2000 |   597 |   617 |     0 |  2000
> > >      1000 |  2000 |  1103 |  1136 |  2000 |  2000
> > >      1001 |  3000 |  1103 |  1136 |  2000 |  3000  <---
> > >      1500 |  3000 |  1608 |  1631 |  2000 |  3000  <---
> > >      2000 |  3000 |  2096 |  2127 |  3000 |  3000
> > >      2001 |  4000 |       |       |  3000 |  4000  <---
> > >      3001 |  5000 |       |       |  4000 |  5000  <---
> > >
> > >Note how the rounding (poll has the timeout in milliseconds) affects
> > 
> > You must have synced with timer interrupts to get the above.  Timeouts
> 
> yes i have -- the test code does almost nothing after returning from
> a select, on a system that does some amount of work times could be
> up to 1000us shorter. Still a huge error on short timeouts.
> 
> I should also comment that these are average values on an otherwise
> idle system -- i will try to post a histogram of the actual values,

Below are the statistics of select() delays on my MacBook for
timeouts of 1-10-50-100-500-1000-1001 us

Interestingly, some of the delays are up to 25us shorter than they
should be, while the average is higher than the requested value (the
excess tends to settle at 100-150us for large delays).

    > ministat -n ~/d1 ~/d10 ~/d50 ~/d100 ~/d500 ~/d1000 ~/d1001
    x /home/luigi/d1
    + /home/luigi/d10
    * /home/luigi/d50
    % /home/luigi/d100
    # /home/luigi/d500
    @ /home/luigi/d1000
    O /home/luigi/d1001
	N           Min           Max        Median           Avg        Stddev
    x 305202             0           943             7      6.553037         2.134
    + 130798             0           862            15     15.290815     2.6807354
    * 30265            18          1002            66     66.083562     10.170399
    % 14480            75          1072           137     138.12894     29.507796
    # 3146           474          1098           656     635.87603     48.670018
    @ 1750           987          1924          1158     1143.2394     48.220706
    O 1748           986          2337          1159     1144.4102     53.547987
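
The test program itself was not attached to the thread. A minimal sketch
of how such a measurement could be taken follows, assuming a POSIX system
with clock_gettime(CLOCK_MONOTONIC). The helper names now_us,
measure_select_us and average_select_us are hypothetical and are not taken
from the original program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/select.h>
#include <sys/time.h>
#include <time.h>

/* Read the monotonic clock and return it in microseconds. */
static int64_t
now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/*
 * Sleep in select() with no descriptors and the given timeout, and
 * return how long the call actually took, in microseconds.
 */
static int64_t
measure_select_us(long timeout_us)
{
	struct timeval tv;
	int64_t start;

	assert(timeout_us >= 0);
	tv.tv_sec = timeout_us / 1000000;
	tv.tv_usec = timeout_us % 1000000;
	start = now_us();
	(void)select(0, NULL, NULL, NULL, &tv);
	return now_us() - start;
}

/* Average the measured delay over n back-to-back calls. */
static double
average_select_us(long timeout_us, int n)
{
	int64_t total = 0;
	int i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		total += measure_select_us(timeout_us);
	return (double)total / n;
}
```

Printing one measure_select_us() result per line to a file would give
input in the form ministat expects.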

cheers
luigi

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 03:14:18 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 98566106564A
	for <arch@freebsd.org>; Thu,  1 Mar 2012 03:14:18 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail28.syd.optusnet.com.au (mail28.syd.optusnet.com.au
	[211.29.133.169])
	by mx1.freebsd.org (Postfix) with ESMTP id 37B998FC0A
	for <arch@freebsd.org>; Thu,  1 Mar 2012 03:14:17 +0000 (UTC)
Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au
	(c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136])
	by mail28.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q213EEKr031705
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 1 Mar 2012 14:14:15 +1100
Date: Thu, 1 Mar 2012 14:14:14 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Luigi Rizzo <rizzo@iet.unipi.it>
In-Reply-To: <20120301012315.GB14508@onelab2.iet.unipi.it>
Message-ID: <20120301132806.O2255@besplex.bde.org>
References: <20120229194042.GA10921@onelab2.iet.unipi.it>
	<20120301071145.O879@besplex.bde.org>
	<20120301012315.GB14508@onelab2.iet.unipi.it>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org
Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX
X-List-Received-Date: Thu, 01 Mar 2012 03:14:18 -0000

On Thu, 1 Mar 2012, Luigi Rizzo wrote:

> On Thu, Mar 01, 2012 at 11:33:46AM +1100, Bruce Evans wrote:
>> On Wed, 29 Feb 2012, Luigi Rizzo wrote:
>>>           |        Actual timeout (usec)
>>>           |        select         | poll  | usleep
>>>   timeout | FBSD  | Linux | OSX   | FBSD  | FBSD
>>>   (usec)  | 9.0   | Vbox  | 10.6  | 9.0   | 9.0
>>>  ---------+-------+-------+-------+-------+--------
>>>         1 |  2000 |    99 |     6 |     0 |  2000
>>>        10 |  2000 |   109 |    15 |     0 |  2000
>>>        50 |  2000 |   149 |    66 |     0 |  2000
>>>       100 |  2000 |   196 |   133 |     0 |  2000
>>>       500 |  2000 |   597 |   617 |     0 |  2000
>>>      1000 |  2000 |  1103 |  1136 |  2000 |  2000
>>>      1001 |  3000 |  1103 |  1136 |  2000 |  3000  <---
>>>      1500 |  3000 |  1608 |  1631 |  2000 |  3000  <---
>>>      2000 |  3000 |  2096 |  2127 |  3000 |  3000
>>>      2001 |  4000 |       |       |  3000 |  4000  <---
>>>      3001 |  5000 |       |       |  4000 |  5000  <---
>>>
>>> Note how the rounding (poll has the timeout in milliseconds) affects
>>
>> You must have synced with timer interrupts to get the above.  Timeouts
>
> yes i have -- the test code does almost nothing after returning from
> a select, on a system that does some amount of work times could be
> up to 1000us shorter. Still a huge error on short timeouts.

I get the sync but not the rounded timeouts, on my ~5.2 kernel with
HZ = 100.  The times are typically 19900-19993 for rounding up 1 us
to 2 ticks.
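
The "round up to ticks, then add one tick" behaviour can be written down
as a small model. This is a sketch inferred from the numbers reported in
this thread, not code from the kernel; expected_sleep_us is a hypothetical
name, and the result is the upper bound seen when the caller is synced to
the tick boundary.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of a tick-based timeout: the request is rounded up to a whole
 * number of ticks and one extra tick is added.  Returns microseconds.
 */
static uint64_t
expected_sleep_us(uint64_t timeout_us, unsigned hz)
{
	uint64_t tick_us, ticks;

	assert(hz > 0 && hz <= 1000000);
	tick_us = 1000000 / hz;				/* length of one tick */
	ticks = (timeout_us + tick_us - 1) / tick_us;	/* round up */
	return (ticks + 1) * tick_us;			/* plus one tick */
}
```

With hz = 1000 this gives 2000 for requests of 1 to 1000 us and 3000 for
1001 to 2000 us, as in the FBSD select column above; with hz = 100 a
request of 1 us gives 20000, close to the 19900-19993 measured here.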

> I should also comment that these are average values on an otherwise
> idle system -- i will try to post a histogram of the actual values,
> it might well be that osx and linux have quantized values very
> different from the average (though this would violate the specs,
> so i suspect instead that they have some cheap one-shot timers).
>
> For FreeBSD I have also rounded the bsd values (actual averages are -1/+3us
> over 1sec experiments).

Oh.  The jitter is of minor interest, and rounding to usec should show
an average of slightly less than the timeout rounded up to ticks (on
an unloaded system).

Bakul Shah confirmed that Linux now reprograms the timer.  It has to,
for a tickless kernel.  FreeBSD reprograms timers too.  I think you
can set HZ large and only get timeout interrupts at that frequency if
there are active timeouts that need them.  Timeout granularity is still
1/HZ.

Hmm, this may explain why you are getting exact n000's -- every time
you ask for a timeout, you get one n000 us later (on a near-idle machine
where nothing else is asking for many timeouts), while old kernels
give timeouts on perfectly periodic n000(+error) boundaries; now when
the syscall is made just after a boundary, the boundary for the timeout
is never a full n000 away.  There may be a lot of jitter for both, but
if the reprogramming of the timer when you ask for a new timeout is
too smart, then the jitter will average out to 0, giving perfect n000's.

Try running multiple sources of new timeouts.  I think a periodic
itimer should produce perfectly periodic ones with little overhead.
Then other timeouts should not change the periodicity or even
reprogram the timer.

Reprogramming on demand seems to give unwanted aperiodicity: you ask for
a delay of 1 and get 2000.  Suppose you actually want 2000, and actually
get it relative to the request time.  Then the timer must be interrupting
aperiodically, with an average period of 2000+(overhead time of say 2) 
possibly with large jitter.  So 500 of these take 1 second plus 1000 us,
plus any jitter (the jitter may be negative, but is most likely positive,
since when the process setting up the timeouts is preempted and nothing
else is setting them up, there may be a large additional delay).

I try to avoid this problem in my version of ping.  I try to send a packet
on every 1 second boundary.  Normal ping tries to send one 1 second after
the previous one, but it can't do this since it has overheads and gets
preempted.  With HZ=100 and rounding up and adding 1, the drift is likely
to be 20 msec every second or 2%.  This is quite a lot.  My version tries
to schedule a timeout that expires exactly 1 second after the previous
packet was sent, not 1 second after the current time.  It takes a simple
subtraction to determine the timeout to reach the next seconds boundary,
but determining the times to subtract seems to require an extra
gettimeofday() call.  I should use a periodic itimer and depend on it
actually being periodic.  The kernel must do similar things to keep
periodic itimers actually periodic after it reprograms timers.  There
may be a lot of jitter on each reprogramming, but this can be compensated
for on average.  OTOH, as for skewing clocks, the compensation shouldn't
go too fast in either direction.  This could get complicated.  I don't
know what -current actually does.
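
The idea of aiming at fixed boundaries instead of "one period after
whenever I woke up" can be sketched in userland as follows. This is an
illustration under assumptions, not the code from my ping: it uses
clock_gettime(CLOCK_MONOTONIC) and nanosleep(), and the names now_ns,
remaining_ns and run_periodic are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Read the monotonic clock in nanoseconds. */
static int64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

/*
 * Time left until the deadline one period after the previous deadline,
 * or 0 if that deadline has already passed.
 */
static int64_t
remaining_ns(int64_t prev_deadline, int64_t period, int64_t now)
{
	int64_t left;

	assert(period > 0);
	left = prev_deadline + period - now;
	return left > 0 ? left : 0;
}

/*
 * Call action() count times, aiming at start + k * period.  Overhead
 * and late wakeups shorten the next sleep instead of accumulating.
 * Returns the total elapsed time in nanoseconds.
 */
static int64_t
run_periodic(int count, int64_t period, void (*action)(void))
{
	int64_t start = now_ns();
	int64_t deadline = start;
	int i;

	for (i = 0; i < count; i++) {
		int64_t left = remaining_ns(deadline, period, now_ns());
		struct timespec ts;

		ts.tv_sec = left / NSEC_PER_SEC;
		ts.tv_nsec = left % NSEC_PER_SEC;
		nanosleep(&ts, NULL);
		deadline += period;	/* advance from the deadline, not from now */
		if (action != NULL)
			action();
	}
	return now_ns() - start;
}
```

The extra clock read before each sleep is the cost mentioned above; a
periodic itimer would avoid it if the kernel keeps the timer periodic.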

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 04:45:30 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 34542106564A
	for <arch@freebsd.org>; Thu,  1 Mar 2012 04:45:30 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail30.syd.optusnet.com.au (mail30.syd.optusnet.com.au
	[211.29.133.193])
	by mx1.freebsd.org (Postfix) with ESMTP id C21BB8FC0C
	for <arch@freebsd.org>; Thu,  1 Mar 2012 04:45:29 +0000 (UTC)
Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au
	(c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136])
	by mail30.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q214jB7Z030524
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 1 Mar 2012 15:45:16 +1100
Date: Thu, 1 Mar 2012 15:45:11 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20120301132806.O2255@besplex.bde.org>
Message-ID: <20120301143042.F2406@besplex.bde.org>
References: <20120229194042.GA10921@onelab2.iet.unipi.it>
	<20120301071145.O879@besplex.bde.org>
	<20120301012315.GB14508@onelab2.iet.unipi.it>
	<20120301132806.O2255@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org
Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX
X-List-Received-Date: Thu, 01 Mar 2012 04:45:30 -0000

On Thu, 1 Mar 2012, Bruce Evans wrote:

> ...
> Bakul Shah confirmed that Linux now reprograms the timer.  It has to,
> for a tickless kernel.  FreeBSD reprograms timers too.  I think you
> can set HZ large and only get timeout interrupts at that frequency if
> there are active timeouts that need them.  Timeout granularity is still
> 1/HZ.

I tried this in -current and in a 2008 -current with hz=10000.  It worked
mediocrely:
- the 2008 version gave lapic cpuN: timer interrupts on all CPUs at
   frequency of almost exactly 10 kHz.  This is the behaviour before
   FreeBSD reprogrammed timers (except the frequency is often off by
   as much as 10% due to calibration bugs).  There were many anomalies
   in the results from the test program (like select() adding 199 usec
   and usleep() adding 999 usec).
- current gives cpu0: timer interrupts at a frequency of almost
   exactly 10115 Hz, but only when I watch it using systat over the
   network (10000 is Hz and the other 115 is presumably for reprogramming).
   The other CPU gets many fewer interrupts.  When I stop watching, the
   rates drop towards 9900 for cpu0 and 120 for cpu1.  I hoped that there
   would be only about 50 timer interrupts on the mostly-idle machine.
- timeout granularity according to the test program was better than
   expected.  In almost all cases, the timeout was xx99 us.  E.g., 1
   becomes 200 after rounding up and adding 1 tick, and the result is 199
   (since there was 1 us of overhead and no jitter).  1000 became 1099
   since rounding up didn't increase it.  This is almost better than
   the OtherOS results (since it has no jitter).  I can probably easily
   beat OtherOS by setting hz to 100000.  But I think no jitter is too
   good to be good.

This makes a design bug in poll() very clear.  poll() has a timeout
granularity of 1 ms, so you can't even ask for timeouts of less than
that.  Above 1 ms, the extra 99 or 199 us is good enough, and the default
of an extra 999 or 1999 us is not too bad.
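
The effect of the millisecond interface can be shown with two conversions
from a microsecond request to the int that poll() takes. The zeros in the
poll column of the table are consistent with a truncating conversion in
the test program, but that is an assumption since the program was not
posted; both function names are hypothetical.

```c
#include <assert.h>

/* Truncate: any request below 1000 us becomes 0 ms, i.e. no wait at all. */
static int
poll_ms_truncated(long timeout_us)
{
	assert(timeout_us >= 0);
	return (int)(timeout_us / 1000);
}

/* Round up: never shorter than requested, but up to 999 us longer. */
static int
poll_ms_rounded_up(long timeout_us)
{
	assert(timeout_us >= 0);
	return (int)((timeout_us + 999) / 1000);
}
```

Neither choice can express a 500 us timeout; the caller gets either 0 ms
or 1 ms before the kernel adds its own rounding to ticks.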

A tickless kernel should have the equivalent of HZ = 0 on idle machines
and the equivalent of HZ = huge when something uses lots of timeouts.
The latter gives some security problems.  You don't want to reprogram
timers every 500 nsec when some untrusted application asks for timeouts
of 1000 nsec even if the system can support it.  When APIs are fixed to
catch up with 1988's timespecs, it will be possible to ask for timeouts
of 1 nsec and never get them but waste a lot of cycles.  Scheduling is
not good enough to disfavour CPU hogs that do things on the nanoseconds
scale.

I just remembered that precise timeouts are just what is needed for
hiding from schedulers.  stathz was supposed to be significantly
aperiodic and larger than hz so that CPU hogs couldn't use timeouts
(based on hz) to hide from schedulers (based on stathz).  This was
never fully implemented in FreeBSD, and was broken many years ago.
In FreeBSD, stathz was normally 128 and aperiodic, and just a little
larger than hz which was normally 100.  But someone broke hz to
default to 1000.  CPU hogs can now not so easily hide from schedulers
by getting timeouts every millisecond and running for about 6 or 7
milliseconds, then sleeping for 2 or 1 millisecond to miss scheduler
ticks.  With larger hz, the hogs get more control.  E.g., HZ = 10000
lets them sleep for only 200 or 100 usec every 7.81 msec to miss
scheduler ticks.  Reprogramming of timers in -current probably gives
significant jitter to timeout boundaries.  This can be handled by
sleeping for a slightly wider interval.  Also, fine-grained timeouts
allow simpler implementations of this: just wake up every
tick, and if you are close to a scheduler tick (which you can predict
since they are periodic), then go back to sleep for 1 timeout tick.
Since timeout ticks are short relative to scheduler ticks, you get
control again soon and then don't have to sleep again for many
timeout ticks.  No one cares about this because CPUs are now free :-).
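
The arithmetic behind these numbers, as a sketch: with stathz = 128 the
statistics clock period is 1/128 s, about 7.81 msec, and the share of CPU
that goes unaccounted depends on how short the sleep around each scheduler
tick can be made. The function names are made up for this illustration.

```c
#include <assert.h>

/* Period of the statistics clock in microseconds (7812.5 for stathz = 128). */
static double
stat_period_us(int stathz)
{
	assert(stathz > 0);
	return 1e6 / stathz;
}

/*
 * Fraction of each statclock period a process keeps the CPU if it
 * sleeps for sleep_us around every scheduler tick.
 */
static double
cpu_fraction(int stathz, double sleep_us)
{
	double period = stat_period_us(stathz);

	assert(sleep_us >= 0 && sleep_us <= period);
	return (period - sleep_us) / period;
}
```

Sleeping 2 msec per period (hz = 1000) leaves about 74% of the CPU to the
hog; sleeping 200 usec (hz = 10000) leaves about 97%.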

-current has related fixes and complications in new timer code.  Even
without malicious CPU hogs, basing statclock and hardclock on the
same lapic timer made them too synchronous with each other.  The
quick fix was to use the i8254 again.  This gave a small amount
of asynchronicity which was apparently enough to fix the non-
malicious case.  I didn't like this, and tried to generate some
fake asynchronicity from a single lapic timer.  I think it is
possible to fake it well enough for the non-malicious case.  No
one followed up on this.  I haven't followed later developments.

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 05:42:48 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2C4E0106564A
	for <arch@FreeBSD.org>; Thu,  1 Mar 2012 05:42:48 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au
	[211.29.132.184])
	by mx1.freebsd.org (Postfix) with ESMTP id BACF68FC14
	for <arch@FreeBSD.org>; Thu,  1 Mar 2012 05:42:47 +0000 (UTC)
Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au
	(c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136])
	by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q215gD7w009742
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Thu, 1 Mar 2012 16:42:44 +1100
Date: Thu, 1 Mar 2012 16:42:13 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20120301143042.F2406@besplex.bde.org>
Message-ID: <20120301161011.A2654@besplex.bde.org>
References: <20120229194042.GA10921@onelab2.iet.unipi.it>
	<20120301071145.O879@besplex.bde.org>
	<20120301012315.GB14508@onelab2.iet.unipi.it>
	<20120301132806.O2255@besplex.bde.org>
	<20120301143042.F2406@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@FreeBSD.org
Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX
X-List-Received-Date: Thu, 01 Mar 2012 05:42:48 -0000

On Thu, 1 Mar 2012, Bruce Evans wrote:

> On Thu, 1 Mar 2012, Bruce Evans wrote:
>
>> ...
>> Bakul Shah confirmed that Linux now reprograms the timer.  It has to,
>> for a tickless kernel.  FreeBSD reprograms timers too.  I think you
>> can set HZ large and only get timeout interrupts at that frequency if
>> there are active timeouts that need them.  Timeout granularity is still
>> 1/HZ.
>
> I tried this in -current and in a 2008 -current with hz=10000.  It worked
> mediocrely:
> - the 2008 version gave lapic cpuN: timer interrupts on all CPUs at
>  frequency of almost exactly 10 kHz.  This is the behaviour before
>  FreeBSD reprogrammed timers (except the frequency is often off by
>  as much as 10% due to calibration bugs).  There were many anomalies
>  in the results from the test program (like select() adding 199 usec
>  and usleep() adding 999 usec).
> - [... no surprises in -current]

I tried this in -current with hz=100000.  This gives (some not very
surprising) behaviour:
- systat claims ~100% idle, but the ~100k interrupts on 1 CPU actually
   reduces performance by 33% (two CPUs take 30 seconds user time to
   do what can be done in 20 seconds user time with hz=100).  This is
   a normal problem with fast interrupt handlers.  They need a faster
   interrupt handler to account for them properly.
- ./prog 1 select works reasonably.  It reports timeouts of 29-30 us.
   I expected 19-20.
- ./prog 1 poll is broken as we know.  It asks for timeouts of 0 and
   takes 3 us.
- ./prog 1 usleep shows brokenness.  It reports timeouts of 999 us.
   I think this is due to getnanouptime()'s brokenness.
   $(sysctl kern.timecounter.tick) is 100.  This reduces getnanouptime()'s
   accuracy back to 1 msec, which explains the 999 us.  But why doesn't
   select() have the same problem?  select() uses getmicrouptime(), but
   it has the same brokenness.  The sysctl is r/o, so I couldn't use
   it easily.  I have changed tc_tick using ddb before, but don't want
   to risk reducing it by a factor of 100.  The timecounter update
   algorithm depends on the timehands not being recycled too fast, and
   probably couldn't cope with recycling 100 times faster.
- ./prog 1000 select and ./prog 1000 poll take 20 us extra.  I expected
   9-10 extra.
- ./prog 1000 usleep takes 619-693 us extra.  Not the full extra 100
   ticks from getnanouptime() fuzziness now.
- ./prog 500000 usleep takes 500026-500885 us.  Even higher variance
   which agrees with the fuzziness better.  select and poll with this
   timeout still have accuracy and low variance (21-26 us extra).

The fuzzy versions are actually useful for optimization after all:
- for long timeouts, use the fuzzy versions and accept their inaccuracies.
   Sleep longer by the amount of fuzziness so that sleeps are never too
   short.
- for short timeouts, it seems necessary for the initial timestamp to
   be accurate.  When checking if the timeout has expired, first try a
   fuzzy check.  This is sufficient if the current fuzzy time is far from
   the expiry time.
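
The two-step check for short timeouts could look like the sketch below.
It assumes the fuzzy clock only ever lags the true time, by at most
fuzz_us; the names timeout_expired and demo_precise_now_us are
hypothetical, and the demo clock returns a fixed value purely so that the
example is self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for an expensive, accurate clock; counts how often it is read. */
static int precise_reads;

static int64_t
demo_precise_now_us(void)
{
	precise_reads++;
	return 1500;	/* fixed value, for demonstration only */
}

/*
 * fuzzy_now_us may lag the true time by up to fuzz_us, so the true time
 * lies in [fuzzy_now_us, fuzzy_now_us + fuzz_us].  Only consult the
 * precise clock when that interval straddles the expiry time.
 */
static bool
timeout_expired(int64_t fuzzy_now_us, int64_t fuzz_us, int64_t expiry_us,
    int64_t (*precise_now_us)(void))
{
	assert(fuzz_us >= 0 && precise_now_us != NULL);
	if (fuzzy_now_us >= expiry_us)
		return true;		/* even the lagging clock is past expiry */
	if (fuzzy_now_us + fuzz_us < expiry_us)
		return false;		/* cannot have expired yet */
	return precise_now_us() >= expiry_us;	/* too close to call */
}
```

In the first two cases the cheap timestamp settles the question and the
precise clock is never read.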

Bruce

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 11:46:55 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 0AD59106566C;
	Thu,  1 Mar 2012 11:46:55 +0000 (UTC)
	(envelope-from gleb.kurtsou@gmail.com)
Received: from mail-bk0-f54.google.com (mail-bk0-f54.google.com
	[209.85.214.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 493D58FC16;
	Thu,  1 Mar 2012 11:46:54 +0000 (UTC)
Received: by bkcjc3 with SMTP id jc3so505027bkc.13
	for <multiple recipients>; Thu, 01 Mar 2012 03:46:53 -0800 (PST)
Received-SPF: pass (google.com: domain of gleb.kurtsou@gmail.com designates
	10.112.10.169 as permitted sender) client-ip=10.112.10.169; 
Authentication-Results: mr.google.com;
	spf=pass (google.com: domain of gleb.kurtsou@gmail.com
	designates 10.112.10.169 as permitted sender)
	smtp.mail=gleb.kurtsou@gmail.com;
	dkim=pass header.i=gleb.kurtsou@gmail.com
Received: from mr.google.com ([10.112.10.169])
	by 10.112.10.169 with SMTP id j9mr2285243lbb.70.1330602413304 (num_hops
	= 1); Thu, 01 Mar 2012 03:46:53 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=date:from:to:cc:subject:message-id:references:mime-version
	:content-type:content-disposition:in-reply-to:user-agent;
	bh=YwAEYHPwri+M8qawFfwO1wxQGn4Fyk+2Z3ik2ui6j6c=;
	b=wUahYKlnXH/SmwD/aCBXzcM6OZy2jpQoAcUITlSyAelIeLK3igpYopnu6B/2pxqfj8
	rYxij/kWrEKmAHNToItBzDSf6eEKfuIKjQ77cl23pXkLp0PHtIRk5iUxUuSuzbV/+Z7S
	kaaAARi8zt/DvRoMtjx7qzDv5PmbXrVV9+HgQ=
Received: by 10.112.10.169 with SMTP id j9mr1820289lbb.70.1330600584015;
	Thu, 01 Mar 2012 03:16:24 -0800 (PST)
Received: from localhost ([78.157.92.5])
	by mx.google.com with ESMTPS id b3sm2460510lby.7.2012.03.01.03.16.22
	(version=SSLv3 cipher=OTHER); Thu, 01 Mar 2012 03:16:22 -0800 (PST)
Date: Thu, 1 Mar 2012 13:16:24 +0200
From: Gleb Kurtsou <gleb.kurtsou@gmail.com>
To: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Message-ID: <20120301111624.GB30991@reks>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20120225194630.GI1344@garage.freebsd.pl>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: Attilio Rao <attilio@freebsd.org>,
	Konstantin Belousov <kostikbel@gmail.com>, arch@freebsd.org
Subject: Re: Prefaulting for i/o buffers
X-List-Received-Date: Thu, 01 Mar 2012 11:46:55 -0000

On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
> On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
> > On 25 February 2012 16:13, Pawel Jakub Dawidek <pjd@freebsd.org> wrote:
> > > My personal opinion about rangelocks and many other VFS features we
> > > currently have is that they are a good idea in theory, but in practice
> > > they tend to overcomplicate VFS.
> > >
> > > I'm of the opinion that we should move as much stuff as we can to
> > > individual file systems. We try to implement everything in VFS itself in
> > > the hope that this will simplify the file systems we have. It then turns
> > > out that only one file system is really using this stuff (most of the
> > > time it is UFS), and this is a PITA for all the other file systems as
> > > well as for maintaining VFS. VFS has become so complicated over the years
> > > that there are maybe only a few people who can understand it, and every
> > > single change to VFS carries a huge risk of breaking some unrelated
> > > parts.
> > 
> > I think this is questionable for the following reasons:
> > - If the problem is file system writers having trouble understanding
> > the necessary locking, we should really provide cleaner and more
> > complete documentation. One could say the same of our VM subsystem,
> > but at least in that case there are plenty of comments that help in
> > understanding how to deal with vm_object and vm_page locking during
> > their lifetimes.
> 
> Documentation is not the answer here. If the code is this complex, it is
> harder to learn no matter how good the documentation is; it makes fewer
> people willing to learn it in the first place, and it makes the code more
> buggy, because there are more edge and special cases you can forget about.
> 
> > - Our primitives may be more complicated than the
> > 'all-in-the-filesystem' one, but at least they offer a complete and
> > centralized view over the resources we have allocated in the whole
> > system and they allow building better policies about how to manage
> > them. One problem I see here is that those policies are not fully
> > implemented or tuned, or have simply become outdated, which removes one
> > of the biggest benefits we get from making vnodes so generic.
> 
> Again, this is only a nice theory that is far from reality.
> You will never be able to control all the resources allocated by
> file systems.
> 
> > About the thing I mentioned myself:
> > - As long as the same path now has both range-locking and vnode
> > locking, I don't see it as a good idea to keep them separate forever.
> > Merging them seems to me an important evolution: it would not only
> > help shrink the number of primitives but also introduce less overhead
> > and likely improve scalability for vnodes (though I think this needs
> > a deep investigation).
> > - About ZFS rangelocks absorbing the VFS ones, I think this is a minor
> > point, but still, if you think it can be done efficiently and without
> > losing performance, I don't see why not do that. You already wrote
> > rangelocks for ZFS, so you have gained a lot of experience in this
> > area and can comment on the fallout, etc., but I don't see a good reason
> > not to do that, unless it is just too difficult. This is not about
> > generalizing a new mechanism; it is about using a general mechanism in
> > a specific implementation, if possible.
> 
> I did not implement rangelocking for ZFS. It came with ZFS when I ported
> it. As long as we want to merge changes from upstream (which is now
> IllumOS), we don't want to make huge changes just for the sake of proving
> that this is a general-purpose mechanism used by more than one file system.
> 
> Attilio, don't get me wrong. In 99% of cases it is good to make code more
> general and more universal and reusable, but we can't ignore reality.
> 
> There are reasons why file systems like XFS, ReiserFS and others were
> never fully ported. I'm not saying VFS complexity was the only reason,
> but I'm sure it was one of them.
> 
> Our VFS is very UFS-centric. We make so many assumptions that sound
> fine only for UFS. I saw plenty of those while working on ZFS, like:
> 
> - "Every file system needs a cache. Let's make it general, so that all
>   file systems can use it!" Well, for VFS each file system is a separate
>   entity, which is not the case for ZFS. ZFS can cache just once a block
>   that is used by one file system, 10 clones and 100 snapshots, which are
>   all separate mount points from the VFS perspective.
>   The same block would be cached 111 times by the buffer cache.

Hmm. But this one is optional. Use vop_cachedlookup (or call
cache_enter() on your own), and add a number of cache_purge calls. It's
pretty much the library-like design you describe below.

> 
> - "rmdir(2) on a mountpoint is a bad idea, let's deny it at the VFS level."
>   It is a bad idea, indeed, but in ZFS it is a nice way to remove a snapshot
>   by rmdiring the .zfs/snapshot/<name> directory.
> 
> - No one implemented rangelocking in VFS, so no file system can use it,
>   even if the given file system has all the code to do it.
> 
> etc.
> 
> I'm also sure it would be way easier for Jeff to make VFS MP-safe if it
> were less complex.

Everybody agrees that VFS needs more care. But there haven't been many
concrete suggestions, or at least there is no VFS TODO list.

> When looking at the big picture, it would be nice to have all this
> general stuff like rangelocking, quota, buffer cache, etc. as some kind
> of libraries for file systems to use and not something that is
> mandatory. If I develop a file system for FreeBSD only and I don't want
> to reinvent the wheel, I can use those libraries. If I port a file system
> to FreeBSD or develop a file system that doesn't really need those
> libraries, I'm not forced to use them.

Are you aware of a real "libraries for file systems" VFS example? It
sounds very interesting but I'm afraid it's going to look good only in
theory. E.g. locking at the file system level (Darwin, DragonFly BSD) looks
rather messy (IMHO) and more likely to be bug-prone. On the other hand,
Linux has an optional per-file-system rename lock, making a VOP_RENAME
implementation much easier, while ours is tremendously difficult to get
right.

> All this might make a good working group subject at BSDCan devsummit.
> We could cross swords there:)

Unfortunately, I'm afraid I won't make it there either, and will most
likely miss EuroBSD/MeetBSD 2012 in Warsaw as well. I have a number of
fresh ideas about namecache I'd love to discuss. What do you think about
organising a preliminary group meeting on fs@ or arch@? :)

> 
> -- 
> Pawel Jakub Dawidek                       http://www.wheelsystems.com
> FreeBSD committer                         http://www.FreeBSD.org
> Am I Evil? Yes, I Am!                     http://tupytaj.pl



From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 14:10:14 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 38250106564A;
	Thu,  1 Mar 2012 14:10:14 +0000 (UTC) (envelope-from
	BATV+c43791b1943af153f85b+3112+infradead.org+hch@bombadil.srs.infradead.org)
Received: from bombadil.infradead.org (bombadil.infradead.org
	[IPv6:2001:4830:2446:ff00:4687:fcff:fea6:5117])
	by mx1.freebsd.org (Postfix) with ESMTP id D66448FC12;
	Thu,  1 Mar 2012 14:10:12 +0000 (UTC)
Received: from hch by bombadil.infradead.org with local (Exim 4.76 #1 (Red Hat
	Linux)) id 1S36hu-0004G6-VV; Thu, 01 Mar 2012 14:10:11 +0000
Date: Thu, 1 Mar 2012 09:10:10 -0500
From: Christoph Hellwig <hch@infradead.org>
To: Gleb Kurtsou <gleb.kurtsou@gmail.com>
Message-ID: <20120301141010.GA7079@infradead.org>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120301111624.GB30991@reks>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-SRS-Rewrite: SMTP reverse-path rewritten from <hch@infradead.org> by
	bombadil.infradead.org See http://www.infradead.org/rpr.html
Cc: Attilio Rao <attilio@freebsd.org>,
	Konstantin Belousov <kostikbel@gmail.com>,
	Pawel Jakub Dawidek <pjd@FreeBSD.org>, arch@freebsd.org
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 14:10:14 -0000

On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
> Are you aware of a real "libraries for file systems" VFS example? It
> sounds very interesting but I'm afraid it's going to look good only in
> theory. E.g. locking at file system level (Darwin, Dragonfly BSD) looks
> rather messy (IMHO) and more likely to be bug prone. On the other side
> Linux has optional per file system rename lock making VOP_RENAME
> implementation much easier, while ours is tremendously difficult to do
> right.

All namespace locking in Linux is in the VFS, and it is mandatory.  A
filesystem-wide lock is only used for cross-directory renames.

A more detailed description is here:

	http://git.kernel.dk/?p=linux.git;a=blob;f=Documentation/filesystems/directory-locking
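
As a minimal user-space sketch of the rule stated above (not the kernel
code), assume hypothetical `struct fs` and `struct dir` types, with pthread
mutexes standing in for the kernel locks. A same-directory rename takes only
the parent's lock. A cross-directory rename first takes the filesystem-wide
lock, which keeps the tree topology stable while the ancestor check runs,
and then locks the two parents, ancestor first. The order used for unrelated
directories (source, then target) is a simplification chosen for this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a superblock and a directory. */
struct fs {
	pthread_mutex_t rename_lock;	/* filesystem-wide; cross-directory renames only */
};

struct dir {
	struct fs *fs;
	struct dir *parent;		/* NULL for the root directory */
	pthread_mutex_t lock;		/* per-directory lock */
};

/* Return true if 'a' is a proper ancestor of 'd'. */
static bool
is_ancestor(const struct dir *a, const struct dir *d)
{
	for (d = d->parent; d != NULL; d = d->parent)
		if (d == a)
			return (true);
	return (false);
}

/*
 * Lock the parent directories involved in a rename.
 * Returns true if the filesystem-wide lock was taken.
 */
static bool
rename_lock_dirs(struct dir *src, struct dir *dst)
{
	assert(src->fs == dst->fs);	/* renames never cross filesystems */

	if (src == dst) {
		/* Same-directory rename: the parent's lock is enough. */
		pthread_mutex_lock(&src->lock);
		return (false);
	}

	/* Cross-directory rename: serialize against other such renames. */
	pthread_mutex_lock(&src->fs->rename_lock);
	if (is_ancestor(dst, src)) {
		pthread_mutex_lock(&dst->lock);	/* ancestor first */
		pthread_mutex_lock(&src->lock);
	} else {
		pthread_mutex_lock(&src->lock);
		pthread_mutex_lock(&dst->lock);
	}
	return (true);
}

/* Undo rename_lock_dirs(); 'fs_locked' is its return value. */
static void
rename_unlock_dirs(struct dir *src, struct dir *dst, bool fs_locked)
{
	pthread_mutex_unlock(&src->lock);
	if (src != dst)
		pthread_mutex_unlock(&dst->lock);
	if (fs_locked)
		pthread_mutex_unlock(&src->fs->rename_lock);
}
```

The point of the sketch is only that the filesystem-wide lock is never
touched on the common same-directory path.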


From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 14:14:07 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id E0D921065672;
	Thu,  1 Mar 2012 14:14:07 +0000 (UTC)
	(envelope-from pawel@dawidek.net)
Received: from mail.dawidek.net (60.wheelsystems.com [83.12.187.60])
	by mx1.freebsd.org (Postfix) with ESMTP id 4FE0D8FC0A;
	Thu,  1 Mar 2012 14:14:07 +0000 (UTC)
Received: from localhost (58.wheelsystems.com [83.12.187.58])
	by mail.dawidek.net (Postfix) with ESMTPSA id 19F6C12D;
	Thu,  1 Mar 2012 15:14:05 +0100 (CET)
Date: Thu, 1 Mar 2012 15:12:47 +0100
From: Pawel Jakub Dawidek <pjd@FreeBSD.org>
To: Gleb Kurtsou <gleb.kurtsou@gmail.com>
Message-ID: <20120301141247.GE1336@garage.freebsd.pl>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="qp4W5+cUSnZs0RIF"
Content-Disposition: inline
In-Reply-To: <20120301111624.GB30991@reks>
X-OS: FreeBSD 10.0-CURRENT amd64
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: Attilio Rao <attilio@freebsd.org>,
	Konstantin Belousov <kostikbel@gmail.com>, arch@freebsd.org
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 14:14:08 -0000


--qp4W5+cUSnZs0RIF
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
> > - "Every file system needs cache. Let's make it general, so that all file
> >   systems can use it!" Well, for VFS each file system is a separate
> >   entity, which is not the case for ZFS. ZFS can cache one block only
> >   once that is used by one file system, 10 clones and 100 snapshots,
> >   which all are separate mount points from VFS perspective.
> >   The same block would be cached 111 times by the buffer cache.
>
> Hmm. But this one is optional. Use vop_cachedlookup (or call
> cache_entry() on your own), add a number of cache_prune calls. It's
> pretty much library-like design you describe below.

Yes, namecache is already library-like, but I was talking about the
buffer cache. I managed to bypass it eventually with suggestions from
ups@, but for a long time I was sure it isn't at all possible.
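
The "111 times" arithmetic quoted above (1 file system + 10 clones + 100
snapshots) can be illustrated with a toy model. This is a sketch under
stated assumptions, not the buffer cache or the ZFS cache: `struct cache`
and `copies_cached()` are hypothetical names, and the only thing modelled is
whether the cache key includes the mount point.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 256

/* Hypothetical cache key: which mount asked, and for which block. */
struct key {
	uint32_t mount;
	uint64_t blkno;
};

struct cache {
	struct key keys[MAX_ENTRIES];
	size_t n;
};

/* Insert a key unless an identical one is already cached. */
static void
cache_insert(struct cache *c, uint32_t mount, uint64_t blkno)
{
	for (size_t i = 0; i < c->n; i++)
		if (c->keys[i].mount == mount && c->keys[i].blkno == blkno)
			return;
	if (c->n < MAX_ENTRIES)
		c->keys[c->n++] = (struct key){ mount, blkno };
}

/*
 * Read the same on-disk block through 'nmounts' mount points and return
 * how many cache entries result.  With per_mount != 0 the mount is part
 * of the key (one copy per mount point); otherwise the key is pool-wide
 * and the block is cached once.
 */
static size_t
copies_cached(size_t nmounts, int per_mount)
{
	struct cache c = { .n = 0 };

	for (size_t m = 0; m < nmounts; m++)
		cache_insert(&c, per_mount ? (uint32_t)m : 0, 42);
	return (c.n);
}
```

With 1 + 10 + 100 mount points, `copies_cached(111, 1)` yields 111 entries
and `copies_cached(111, 0)` yields 1.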

> Everybody agrees that VFS needs more care. But there haven't been much
> of concrete suggestions or at least there is no VFS TODO list.

Everybody agrees on that, true, but we disagree on the direction in which
we should move our VFS, i.e. make it more light-weight vs. more heavy-weight.

> > When looking at the big picture, it would be nice to have all this
> > general stuff like rangelocking, quota, buffer cache, etc. as some kind
> > of libraries for file systems to use and not something that is
> > mandatory. If I develop a file system for FreeBSD only and I don't want
> > to reinvent the wheel, I can use those libraries. If I port file system
> > to FreeBSD or develop a file system that doesn't really need those
> > libraries I'm not forced to use them.
>
> Are you aware of a real "libraries for file systems" VFS example? It
> sounds very interesting but I'm afraid it's going to look good only in
> theory. E.g. locking at file system level (Darwin, Dragonfly BSD) looks
> rather messy (IMHO) and more likely to be bug prone. On the other side
> Linux has optional per file system rename lock making VOP_RENAME
> implementation much easier, while ours is tremendously difficult to do
> right.

There are not many examples of such libraries, but the namecache is one
of them. Things like rangelocking definitely look like good candidates
to be made into a library.

> > All this might make a good working group subject at BSDCan devsummit.
> > We could cross swords there:)
>
> Unfortunately I'm afraid I won't make there too. And most likely will
> miss EuroBSD/MeetBSD 2012 in Warsaw as well. I have a number of fresh
> ideas about namecache I'd love to discuss. What do you think about
> organising preliminary group meeting on fs@ or arch@? :)

Sounds good. Both forums seem suitable, just pick one.

--=20
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://tupytaj.pl

--qp4W5+cUSnZs0RIF
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (FreeBSD)

iEYEARECAAYFAk9Pg98ACgkQForvXbEpPzSpdACfdxehVqvpgF/3wXtT3OJCIw0Z
GOMAoKlqRr5LjBU7koitFf+7VGbMC6z+
=4IE/
-----END PGP SIGNATURE-----

--qp4W5+cUSnZs0RIF--

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 14:16:02 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E136D106564A;
	Thu,  1 Mar 2012 14:16:02 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200])
	by mx1.freebsd.org (Postfix) with ESMTP id 33CD68FC15;
	Thu,  1 Mar 2012 14:16:01 +0000 (UTC)
Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q21EFsRB090647;
	Thu, 1 Mar 2012 16:15:54 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id
	q21EFrpx074450; Thu, 1 Mar 2012 16:15:53 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q21EFrg1074449; 
	Thu, 1 Mar 2012 16:15:53 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to
	kostikbel@gmail.com using -f
Date: Thu, 1 Mar 2012 16:15:53 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Message-ID: <20120301141553.GT55074@deviant.kiev.zoral.com.ua>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
	<20120301141247.GE1336@garage.freebsd.pl>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="nlOp58TzLcTjnOVc"
Content-Disposition: inline
In-Reply-To: <20120301141247.GE1336@garage.freebsd.pl>
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	skuns.kiev.zoral.com.ua
Cc: Attilio Rao <attilio@FreeBSD.org>, arch@FreeBSD.org,
	Gleb Kurtsou <gleb.kurtsou@gmail.com>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 14:16:03 -0000


--nlOp58TzLcTjnOVc
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 01, 2012 at 03:12:47PM +0100, Pawel Jakub Dawidek wrote:
> On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
> > On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
> > > - "Every file system needs cache. Let's make it general, so that all file
> > >   systems can use it!" Well, for VFS each file system is a separate
> > >   entity, which is not the case for ZFS. ZFS can cache one block only
> > >   once that is used by one file system, 10 clones and 100 snapshots,
> > >   which all are separate mount points from VFS perspective.
> > >   The same block would be cached 111 times by the buffer cache.
> >=20
> > Hmm. But this one is optional. Use vop_cachedlookup (or call
> > cache_entry() on your own), add a number of cache_prune calls. It's
> > pretty much library-like design you describe below.
>=20
> Yes, namecache is already library-like, but I was talking about the
> buffer cache. I managed to bypass it eventually with suggestions from
> ups@, but for a long time I was sure it isn't at all possible.

I am quite curious: in which way is the buffer layer mandatory?

--nlOp58TzLcTjnOVc
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAk9PhJkACgkQC3+MBN1Mb4gPEACgz/9StyTUKFfToGQFVaUgJWpq
SI8An0aCnA/fz8EySQ7u1IrO3JxLSIRr
=4S1J
-----END PGP SIGNATURE-----

--nlOp58TzLcTjnOVc--

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 14:28:45 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id E5CBD1065670;
	Thu,  1 Mar 2012 14:28:45 +0000 (UTC)
	(envelope-from pawel@dawidek.net)
Received: from mail.dawidek.net (60.wheelsystems.com [83.12.187.60])
	by mx1.freebsd.org (Postfix) with ESMTP id 8F5318FC08;
	Thu,  1 Mar 2012 14:28:45 +0000 (UTC)
Received: from localhost (58.wheelsystems.com [83.12.187.58])
	by mail.dawidek.net (Postfix) with ESMTPSA id 169C413C;
	Thu,  1 Mar 2012 15:28:44 +0100 (CET)
Date: Thu, 1 Mar 2012 15:27:27 +0100
From: Pawel Jakub Dawidek <pjd@FreeBSD.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Message-ID: <20120301142726.GF1336@garage.freebsd.pl>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
	<20120301141247.GE1336@garage.freebsd.pl>
	<20120301141553.GT55074@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="KR/qxknboQ7+Tpez"
Content-Disposition: inline
In-Reply-To: <20120301141553.GT55074@deviant.kiev.zoral.com.ua>
X-OS: FreeBSD 10.0-CURRENT amd64
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: Attilio Rao <attilio@FreeBSD.org>, arch@FreeBSD.org,
	Gleb Kurtsou <gleb.kurtsou@gmail.com>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 14:28:46 -0000


--KR/qxknboQ7+Tpez
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 01, 2012 at 04:15:53PM +0200, Konstantin Belousov wrote:
> On Thu, Mar 01, 2012 at 03:12:47PM +0100, Pawel Jakub Dawidek wrote:
> > Yes, namecache is already library-like, but I was talking about the
> > buffer cache. I managed to bypass it eventually with suggestions from
> > ups@, but for a long time I was sure it isn't at all possible.
>
> I am quite curious, in which way buffer layer is mandatory ?

As I said, it is not, but it took me a while to figure it out.
I remember having massive problems when I was working on getting mmapped
reads/writes right while bypassing the buffer cache and talking to the
page cache directly. I don't think there was a single example in the tree
at that time showing that it could be done. Currently tmpfs is using
the same approach as ZFS, AFAIK.

--=20
Pawel Jakub Dawidek                       http://www.wheelsystems.com
FreeBSD committer                         http://www.FreeBSD.org
Am I Evil? Yes, I Am!                     http://tupytaj.pl

--KR/qxknboQ7+Tpez
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (FreeBSD)

iEYEARECAAYFAk9Ph04ACgkQForvXbEpPzS4gwCgiqSLlzrJ2LRC4FHPSOVsjCQd
ZbwAn1yCaWUq3kik4zzQ+ClcPCQsUpbk
=LM1U
-----END PGP SIGNATURE-----

--KR/qxknboQ7+Tpez--

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 14:32:36 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 48764106564A;
	Thu,  1 Mar 2012 14:32:35 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com
	[209.85.215.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 82FA68FC15;
	Thu,  1 Mar 2012 14:32:34 +0000 (UTC)
Received: by lagv3 with SMTP id v3so1128172lag.13
	for <multiple recipients>; Thu, 01 Mar 2012 06:32:33 -0800 (PST)
Received-SPF: pass (google.com: domain of asmrookie@gmail.com designates
	10.152.130.234 as permitted sender) client-ip=10.152.130.234; 
Authentication-Results: mr.google.com;
	spf=pass (google.com: domain of asmrookie@gmail.com
	designates 10.152.130.234 as permitted sender)
	smtp.mail=asmrookie@gmail.com;
	dkim=pass header.i=asmrookie@gmail.com
Received: from mr.google.com ([10.152.130.234])
	by 10.152.130.234 with SMTP id oh10mr5287243lab.35.1330612353335
	(num_hops = 1); Thu, 01 Mar 2012 06:32:33 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type;
	bh=CF49K3DG02gnM8s0VOwA8l3F/1p+FTtEU/X2ha2/Gl0=;
	b=XdzsmARVScG7NzwxZEDa4bbrIMotavnW0JUFPi1dkuEiEAnIhGsibGGkGbZi3WcR/k
	PXHqi8ew3amXf8w+Zn/AjyeyGZ0jkkIowWeio8yp8AWLl82N/cQ1NBgKNpQjk/sahuJ5
	Rm2Qm1Eza0V9nmtmdhxKz/vx/cHR4S2vxtYBE=
MIME-Version: 1.0
Received: by 10.152.130.234 with SMTP id oh10mr4299652lab.35.1330612353193;
	Thu, 01 Mar 2012 06:32:33 -0800 (PST)
Sender: asmrookie@gmail.com
Received: by 10.112.41.5 with HTTP; Thu, 1 Mar 2012 06:32:33 -0800 (PST)
In-Reply-To: <20120301141247.GE1336@garage.freebsd.pl>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
	<20120301141247.GE1336@garage.freebsd.pl>
Date: Thu, 1 Mar 2012 14:32:33 +0000
X-Google-Sender-Auth: W8QAl2NJ7vStiNdX2HLNoulbfYk
Message-ID: <CAJ-FndCSPHLGqkeTC6qiitap_zjgLki+8HWta-UxReVvntA9=g@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: Pawel Jakub Dawidek <pjd@freebsd.org>
Content-Type: text/plain; charset=UTF-8
Cc: Konstantin Belousov <kostikbel@gmail.com>, arch@freebsd.org,
	Gleb Kurtsou <gleb.kurtsou@gmail.com>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 14:32:36 -0000

2012/3/1, Pawel Jakub Dawidek <pjd@freebsd.org>:
> On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
>> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
>> > - "Every file system needs cache. Let's make it general, so that all
>> > file
>> >   systems can use it!" Well, for VFS each file system is a separate
>> >   entity, which is not the case for ZFS. ZFS can cache one block only
>> >   once that is used by one file system, 10 clones and 100 snapshots,
>> >   which all are separate mount points from VFS perspective.
>> >   The same block would be cached 111 times by the buffer cache.
>>
>> Hmm. But this one is optional. Use vop_cachedlookup (or call
>> cache_entry() on your own), add a number of cache_prune calls. It's
>> pretty much library-like design you describe below.
>
> Yes, namecache is already library-like, but I was talking about the
> buffer cache. I managed to bypass it eventually with suggestions from
> ups@, but for a long time I was sure it isn't at all possible.

Can you please clarify this? I really don't understand what you mean.

>
>> Everybody agrees that VFS needs more care. But there haven't been much
>> of concrete suggestions or at least there is no VFS TODO list.
>
> Everybody agrees on that, true, but we disagree on the direction we
> should move our VFS, ie. make it more light-weight vs. more heavy-weight.

All I'm saying (and Gleb too) is that I don't see any benefit in
replicating the whole vnode lifecycle at the inode level and in the
filesystem-specific implementation.
I don't see a simplification in the work to do, and I don't think this is
going to be simpler for a single specific filesystem (not to mention
the legacy support, which means re-implementing inode handling
for every filesystem we have now); we just lose generality.

If you want a good example of a VFS primitive that was really
UFS-centric and was mistakenly made generic, it is vn_start_write() and
its siblings. I guess it was introduced just to cater to UFS snapshot
creation and then it poisoned other consumers.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 14:36:21 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 4955F1065675;
	Thu,  1 Mar 2012 14:36:21 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from mail-ey0-f182.google.com (mail-ey0-f182.google.com
	[209.85.215.182])
	by mx1.freebsd.org (Postfix) with ESMTP id 7AD078FC14;
	Thu,  1 Mar 2012 14:36:20 +0000 (UTC)
Received: by eaaf13 with SMTP id f13so222528eaa.13
	for <multiple recipients>; Thu, 01 Mar 2012 06:36:19 -0800 (PST)
Received-SPF: pass (google.com: domain of asmrookie@gmail.com designates
	10.112.27.199 as permitted sender) client-ip=10.112.27.199; 
Authentication-Results: mr.google.com;
	spf=pass (google.com: domain of asmrookie@gmail.com
	designates 10.112.27.199 as permitted sender)
	smtp.mail=asmrookie@gmail.com;
	dkim=pass header.i=asmrookie@gmail.com
Received: from mr.google.com ([10.112.27.199])
	by 10.112.27.199 with SMTP id v7mr2458137lbg.36.1330612579401 (num_hops
	= 1); Thu, 01 Mar 2012 06:36:19 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type;
	bh=eGvqhipt++SM96LhvBDNx+Ta1D3DuVoerEvDM1MMmlk=;
	b=QuQ3durY5OIXeavgi0K9HIuoa2VVgvIeqE7iOU5x/yjpUfTAIKOnq01m8bFlr3cZdn
	pixX2etJAtbihQGXybE6x4D68vWY9E3vRIai2hhG/RWLVf03gp0BJZ9i+ZAqFhQUwJ1Y
	35bvu+1d1PRfnelCv75ltxfm8CQGDSauFWCSY=
MIME-Version: 1.0
Received: by 10.112.27.199 with SMTP id v7mr2009638lbg.36.1330612579301; Thu,
	01 Mar 2012 06:36:19 -0800 (PST)
Sender: asmrookie@gmail.com
Received: by 10.112.41.5 with HTTP; Thu, 1 Mar 2012 06:36:19 -0800 (PST)
In-Reply-To: <20120301111624.GB30991@reks>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
Date: Thu, 1 Mar 2012 14:36:19 +0000
X-Google-Sender-Auth: a0TLwnCjBFXEHM_4CRt8Ad0AM1Q
Message-ID: <CAJ-FndAfQz5UvnMe3PaNKmjmUy08xLAm37W68HgX-UNkmH8t_Q@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: Gleb Kurtsou <gleb.kurtsou@gmail.com>
Content-Type: text/plain; charset=UTF-8
Cc: Konstantin Belousov <kostikbel@gmail.com>, arch@freebsd.org,
	Jeff Roberson <jeff@freebsd.org>, Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 14:36:21 -0000

2012/3/1, Gleb Kurtsou <gleb.kurtsou@gmail.com>:
> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:

[snip]

>> When looking at the big picture, it would be nice to have all this
>> general stuff like rangelocking, quota, buffer cache, etc. as some kind
>> of libraries for file systems to use and not something that is
>> mandatory. If I develop a file system for FreeBSD only and I don't want
>> to reinvent the wheel, I can use those libraries. If I port file system
>> to FreeBSD or develop a file system that doesn't really need those
>> libraries I'm not forced to use them.
>
> Are you aware of a real "libraries for file systems" VFS example? It
> sounds very interesting but I'm afraid it's going to look good only in
> theory. E.g. locking at file system level (Darwin, Dragonfly BSD) looks
> rather messy (IMHO) and more likely to be bug prone. On the other side
> Linux has optional per file system rename lock making VOP_RENAME
> implementation much easier, while ours is tremendously difficult to do
> right.

I think Jeff (CC'ed) fixed this (maybe only for UFS, I cannot recall
now), and he had a very good reason for not using the Linux approach,
which I don't recall now either.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 14:47:20 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 430CD106566C;
	Thu,  1 Mar 2012 14:47:20 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200])
	by mx1.freebsd.org (Postfix) with ESMTP id C5DB68FC19;
	Thu,  1 Mar 2012 14:47:19 +0000 (UTC)
Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q21El8Yv094699;
	Thu, 1 Mar 2012 16:47:08 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id
	q21El82I074728; Thu, 1 Mar 2012 16:47:08 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q21El8bh074727; 
	Thu, 1 Mar 2012 16:47:08 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to
	kostikbel@gmail.com using -f
Date: Thu, 1 Mar 2012 16:47:08 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Attilio Rao <attilio@freebsd.org>
Message-ID: <20120301144708.GV55074@deviant.kiev.zoral.com.ua>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
	<20120301141247.GE1336@garage.freebsd.pl>
	<CAJ-FndCSPHLGqkeTC6qiitap_zjgLki+8HWta-UxReVvntA9=g@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="u2jkDaBVK38P9/ME"
Content-Disposition: inline
In-Reply-To: <CAJ-FndCSPHLGqkeTC6qiitap_zjgLki+8HWta-UxReVvntA9=g@mail.gmail.com>
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	skuns.kiev.zoral.com.ua
Cc: arch@freebsd.org, Gleb Kurtsou <gleb.kurtsou@gmail.com>,
	Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 14:47:20 -0000


--u2jkDaBVK38P9/ME
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
> 2012/3/1, Pawel Jakub Dawidek <pjd@freebsd.org>:
> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
> >> > - "Every file system needs cache. Let's make it general, so that all
> >> > file
> >> >   systems can use it!" Well, for VFS each file system is a separate
> >> >   entity, which is not the case for ZFS. ZFS can cache one block only
> >> >   once that is used by one file system, 10 clones and 100 snapshots,
> >> >   which all are separate mount points from VFS perspective.
> >> >   The same block would be cached 111 times by the buffer cache.
> >>
> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
> >> cache_entry() on your own), add a number of cache_prune calls. It's
> >> pretty much library-like design you describe below.
> >
> > Yes, namecache is already library-like, but I was talking about the
> > buffer cache. I managed to bypass it eventually with suggestions from
> > ups@, but for a long time I was sure it isn't at all possible.
>
> Can you please clarify on this as I really don't understand what you mean?
>
> >
> >> Everybody agrees that VFS needs more care. But there haven't been much
> >> of concrete suggestions or at least there is no VFS TODO list.
> >
> > Everybody agrees on that, true, but we disagree on the direction we
> > should move our VFS, i.e. make it more light-weight vs. more heavy-weight.
>
> All I'm saying (and Gleb too) is that I don't see any benefit in
> replicating the whole vnode lifecycle at the inode level and in the
> filesystem-specific implementation.
> I don't see a simplification in the work to do, and I don't think this is
> going to be simpler for a single specific filesystem (not to mention
> the legacy support, which means re-implementing inode handling
> for every filesystem we have now); we just lose generality.
>
> If you want a good example of a VFS primitive that was really
> UFS-centric and was mistakenly made generic, it is vn_start_write() and
> its siblings. I guess it was introduced just to cater to UFS snapshot
> creation and then it poisoned other consumers.

vn_start_write() has nothing to do with filesystem code at all.
It is purely a VFS-layer operation, which shall not be called from fs
code at all. vn_start_secondary_write() is sometimes useful for the
filesystem itself.

Suspension (not snapshotting) is very useful and makes it possible to
avoid some nasty issues with unmounts, remounts or guaranteed syncing of
the filesystem. The fact that only UFS utilizes this functionality just
shows that other filesystem implementors do not care about this
correctness, or that other filesystems are not maintained.

--u2jkDaBVK38P9/ME
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAk9Pi+sACgkQC3+MBN1Mb4j+DQCgzNdcihDFivaI+KoVGwEIcmRX
LwMAnRAVHLgnFi+aeFHTTtPjRfwSLuQg
=dNA5
-----END PGP SIGNATURE-----

--u2jkDaBVK38P9/ME--

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 14:50:42 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 552651065670;
	Thu,  1 Mar 2012 14:50:42 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from mail-ee0-f54.google.com (mail-ee0-f54.google.com [74.125.83.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 9E3CD8FC0A;
	Thu,  1 Mar 2012 14:50:41 +0000 (UTC)
Received: by eekd17 with SMTP id d17so238954eek.13
	for <multiple recipients>; Thu, 01 Mar 2012 06:50:40 -0800 (PST)
Received-SPF: pass (google.com: domain of asmrookie@gmail.com designates
	10.112.9.34 as permitted sender) client-ip=10.112.9.34; 
Authentication-Results: mr.google.com;
	spf=pass (google.com: domain of asmrookie@gmail.com
	designates 10.112.9.34 as permitted sender)
	smtp.mail=asmrookie@gmail.com;
	dkim=pass header.i=asmrookie@gmail.com
Received: from mr.google.com ([10.112.9.34])
	by 10.112.9.34 with SMTP id w2mr2505416lba.50.1330613440553 (num_hops =
	1); Thu, 01 Mar 2012 06:50:40 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type;
	bh=mA9gX02h1E1X8c4klSuRIB+v++7vt1gKfpjb4ubkk7g=;
	b=lBj1Ms8fEZWLMyo4yzdIWwNkR+kmuPIxVq2bRgl2LMnjbiTPk6+pCw1lRvUEvFFZ6W
	4GDm0GMNB4VetI7eGcbQzbQtNpQ9MkILxnG+rEVeO49HnDox9tomIxM8Qe2+W0IWzo0b
	6bE7IARgvFvno7yFOOm4nd6vd+diSDZrNimUI=
MIME-Version: 1.0
Received: by 10.112.9.34 with SMTP id w2mr2039040lba.50.1330613440418; Thu, 01
	Mar 2012 06:50:40 -0800 (PST)
Sender: asmrookie@gmail.com
Received: by 10.112.41.5 with HTTP; Thu, 1 Mar 2012 06:50:40 -0800 (PST)
In-Reply-To: <20120301144708.GV55074@deviant.kiev.zoral.com.ua>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
	<20120301141247.GE1336@garage.freebsd.pl>
	<CAJ-FndCSPHLGqkeTC6qiitap_zjgLki+8HWta-UxReVvntA9=g@mail.gmail.com>
	<20120301144708.GV55074@deviant.kiev.zoral.com.ua>
Date: Thu, 1 Mar 2012 14:50:40 +0000
X-Google-Sender-Auth: 4lXrWlAYMBLYVKSXPymZfwnl-Ps
Message-ID: <CAJ-FndAKs-PK7odTMmh2bSkHvTddbUuO=Espzf8sZReT8KhbxQ@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Content-Type: text/plain; charset=UTF-8
Cc: arch@freebsd.org, Gleb Kurtsou <gleb.kurtsou@gmail.com>,
	Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 14:50:42 -0000

2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
> On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
>> 2012/3/1, Pawel Jakub Dawidek <pjd@freebsd.org>:
>> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
>> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
>> >> > - "Every file system needs cache. Let's make it general, so that all
>> >> > file
>> >> >   systems can use it!" Well, for VFS each file system is a separate
>> >> >   entity, which is not the case for ZFS. ZFS can cache one block only
>> >> >   once that is used by one file system, 10 clones and 100 snapshots,
>> >> >   which all are separate mount points from VFS perspective.
>> >> >   The same block would be cached 111 times by the buffer cache.
>> >>
>> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
>> >> cache_entry() on your own), add a number of cache_prune calls. It's
>> >> pretty much library-like design you describe below.
>> >
>> > Yes, namecache is already library-like, but I was talking about the
>> > buffer cache. I managed to bypass it eventually with suggestions from
>> > ups@, but for a long time I was sure it isn't at all possible.
>>
>> Can you please clarify on this as I really don't understand what you mean?
>>
>> >
>> >> Everybody agrees that VFS needs more care. But there haven't been much
>> >> of concrete suggestions or at least there is no VFS TODO list.
>> >
>> > Everybody agrees on that, true, but we disagree on the direction we
>> > should move our VFS, ie. make it more light-weight vs. more
>> > heavy-weight.
>>
>> All I'm saying (and Gleb too) is that I don't see any benefit in
>> replicating all the vnodes lifecycle at the inode level and in the
>> filesystem specific implementation.
>> I don't see a simplification in the work to do, I don't think this is
>> going to be simpler for a single specific filesystem (without
>> mentioning the legacy support, which means re-implement inode handling
>> for every filesystem we have now), we just lose generality.
>>
>> if you want a good example of a VFS primitive that was really
>> UFS-centric and it was mistakenly made generic is vn_start_write() and
>> siblings. I guess it was introduced just to cater UFS snapshot
>> creation and then it poisoned other consumers.
>
> vn_start_write() has nothing to do with filesystem code at all.
> It is purely VFS layer operation, which shall not be called from fs
> code at all. vn_start_secondary_write() is sometimes useful for the
> filesystem itself.
>
> Suspension (not snapshotting) is very useful and allows to avoid some
> nasty issues with unmounts, remounts or guaranteed syncing of the
> filesystem. The fact that only UFS utilizes this functionality just
> shows that other filesystem implementors do not care about this
> correctness, or that other filesystems are not maintained.

I'm sure that when I looked into it only UFS suspension was being
touched by it and it was introduced back in the days when snapshotting
was sanitized.

So what are the races it is supposed to fix and other filesystems
don't care about?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 15:01:36 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 459E2106564A;
	Thu,  1 Mar 2012 15:01:36 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200])
	by mx1.freebsd.org (Postfix) with ESMTP id 79B098FC1B;
	Thu,  1 Mar 2012 15:01:34 +0000 (UTC)
Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q21F1QfH096322;
	Thu, 1 Mar 2012 17:01:26 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id
	q21F1PNW074838; Thu, 1 Mar 2012 17:01:25 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q21F1PYs074837; 
	Thu, 1 Mar 2012 17:01:25 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to
	kostikbel@gmail.com using -f
Date: Thu, 1 Mar 2012 17:01:25 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Attilio Rao <attilio@freebsd.org>
Message-ID: <20120301150125.GX55074@deviant.kiev.zoral.com.ua>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
	<20120301141247.GE1336@garage.freebsd.pl>
	<CAJ-FndCSPHLGqkeTC6qiitap_zjgLki+8HWta-UxReVvntA9=g@mail.gmail.com>
	<20120301144708.GV55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndAKs-PK7odTMmh2bSkHvTddbUuO=Espzf8sZReT8KhbxQ@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="o+ErJpKw5D0ndpyV"
Content-Disposition: inline
In-Reply-To: <CAJ-FndAKs-PK7odTMmh2bSkHvTddbUuO=Espzf8sZReT8KhbxQ@mail.gmail.com>
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	skuns.kiev.zoral.com.ua
Cc: arch@freebsd.org, Gleb Kurtsou <gleb.kurtsou@gmail.com>,
	Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 15:01:36 -0000


--o+ErJpKw5D0ndpyV
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote:
> 2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
> >> 2012/3/1, Pawel Jakub Dawidek <pjd@freebsd.org>:
> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
> >> >> > - "Every file system needs cache. Let's make it general, so that all
> >> >> > file
> >> >> >   systems can use it!" Well, for VFS each file system is a separate
> >> >> >   entity, which is not the case for ZFS. ZFS can cache one block only
> >> >> >   once that is used by one file system, 10 clones and 100 snapshots,
> >> >> >   which all are separate mount points from VFS perspective.
> >> >> >   The same block would be cached 111 times by the buffer cache.
> >> >>
> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
> >> >> cache_entry() on your own), add a number of cache_prune calls. It's
> >> >> pretty much library-like design you describe below.
> >> >
> >> > Yes, namecache is already library-like, but I was talking about the
> >> > buffer cache. I managed to bypass it eventually with suggestions from
> >> > ups@, but for a long time I was sure it isn't at all possible.
> >>
> >> Can you please clarify on this as I really don't understand what you mean?
> >>
> >> >
> >> >> Everybody agrees that VFS needs more care. But there haven't been much
> >> >> of concrete suggestions or at least there is no VFS TODO list.
> >> >
> >> > Everybody agrees on that, true, but we disagree on the direction we
> >> > should move our VFS, ie. make it more light-weight vs. more
> >> > heavy-weight.
> >>
> >> All I'm saying (and Gleb too) is that I don't see any benefit in
> >> replicating all the vnodes lifecycle at the inode level and in the
> >> filesystem specific implementation.
> >> I don't see a simplification in the work to do, I don't think this is
> >> going to be simpler for a single specific filesystem (without
> >> mentioning the legacy support, which means re-implement inode handling
> >> for every filesystem we have now), we just lose generality.
> >>
> >> if you want a good example of a VFS primitive that was really
> >> UFS-centric and it was mistakenly made generic is vn_start_write() and
> >> siblings. I guess it was introduced just to cater UFS snapshot
> >> creation and then it poisoned other consumers.
> >
> > vn_start_write() has nothing to do with filesystem code at all.
> > It is purely VFS layer operation, which shall not be called from fs
> > code at all. vn_start_secondary_write() is sometimes useful for the
> > filesystem itself.
> >
> > Suspension (not snapshotting) is very useful and allows to avoid some
> > nasty issues with unmounts, remounts or guaranteed syncing of the
> > filesystem. The fact that only UFS utilizes this functionality just
> > shows that other filesystem implementors do not care about this
> > correctness, or that other filesystems are not maintained.
>
> I'm sure that when I looked into it only UFS suspension was being
> touched by it and it was introduced back in the days when snapshotting
> was sanitized.
>
> So what are the races it is supposed to fix and other filesystems
> don't care about?

You cannot reliably sync the filesystem while other writers are active.
So, for instance, a loop over vnodes fsyncing them in the unmount code may
never terminate. The same is true for rw->ro remounts.

One possible solution is to suspend writers. If the unmount succeeds,
a writer gets a failure from its vn_start_write() call, while it
proceeds normally if the unmount is aborted or was never started.
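
A toy user-space model of the suspension scheme described above may make
the protocol easier to follow. This is only a sketch: the WriteGate class
and its method names are hypothetical and are not the kernel's
vn_start_write() implementation. A writer brackets each write with
start_write()/finish_write(); the unmount or sync path calls suspend() to
stop new writers and drain in-flight ones, then resume() with a flag saying
whether the filesystem went away.

```python
import threading


class WriteGate:
    """Toy model of per-mount write suspension (hypothetical, not kernel code)."""

    def __init__(self):
        self._cv = threading.Condition()
        self._writers = 0        # writes currently in flight
        self._suspended = False  # a sync/unmount is in progress
        self._gone = False       # the unmount completed

    def start_write(self, timeout=None):
        """Enter a write.

        Blocks while the gate is suspended. Returns True if the write may
        proceed, False if the filesystem went away or the wait timed out.
        """
        with self._cv:
            ok = self._cv.wait_for(
                lambda: self._gone or not self._suspended, timeout)
            if not ok or self._gone:
                return False
            self._writers += 1
            return True

    def finish_write(self):
        """Leave a write and wake anyone waiting for writers to drain."""
        with self._cv:
            self._writers -= 1
            self._cv.notify_all()

    def suspend(self):
        """Stop admitting new writers, then wait for in-flight ones to finish."""
        with self._cv:
            self._suspended = True
            self._cv.wait_for(lambda: self._writers == 0)

    def resume(self, unmounted=False):
        """Lift the suspension; unmounted=True makes blocked writers fail."""
        with self._cv:
            self._suspended = False
            self._gone = unmounted
            self._cv.notify_all()
```

In this model, a sync loop that runs between suspend() and resume() sees no
new dirty data, so it can terminate; writers blocked in start_write() either
continue after resume() or get a failure after resume(unmounted=True).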

Another (proper) example of suspension use is gjournal.


--o+ErJpKw5D0ndpyV
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAk9Pj0UACgkQC3+MBN1Mb4gzZACfYeiuRg03EuxoUfK6NjsPNMbx
Gn4AoIjglsR1+n6ZBjpK4y2BFXmDd1ly
=/m0G
-----END PGP SIGNATURE-----

--o+ErJpKw5D0ndpyV--

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 15:11:18 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 2AB101065672;
	Thu,  1 Mar 2012 15:11:18 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from mail-ee0-f54.google.com (mail-ee0-f54.google.com [74.125.83.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 7A0C48FC12;
	Thu,  1 Mar 2012 15:11:17 +0000 (UTC)
Received: by eekd17 with SMTP id d17so253248eek.13
	for <multiple recipients>; Thu, 01 Mar 2012 07:11:16 -0800 (PST)
Received-SPF: pass (google.com: domain of asmrookie@gmail.com designates
	10.112.9.34 as permitted sender) client-ip=10.112.9.34; 
Authentication-Results: mr.google.com;
	spf=pass (google.com: domain of asmrookie@gmail.com
	designates 10.112.9.34 as permitted sender)
	smtp.mail=asmrookie@gmail.com;
	dkim=pass header.i=asmrookie@gmail.com
Received: from mr.google.com ([10.112.9.34])
	by 10.112.9.34 with SMTP id w2mr2545454lba.50.1330614676462 (num_hops =
	1); Thu, 01 Mar 2012 07:11:16 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type;
	bh=7uAs/tmWbzC6/wr3qSuwtfYxQxRiocUfZf7/D4d583M=;
	b=BQ49iAOcE/voiKi4gCH8gIidDjjfXohLxk18U0k4SxxjijmjSDB1qTzgjFFIq+1MBC
	b77zOxS78xIGJPwQzxnHjYfzMbU44c9YV184oT0gHVgWreEIEyGf+4ovm/cLrBXQAmQ/
	9T/sPv6rmOqufpBYJebSAsoh/J6skE2t08SVQ=
MIME-Version: 1.0
Received: by 10.112.9.34 with SMTP id w2mr2071242lba.50.1330614676363; Thu, 01
	Mar 2012 07:11:16 -0800 (PST)
Sender: asmrookie@gmail.com
Received: by 10.112.41.5 with HTTP; Thu, 1 Mar 2012 07:11:16 -0800 (PST)
In-Reply-To: <20120301150125.GX55074@deviant.kiev.zoral.com.ua>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
	<20120301141247.GE1336@garage.freebsd.pl>
	<CAJ-FndCSPHLGqkeTC6qiitap_zjgLki+8HWta-UxReVvntA9=g@mail.gmail.com>
	<20120301144708.GV55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndAKs-PK7odTMmh2bSkHvTddbUuO=Espzf8sZReT8KhbxQ@mail.gmail.com>
	<20120301150125.GX55074@deviant.kiev.zoral.com.ua>
Date: Thu, 1 Mar 2012 15:11:16 +0000
X-Google-Sender-Auth: gJ-0HKxicl_VFuRPeeNGI95A3gU
Message-ID: <CAJ-FndA=ETSTLCxG1=6G4D0ypaqQB7pDiC=VO==gDyz1BrRWFA@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Content-Type: text/plain; charset=UTF-8
Cc: arch@freebsd.org, Gleb Kurtsou <gleb.kurtsou@gmail.com>,
	Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 15:11:18 -0000

2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
> On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote:
>> 2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
>> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
>> >> 2012/3/1, Pawel Jakub Dawidek <pjd@freebsd.org>:
>> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
>> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
>> >> >> > - "Every file system needs cache. Let's make it general, so that
>> >> >> > all
>> >> >> > file
>> >> >> >   systems can use it!" Well, for VFS each file system is a
>> >> >> > separate
>> >> >> >   entity, which is not the case for ZFS. ZFS can cache one block
>> >> >> > only
>> >> >> >   once that is used by one file system, 10 clones and 100
>> >> >> > snapshots,
>> >> >> >   which all are separate mount points from VFS perspective.
>> >> >> >   The same block would be cached 111 times by the buffer cache.
>> >> >>
>> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
>> >> >> cache_entry() on your own), add a number of cache_prune calls. It's
>> >> >> pretty much library-like design you describe below.
>> >> >
>> >> > Yes, namecache is already library-like, but I was talking about the
>> >> > buffer cache. I managed to bypass it eventually with suggestions from
>> >> > ups@, but for a long time I was sure it isn't at all possible.
>> >>
>> >> Can you please clarify on this as I really don't understand what you
>> >> mean?
>> >>
>> >> >
>> >> >> Everybody agrees that VFS needs more care. But there haven't been
>> >> >> much
>> >> >> of concrete suggestions or at least there is no VFS TODO list.
>> >> >
>> >> > Everybody agrees on that, true, but we disagree on the direction we
>> >> > should move our VFS, ie. make it more light-weight vs. more
>> >> > heavy-weight.
>> >>
>> >> All I'm saying (and Gleb too) is that I don't see any benefit in
>> >> replicating all the vnodes lifecycle at the inode level and in the
>> >> filesystem specific implementation.
>> >> I don't see a simplification in the work to do, I don't think this is
>> >> going to be simpler for a single specific filesystem (without
>> >> mentioning the legacy support, which means re-implement inode handling
>> >> for every filesystem we have now), we just lose generality.
>> >>
>> >> if you want a good example of a VFS primitive that was really
>> >> UFS-centric and it was mistakenly made generic is vn_start_write() and
>> >> siblings. I guess it was introduced just to cater UFS snapshot
>> >> creation and then it poisoned other consumers.
>> >
>> > vn_start_write() has nothing to do with filesystem code at all.
>> > It is purely VFS layer operation, which shall not be called from fs
>> > code at all. vn_start_secondary_write() is sometimes useful for the
>> > filesystem itself.
>> >
>> > Suspension (not snapshotting) is very useful and allows to avoid some
>> > nasty issues with unmounts, remounts or guaranteed syncing of the
>> > filesystem. The fact that only UFS utilizes this functionality just
>> > shows that other filesystem implementors do not care about this
>> > correctness, or that other filesystems are not maintained.
>>
>> I'm sure that when I looked into it only UFS suspension was being
>> touched by it and it was introduced back in the days when snapshotting
>> was sanitized.
>>
>> So what are the races it is supposed to fix and other filesystems
>> don't care about?
>
> You cannot reliably sync the filesystem while other writers are active.
> So, for instance, a loop over vnodes fsyncing them in the unmount code may
> never terminate. The same is true for rw->ro remounts.
>
> One possible solution is to suspend writers. If the unmount succeeds,
> a writer gets a failure from its vn_start_write() call, while it
> proceeds normally if the unmount is aborted or was never started.

I don't think we implement that right now, IIRC, but it is an interesting idea.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 15:16:51 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 88E19106566C;
	Thu,  1 Mar 2012 15:16:51 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200])
	by mx1.freebsd.org (Postfix) with ESMTP id C04938FC1E;
	Thu,  1 Mar 2012 15:16:50 +0000 (UTC)
Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q21FGhhU097853;
	Thu, 1 Mar 2012 17:16:43 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id
	q21FGgar075099; Thu, 1 Mar 2012 17:16:42 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q21FGglh075098; 
	Thu, 1 Mar 2012 17:16:42 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to
	kostikbel@gmail.com using -f
Date: Thu, 1 Mar 2012 17:16:42 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Attilio Rao <attilio@freebsd.org>
Message-ID: <20120301151642.GY55074@deviant.kiev.zoral.com.ua>
References: <20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
	<20120301141247.GE1336@garage.freebsd.pl>
	<CAJ-FndCSPHLGqkeTC6qiitap_zjgLki+8HWta-UxReVvntA9=g@mail.gmail.com>
	<20120301144708.GV55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndAKs-PK7odTMmh2bSkHvTddbUuO=Espzf8sZReT8KhbxQ@mail.gmail.com>
	<20120301150125.GX55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndA=ETSTLCxG1=6G4D0ypaqQB7pDiC=VO==gDyz1BrRWFA@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="k/PDUuKPvLVdBXpq"
Content-Disposition: inline
In-Reply-To: <CAJ-FndA=ETSTLCxG1=6G4D0ypaqQB7pDiC=VO==gDyz1BrRWFA@mail.gmail.com>
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	skuns.kiev.zoral.com.ua
Cc: arch@freebsd.org, Gleb Kurtsou <gleb.kurtsou@gmail.com>,
	Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 15:16:51 -0000


--k/PDUuKPvLVdBXpq
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote:
> 2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
> > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote:
> >> 2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
> >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
> >> >> 2012/3/1, Pawel Jakub Dawidek <pjd@freebsd.org>:
> >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
> >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
> >> >> >> > - "Every file system needs cache. Let's make it general, so that
> >> >> >> > all
> >> >> >> > file
> >> >> >> >   systems can use it!" Well, for VFS each file system is a
> >> >> >> > separate
> >> >> >> >   entity, which is not the case for ZFS. ZFS can cache one block
> >> >> >> > only
> >> >> >> >   once that is used by one file system, 10 clones and 100
> >> >> >> > snapshots,
> >> >> >> >   which all are separate mount points from VFS perspective.
> >> >> >> >   The same block would be cached 111 times by the buffer cache.
> >> >> >>
> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
> >> >> >> cache_entry() on your own), add a number of cache_prune calls. It's
> >> >> >> pretty much library-like design you describe below.
> >> >> >
> >> >> > Yes, namecache is already library-like, but I was talking about the
> >> >> > buffer cache. I managed to bypass it eventually with suggestions from
> >> >> > ups@, but for a long time I was sure it isn't at all possible.
> >> >>
> >> >> Can you please clarify on this as I really don't understand what you
> >> >> mean?
> >> >>
> >> >> >
> >> >> >> Everybody agrees that VFS needs more care. But there haven't been
> >> >> >> much
> >> >> >> of concrete suggestions or at least there is no VFS TODO list.
> >> >> >
> >> >> > Everybody agrees on that, true, but we disagree on the direction we
> >> >> > should move our VFS, ie. make it more light-weight vs. more
> >> >> > heavy-weight.
> >> >>
> >> >> All I'm saying (and Gleb too) is that I don't see any benefit in
> >> >> replicating all the vnodes lifecycle at the inode level and in the
> >> >> filesystem specific implementation.
> >> >> I don't see a simplification in the work to do, I don't think this is
> >> >> going to be simpler for a single specific filesystem (without
> >> >> mentioning the legacy support, which means re-implement inode handling
> >> >> for every filesystem we have now), we just lose generality.
> >> >>
> >> >> if you want a good example of a VFS primitive that was really
> >> >> UFS-centric and it was mistakenly made generic is vn_start_write() and
> >> >> siblings. I guess it was introduced just to cater UFS snapshot
> >> >> creation and then it poisoned other consumers.
> >> >
> >> > vn_start_write() has nothing to do with filesystem code at all.
> >> > It is purely VFS layer operation, which shall not be called from fs
> >> > code at all. vn_start_secondary_write() is sometimes useful for the
> >> > filesystem itself.
> >> >
> >> > Suspension (not snapshotting) is very useful and allows to avoid some
> >> > nasty issues with unmounts, remounts or guaranteed syncing of the
> >> > filesystem. The fact that only UFS utilizes this functionality just
> >> > shows that other filesystem implementors do not care about this
> >> > correctness, or that other filesystems are not maintained.
> >>
> >> I'm sure that when I looked into it only UFS suspension was being
> >> touched by it and it was introduced back in the days when snapshotting
> >> was sanitized.
> >>
> >> So what are the races it is supposed to fix and other filesystems
> >> don't care about?
> >
> > You cannot reliably sync the filesystem while other writers are active.
> > So, for instance, a loop over vnodes fsyncing them in the unmount code may
> > never terminate. The same is true for rw->ro remounts.
> >
> > One possible solution is to suspend writers. If the unmount succeeds,
> > a writer gets a failure from its vn_start_write() call, while it
> > proceeds normally if the unmount is aborted or was never started.
>
> I don't think we implement that right now, IIRC, but it is an interesting idea.

What don't we implement right now? Take a look at r183074 (Sep 2008).

--k/PDUuKPvLVdBXpq
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAk9PktoACgkQC3+MBN1Mb4gFwQCfaxSZ9pfQ+PsYYQmWry7vDHCp
tykAnjplVq3pEMugDE19Yffjtw2mu4j3
=9++M
-----END PGP SIGNATURE-----

--k/PDUuKPvLVdBXpq--

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 15:23:23 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 1963C106567B;
	Thu,  1 Mar 2012 15:23:23 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com
	[209.85.215.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 4574B8FC17;
	Thu,  1 Mar 2012 15:23:22 +0000 (UTC)
Received: by lagv3 with SMTP id v3so1197770lag.13
	for <multiple recipients>; Thu, 01 Mar 2012 07:23:21 -0800 (PST)
Received-SPF: pass (google.com: domain of asmrookie@gmail.com designates
	10.152.147.202 as permitted sender) client-ip=10.152.147.202; 
Authentication-Results: mr.google.com;
	spf=pass (google.com: domain of asmrookie@gmail.com
	designates 10.152.147.202 as permitted sender)
	smtp.mail=asmrookie@gmail.com;
	dkim=pass header.i=asmrookie@gmail.com
Received: from mr.google.com ([10.152.147.202])
	by 10.152.147.202 with SMTP id tm10mr5390208lab.49.1330615401231
	(num_hops = 1); Thu, 01 Mar 2012 07:23:21 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type;
	bh=AMtp+5UuzzagZLgISIGIj0nfQjIHPYkfsGjrSuQg7vU=;
	b=w0cUtut36Il4G6fAWRPXVuePP9zEZ7ARRw87na7PoDwR6jgQj9EUtMYZ9QeivffEDI
	gUR3XTFYgz96JQBBJxjEml4xR5P82j+xXF5zAYFP3lbm7wFrZ9Lcj/mUqX2iQIItlj1n
	DfoAErn2ntprGaoPbLL1A4h24nXIRX2J3c1EQ=
MIME-Version: 1.0
Received: by 10.152.147.202 with SMTP id tm10mr4385514lab.49.1330615401154;
	Thu, 01 Mar 2012 07:23:21 -0800 (PST)
Sender: asmrookie@gmail.com
Received: by 10.112.41.5 with HTTP; Thu, 1 Mar 2012 07:23:21 -0800 (PST)
In-Reply-To: <20120301151642.GY55074@deviant.kiev.zoral.com.ua>
References: <20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
	<20120301141247.GE1336@garage.freebsd.pl>
	<CAJ-FndCSPHLGqkeTC6qiitap_zjgLki+8HWta-UxReVvntA9=g@mail.gmail.com>
	<20120301144708.GV55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndAKs-PK7odTMmh2bSkHvTddbUuO=Espzf8sZReT8KhbxQ@mail.gmail.com>
	<20120301150125.GX55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndA=ETSTLCxG1=6G4D0ypaqQB7pDiC=VO==gDyz1BrRWFA@mail.gmail.com>
	<20120301151642.GY55074@deviant.kiev.zoral.com.ua>
Date: Thu, 1 Mar 2012 15:23:21 +0000
X-Google-Sender-Auth: uqNAXcAOSIaEColkq71YVStiA18
Message-ID: <CAJ-FndCoKO9ejs+tAjVDMfeg18n4rYxTD8qPZgCXdccdKqV+8A@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Content-Type: text/plain; charset=UTF-8
Cc: arch@freebsd.org, Gleb Kurtsou <gleb.kurtsou@gmail.com>,
	Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 15:23:23 -0000

2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
> On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote:
>> 2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
>> > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote:
>> >> 2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
>> >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
>> >> >> 2012/3/1, Pawel Jakub Dawidek <pjd@freebsd.org>:
>> >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
>> >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
>> >> >> >> > - "Every file system needs cache. Let's make it general, so
>> >> >> >> > that
>> >> >> >> > all
>> >> >> >> > file
>> >> >> >> >   systems can use it!" Well, for VFS each file system is a
>> >> >> >> > separate
>> >> >> >> >   entity, which is not the case for ZFS. ZFS can cache one
>> >> >> >> > block
>> >> >> >> > only
>> >> >> >> >   once that is used by one file system, 10 clones and 100
>> >> >> >> > snapshots,
>> >> >> >> >   which all are separate mount points from VFS perspective.
>> >> >> >> >   The same block would be cached 111 times by the buffer cache.
>> >> >> >>
>> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
>> >> >> >> cache_entry() on your own), add a number of cache_prune calls.
>> >> >> >> It's
>> >> >> >> pretty much library-like design you describe below.
>> >> >> >
>> >> >> > Yes, namecache is already library-like, but I was talking about
>> >> >> > the
>> >> >> > buffer cache. I managed to bypass it eventually with suggestions
>> >> >> > from
>> >> >> > ups@, but for a long time I was sure it isn't at all possible.
>> >> >>
>> >> >> Can you please clarify on this as I really don't understand what you
>> >> >> mean?
>> >> >>
>> >> >> >
>> >> >> >> Everybody agrees that VFS needs more care. But there haven't been
>> >> >> >> much
>> >> >> >> of concrete suggestions or at least there is no VFS TODO list.
>> >> >> >
>> >> >> > Everybody agrees on that, true, but we disagree on the direction
>> >> >> > we
>> >> >> > should move our VFS, ie. make it more light-weight vs. more
>> >> >> > heavy-weight.
>> >> >>
>> >> >> All I'm saying (and Gleb too) is that I don't see any benefit in
>> >> >> replicating all the vnodes lifecycle at the inode level and in the
>> >> >> filesystem specific implementation.
>> >> >> I don't see a semplification in the work to do, I don't think this
>> >> >> is
>> >> >> going to be simpler for a single specific filesystem (without
>> >> >> mentioning the legacy support, which means re-implement inode
>> >> >> handling
>> >> >> for every filesystem we have now), we just loose generality.
>> >> >>
>> >> >> if you want a good example of a VFS primitive that was really
>> >> >> UFS-centric and it was mistakenly made generic is vn_start_write()
>> >> >> and
>> >> >> sibillings. I guess it was introduced just to cater UFS snapshot
>> >> >> creation and then it poisoned other consumers.
>> >> >
>> >> > vn_start_write() has nothing to do with filesystem code at all.
>> >> > It is purely VFS layer operation, which shall not be called from fs
>> >> > code at all. vn_start_secondary_write() is sometimes useful for the
>> >> > filesystem itself.
>> >> >
>> >> > Suspension (not snapshotting) is very useful and allows to avoid some
>> >> > nasty issues with unmounts, remounts or guaranteed syncing of the
>> >> > filesystem. The fact that only UFS utilizes this functionality just
>> >> > shows that other filesystem implementors do not care about this
>> >> > correctness, or that other filesystems are not maintained.
>> >>
>> >> I'm sure that when I looked into it only UFS suspension was being
>> >> touched by it and it was introduced back in the days when snapshotting
>> >> was sanitized.
>> >>
>> >> So what are the races it is supposed to fix and other filesystems
>> >> don't care about?
>> >
>> > You cannot reliably sync the filesystem when other writers are active.
>> > So, for instance, loop over vnodes fsyncing them in unmount code can
>> > never
>> > terminate. The same is true for remounts rw->ro.
>> >
>> > One of the possible solution there is to suspend writers. If unmount is
>> > successfull, writer will get a failure from vn_start_write() call, while
>> > it will proceed normal if unmount is terminated or not started at all.
>>
>> I don't think we implement that right now, IIRC, but it is an interesting
>> idea.
>
> What don't we implement right now ? Take a look at r183074 (Sep 2008).

Ah sorry, I had actually looked into it before 2008 (and that also
reminds me why I stopped working on removing that primitive from VFS
and making it a UFS-specific one) :)

However, why can't we make a fix like that directly in
domount()/dounmount() for every R/W filesystem?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 15:35:58 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id B2A4E1065676;
	Thu,  1 Mar 2012 15:35:58 +0000 (UTC)
	(envelope-from kostikbel@gmail.com)
Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200])
	by mx1.freebsd.org (Postfix) with ESMTP id 0C5398FC1E;
	Thu,  1 Mar 2012 15:35:57 +0000 (UTC)
Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1])
	by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q21FZfl6099329;
	Thu, 1 Mar 2012 17:35:41 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1])
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id
	q21FZfFQ075215; Thu, 1 Mar 2012 17:35:41 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost)
	by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q21FZfFu075214; 
	Thu, 1 Mar 2012 17:35:41 +0200 (EET)
	(envelope-from kostikbel@gmail.com)
X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to
	kostikbel@gmail.com using -f
Date: Thu, 1 Mar 2012 17:35:41 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Attilio Rao <attilio@freebsd.org>
Message-ID: <20120301153541.GZ55074@deviant.kiev.zoral.com.ua>
References: <20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
	<20120301141247.GE1336@garage.freebsd.pl>
	<CAJ-FndCSPHLGqkeTC6qiitap_zjgLki+8HWta-UxReVvntA9=g@mail.gmail.com>
	<20120301144708.GV55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndAKs-PK7odTMmh2bSkHvTddbUuO=Espzf8sZReT8KhbxQ@mail.gmail.com>
	<20120301150125.GX55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndA=ETSTLCxG1=6G4D0ypaqQB7pDiC=VO==gDyz1BrRWFA@mail.gmail.com>
	<20120301151642.GY55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndCoKO9ejs+tAjVDMfeg18n4rYxTD8qPZgCXdccdKqV+8A@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="JjEBCMAGNkRv8xbT"
Content-Disposition: inline
In-Reply-To: <CAJ-FndCoKO9ejs+tAjVDMfeg18n4rYxTD8qPZgCXdccdKqV+8A@mail.gmail.com>
User-Agent: Mutt/1.4.2.3i
X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua
X-Virus-Status: Clean
X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00
	autolearn=ham version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
	skuns.kiev.zoral.com.ua
Cc: arch@freebsd.org, Gleb Kurtsou <gleb.kurtsou@gmail.com>,
	Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 15:35:58 -0000


--JjEBCMAGNkRv8xbT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 01, 2012 at 03:23:21PM +0000, Attilio Rao wrote:
> 2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
> > On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote:
> >> 2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
> >> > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote:
> >> >> 2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
> >> >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
> >> >> >> 2012/3/1, Pawel Jakub Dawidek <pjd@freebsd.org>:
> >> >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
> >> >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
> >> >> >> >> > - "Every file system needs cache. Let's make it general, so
> >> >> >> >> > that
> >> >> >> >> > all
> >> >> >> >> > file
> >> >> >> >> >   systems can use it!" Well, for VFS each file system is a
> >> >> >> >> > separate
> >> >> >> >> >   entity, which is not the case for ZFS. ZFS can cache one
> >> >> >> >> > block
> >> >> >> >> > only
> >> >> >> >> >   once that is used by one file system, 10 clones and 100
> >> >> >> >> > snapshots,
> >> >> >> >> >   which all are separate mount points from VFS perspective.
> >> >> >> >> >   The same block would be cached 111 times by the buffer cache.
> >> >> >> >>
> >> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
> >> >> >> >> cache_entry() on your own), add a number of cache_prune calls.
> >> >> >> >> It's
> >> >> >> >> pretty much library-like design you describe below.
> >> >> >> >
> >> >> >> > Yes, namecache is already library-like, but I was talking about
> >> >> >> > the
> >> >> >> > buffer cache. I managed to bypass it eventually with suggestions
> >> >> >> > from
> >> >> >> > ups@, but for a long time I was sure it isn't at all possible.
> >> >> >>
> >> >> >> Can you please clarify on this as I really don't understand what you
> >> >> >> mean?
> >> >> >>
> >> >> >> >
> >> >> >> >> Everybody agrees that VFS needs more care. But there haven't been
> >> >> >> >> much
> >> >> >> >> of concrete suggestions or at least there is no VFS TODO list.
> >> >> >> >
> >> >> >> > Everybody agrees on that, true, but we disagree on the direction
> >> >> >> > we
> >> >> >> > should move our VFS, ie. make it more light-weight vs. more
> >> >> >> > heavy-weight.
> >> >> >>
> >> >> >> All I'm saying (and Gleb too) is that I don't see any benefit in
> >> >> >> replicating all the vnodes lifecycle at the inode level and in the
> >> >> >> filesystem specific implementation.
> >> >> >> I don't see a semplification in the work to do, I don't think this
> >> >> >> is
> >> >> >> going to be simpler for a single specific filesystem (without
> >> >> >> mentioning the legacy support, which means re-implement inode
> >> >> >> handling
> >> >> >> for every filesystem we have now), we just loose generality.
> >> >> >>
> >> >> >> if you want a good example of a VFS primitive that was really
> >> >> >> UFS-centric and it was mistakenly made generic is vn_start_write()
> >> >> >> and
> >> >> >> sibillings. I guess it was introduced just to cater UFS snapshot
> >> >> >> creation and then it poisoned other consumers.
> >> >> >
> >> >> > vn_start_write() has nothing to do with filesystem code at all.
> >> >> > It is purely VFS layer operation, which shall not be called from fs
> >> >> > code at all. vn_start_secondary_write() is sometimes useful for the
> >> >> > filesystem itself.
> >> >> >
> >> >> > Suspension (not snapshotting) is very useful and allows to avoid some
> >> >> > nasty issues with unmounts, remounts or guaranteed syncing of the
> >> >> > filesystem. The fact that only UFS utilizes this functionality just
> >> >> > shows that other filesystem implementors do not care about this
> >> >> > correctness, or that other filesystems are not maintained.
> >> >>
> >> >> I'm sure that when I looked into it only UFS suspension was being
> >> >> touched by it and it was introduced back in the days when snapshotting
> >> >> was sanitized.
> >> >>
> >> >> So what are the races it is supposed to fix and other filesystems
> >> >> don't care about?
> >> >
> >> > You cannot reliably sync the filesystem when other writers are active.
> >> > So, for instance, loop over vnodes fsyncing them in unmount code can
> >> > never
> >> > terminate. The same is true for remounts rw->ro.
> >> >
> >> > One of the possible solution there is to suspend writers. If unmount is
> >> > successfull, writer will get a failure from vn_start_write() call, while
> >> > it will proceed normal if unmount is terminated or not started at all.
> >>
> >> I don't think we implement that right now, IIRC, but it is an interesting
> >> idea.
> >
> > What don't we implement right now ? Take a look at r183074 (Sep 2008).
>
> Ah sorry, I looked into it before 2008 effectively (and that also
> reminds me why I stopped working on removing that primitive from VFS
> and make it UFS specific one) :)
>
> However why we cannot make a fix like that in domount()/dounmount()
> directly for every R/W filesystem?

At the very least, the filesystem needs to implement the VFS_SUSP_CLEAN
VFS op. The purpose of the operation is to clean up after suspension:
e.g., in the UFS case, VFS_SUSP_CLEAN removes unlinked files whose
reference count went to 0 during suspension, and also processes delayed
atime updates.

Another issue that I see is the handling of filesystems that offload i/o
to several threads. The unmount thread is given special rights to perform
i/o while the filesystem is suspended, but VFS cannot know about other
threads that shall be permitted to perform writes.

Those are at least the two issues that I remember coming up while
applying suspension to UFS unmount.

With all these complications, suspension is provided in the form of a
library for use by filesystem implementors, and not as a mandatory
feature of VFS.
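
The suspension protocol discussed above can be illustrated with a small
user-space model. This is only a sketch, written in Python for brevity,
and it is not the FreeBSD implementation: the Mount class, its method
names and the susp_clean callback are hypothetical stand-ins that loosely
mirror vn_start_write() and VFS_SUSP_CLEAN. It assumes that the
suspending thread is the only one allowed to write during suspension,
which is the limitation noted above for filesystems that offload i/o to
several threads.

```python
import threading


class Mount:
    """Toy model of write suspension; all names here are hypothetical."""

    def __init__(self, susp_clean=None):
        self._cv = threading.Condition()
        self._writers = 0              # writes currently in progress
        self._owner = None             # thread that suspended the mount
        self._gone = False             # set once an unmount has succeeded
        self._susp_clean = susp_clean  # per-filesystem cleanup hook

    def start_write(self):
        """Return True if the caller may write, False if the mount is gone."""
        me = threading.current_thread()
        with self._cv:
            # Everyone except the suspending thread sleeps while suspended.
            while self._owner not in (None, me) and not self._gone:
                self._cv.wait()
            if self._gone:
                return False
            self._writers += 1
            return True

    def finish_write(self):
        with self._cv:
            self._writers -= 1
            self._cv.notify_all()

    def suspend(self):
        """Block new writers, then wait for in-flight writes to drain."""
        with self._cv:
            while self._owner is not None:
                self._cv.wait()
            self._owner = threading.current_thread()
            while self._writers > 0:
                self._cv.wait()

    def resume(self, unmounted):
        """End the suspension; sleeping writers then either fail or proceed."""
        with self._cv:
            if unmounted:
                self._gone = True
            elif self._susp_clean is not None:
                self._susp_clean()     # deferred work, run before writers wake
            self._owner = None
            self._cv.notify_all()
```

In this model a writer sleeping in start_write() gets True after
resume(unmounted=False) and False after resume(unmounted=True), which
matches the behaviour described earlier in the thread for a terminated
versus a successful unmount.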

--JjEBCMAGNkRv8xbT
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAk9Pl00ACgkQC3+MBN1Mb4i61gCfbNMsO6TQXa6gYB73u/0gKYjf
leIAnRYbWi3DKaiOQD1fRnXzYM/gxM3b
=h3Yh
-----END PGP SIGNATURE-----

--JjEBCMAGNkRv8xbT--

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 16:45:12 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D9866106564A;
	Thu,  1 Mar 2012 16:45:12 +0000 (UTC)
	(envelope-from gleb.kurtsou@gmail.com)
Received: from mail-ee0-f54.google.com (mail-ee0-f54.google.com [74.125.83.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 974E58FC1A;
	Thu,  1 Mar 2012 16:45:11 +0000 (UTC)
Received: by eekd17 with SMTP id d17so314788eek.13
	for <multiple recipients>; Thu, 01 Mar 2012 08:45:10 -0800 (PST)
Received-SPF: pass (google.com: domain of gleb.kurtsou@gmail.com designates
	10.112.84.1 as permitted sender) client-ip=10.112.84.1; 
Authentication-Results: mr.google.com;
	spf=pass (google.com: domain of gleb.kurtsou@gmail.com
	designates 10.112.84.1 as permitted sender)
	smtp.mail=gleb.kurtsou@gmail.com;
	dkim=pass header.i=gleb.kurtsou@gmail.com
Received: from mr.google.com ([10.112.84.1])
	by 10.112.84.1 with SMTP id u1mr2739670lby.35.1330620310745 (num_hops =
	1); Thu, 01 Mar 2012 08:45:10 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=date:from:to:cc:subject:message-id:references:mime-version
	:content-type:content-disposition:in-reply-to:user-agent;
	bh=7S898NWJKgzv2C1Ylz/B72zxBZnpVCk2CQ3ZJMPYCJY=;
	b=cR6WU+teca3cDe8EeyS4nc9Xkk7XAH2XX09zhYQate7WojlLzWZmU6aM1kLV2EPkm+
	JUbhxmBHvJ3Il9sYRRs1Uq3DuzbsoqEcq54noXuvoUPevQGY5BQQzkF1KIOrPUQAYNmH
	AqWfdvrQYLokuUAwrLN+CDZYPaKAqN5ksZpsQ=
Received: by 10.112.84.1 with SMTP id u1mr2248604lby.35.1330620310553;
	Thu, 01 Mar 2012 08:45:10 -0800 (PST)
Received: from localhost ([78.157.92.5])
	by mx.google.com with ESMTPS id f2sm3661105lbw.5.2012.03.01.08.45.09
	(version=SSLv3 cipher=OTHER); Thu, 01 Mar 2012 08:45:09 -0800 (PST)
Date: Thu, 1 Mar 2012 18:45:11 +0200
From: Gleb Kurtsou <gleb.kurtsou@gmail.com>
To: Christoph Hellwig <hch@infradead.org>
Message-ID: <20120301164511.GA1501@reks>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua>
	<CAJ-FndABi21GfcCRTZizCPc_Mnxm1EY271BiXcYt9SD_zXFpXw@mail.gmail.com>
	<20120225151334.GH1344@garage.freebsd.pl>
	<CAJ-FndBBKHrpB1MNJTXx8gkFXR2d-O6k5-HJeOAyv2DznpN-QQ@mail.gmail.com>
	<20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks> <20120301141010.GA7079@infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20120301141010.GA7079@infradead.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: Attilio Rao <attilio@freebsd.org>,
	Konstantin Belousov <kostikbel@gmail.com>,
	Pawel Jakub Dawidek <pjd@FreeBSD.org>, arch@freebsd.org
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 16:45:12 -0000

On (01/03/2012 09:10), Christoph Hellwig wrote:
> On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
> > Are you aware of a real "libraries for file systems" VFS example? It
> > sounds very interesting but I'm afraid it's going to look good only in
> > theory. E.g. locking at file system level (Darwin, Dragonfly BSD) looks
> > rather messy (IMHO) and more likely to be bug prone. On the other side
> > Linux has optional per file system rename lock making VOP_RENAME
> > implementation much easier, while ours is tremendously difficult to do
> > right.
> 
> All namespace locking in Linux is in the VFS, and it mandatory.  A
> filesystem wide lock is only used for cross-directory renames.
> 
> A more detailed description is here:
> 
> 	http://git.kernel.dk/?p=linux.git;a=blob;f=Documentation/filesystems/directory-locking
> 

My bad. I thought s_vfs_rename_mutex could be optional. Quite unfortunate
that Linux doesn't support concurrent cross-directory renames :)
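
For reference, the scheme Christoph describes can be modelled roughly as
follows. This is a toy Python sketch under stated assumptions, not Linux
code: Directory, FileSystem and rename_lock are made-up names
(rename_lock plays the role of s_vfs_rename_mutex), and ordering the two
directory locks by name is a simplification of the real lock-ordering
rules in the directory-locking document.

```python
import threading
from contextlib import ExitStack


class Directory:
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()   # per-directory lock
        self.entries = {}              # entry name -> object


class FileSystem:
    def __init__(self):
        # Filesystem-wide lock, taken only for cross-directory renames.
        self.rename_lock = threading.Lock()
        self.cross_dir_renames = 0

    def rename(self, src_dir, src_name, dst_dir, dst_name):
        with ExitStack() as stack:
            if src_dir is dst_dir:
                # Same directory: the parent's own lock is enough.
                stack.enter_context(src_dir.lock)
            else:
                # Cross-directory: serialize on the filesystem-wide lock,
                # then lock both parents in a fixed (here: by-name) order.
                stack.enter_context(self.rename_lock)
                self.cross_dir_renames += 1
                for d in sorted((src_dir, dst_dir), key=lambda d: d.name):
                    stack.enter_context(d.lock)
            dst_dir.entries[dst_name] = src_dir.entries.pop(src_name)
```

Renames within one directory never touch rename_lock here, so they can
run concurrently with each other; only cross-directory renames are
serialized.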

Thanks,
Gleb.

From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 17:05:02 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 4E141106564A;
	Thu,  1 Mar 2012 17:05:02 +0000 (UTC)
	(envelope-from asmrookie@gmail.com)
Received: from mail-vx0-f182.google.com (mail-vx0-f182.google.com
	[209.85.220.182])
	by mx1.freebsd.org (Postfix) with ESMTP id C9B5A8FC13;
	Thu,  1 Mar 2012 17:05:01 +0000 (UTC)
Received: by vcbfl15 with SMTP id fl15so821083vcb.13
	for <multiple recipients>; Thu, 01 Mar 2012 09:05:01 -0800 (PST)
Received-SPF: pass (google.com: domain of asmrookie@gmail.com designates
	10.52.99.169 as permitted sender) client-ip=10.52.99.169; 
Authentication-Results: mr.google.com;
	spf=pass (google.com: domain of asmrookie@gmail.com
	designates 10.52.99.169 as permitted sender)
	smtp.mail=asmrookie@gmail.com;
	dkim=pass header.i=asmrookie@gmail.com
Received: from mr.google.com ([10.52.99.169])
	by 10.52.99.169 with SMTP id er9mr9140528vdb.126.1330621501308
	(num_hops = 1); Thu, 01 Mar 2012 09:05:01 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type;
	bh=FFAhnUi55uRUY3tqk6a/YHI48a5bXe4jvv+rNuyj3uE=;
	b=Gxm0UCPxGjPU38mVxYBbylHdr+PPePT9zXJtuT9+AIFf/R0ywwFXg5NwaTk92Lri+h
	B2eqnYlZkxJDoW7E7xjUE0uXSDtwNbCJqJlogeTR6jLV2Ij2vGnv4OWBynHr0w6z73TO
	QoRoGLyR6Yp0uwAHrQ3o1VDG4UHpZZLGydWgY=
MIME-Version: 1.0
Received: by 10.52.99.169 with SMTP id er9mr7754144vdb.126.1330621501096; Thu,
	01 Mar 2012 09:05:01 -0800 (PST)
Sender: asmrookie@gmail.com
Received: by 10.220.38.72 with HTTP; Thu, 1 Mar 2012 09:05:00 -0800 (PST)
In-Reply-To: <20120301153541.GZ55074@deviant.kiev.zoral.com.ua>
References: <20120225194630.GI1344@garage.freebsd.pl>
	<20120301111624.GB30991@reks>
	<20120301141247.GE1336@garage.freebsd.pl>
	<CAJ-FndCSPHLGqkeTC6qiitap_zjgLki+8HWta-UxReVvntA9=g@mail.gmail.com>
	<20120301144708.GV55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndAKs-PK7odTMmh2bSkHvTddbUuO=Espzf8sZReT8KhbxQ@mail.gmail.com>
	<20120301150125.GX55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndA=ETSTLCxG1=6G4D0ypaqQB7pDiC=VO==gDyz1BrRWFA@mail.gmail.com>
	<20120301151642.GY55074@deviant.kiev.zoral.com.ua>
	<CAJ-FndCoKO9ejs+tAjVDMfeg18n4rYxTD8qPZgCXdccdKqV+8A@mail.gmail.com>
	<20120301153541.GZ55074@deviant.kiev.zoral.com.ua>
Date: Thu, 1 Mar 2012 17:05:00 +0000
X-Google-Sender-Auth: o5p0MzBSUTUxIxtcQxgyi0p3mjw
Message-ID: <CAJ-FndAAnK9nTVMKz9ONJbXWe73A_MZ=VuVq-4gOzE7hcc9ibg@mail.gmail.com>
From: Attilio Rao <attilio@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Content-Type: text/plain; charset=UTF-8
Cc: arch@freebsd.org, Gleb Kurtsou <gleb.kurtsou@gmail.com>,
	Pawel Jakub Dawidek <pjd@freebsd.org>
Subject: Re: Prefaulting for i/o buffers
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 01 Mar 2012 17:05:02 -0000

2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
> On Thu, Mar 01, 2012 at 03:23:21PM +0000, Attilio Rao wrote:
>> 2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
>> > On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote:
>> >> 2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
>> >> > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote:
>> >> >> 2012/3/1, Konstantin Belousov <kostikbel@gmail.com>:
>> >> >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
>> >> >> >> 2012/3/1, Pawel Jakub Dawidek <pjd@freebsd.org>:
>> >> >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
>> >> >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
>> >> >> >> >> > - "Every file system needs cache. Let's make it general, so
>> >> >> >> >> > that
>> >> >> >> >> > all
>> >> >> >> >> > file
>> >> >> >> >> >   systems can use it!" Well, for VFS each file system is a
>> >> >> >> >> > separate
>> >> >> >> >> >   entity, which is not the case for ZFS. ZFS can cache one
>> >> >> >> >> > block
>> >> >> >> >> > only
>> >> >> >> >> >   once that is used by one file system, 10 clones and 100
>> >> >> >> >> > snapshots,
>> >> >> >> >> >   which all are separate mount points from VFS perspective.
>> >> >> >> >> >   The same block would be cached 111 times by the buffer
>> >> >> >> >> > cache.
>> >> >> >> >>
>> >> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
>> >> >> >> >> cache_entry() on your own), add a number of cache_prune calls.
>> >> >> >> >> It's
>> >> >> >> >> pretty much library-like design you describe below.
>> >> >> >> >
>> >> >> >> > Yes, namecache is already library-like, but I was talking about
>> >> >> >> > the
>> >> >> >> > buffer cache. I managed to bypass it eventually with
>> >> >> >> > suggestions
>> >> >> >> > from
>> >> >> >> > ups@, but for a long time I was sure it isn't at all possible.
>> >> >> >>
>> >> >> >> Can you please clarify on this as I really don't understand what
>> >> >> >> you
>> >> >> >> mean?
>> >> >> >>
>> >> >> >> >
>> >> >> >> >> Everybody agrees that VFS needs more care. But there haven't
>> >> >> >> >> been
>> >> >> >> >> much
>> >> >> >> >> of concrete suggestions or at least there is no VFS TODO list.
>> >> >> >> >
>> >> >> >> > Everybody agrees on that, true, but we disagree on the
>> >> >> >> > direction
>> >> >> >> > we
>> >> >> >> > should move our VFS, ie. make it more light-weight vs. more
>> >> >> >> > heavy-weight.
>> >> >> >>
>> >> >> >> All I'm saying (and Gleb too) is that I don't see any benefit in
>> >> >> >> replicating all the vnodes lifecycle at the inode level and in
>> >> >> >> the
>> >> >> >> filesystem specific implementation.
>> >> >> >> I don't see a semplification in the work to do, I don't think
>> >> >> >> this
>> >> >> >> is
>> >> >> >> going to be simpler for a single specific filesystem (without
>> >> >> >> mentioning the legacy support, which means re-implement inode
>> >> >> >> handling
>> >> >> >> for every filesystem we have now), we just loose generality.
>> >> >> >>
>> >> >> >> if you want a good example of a VFS primitive that was really
>> >> >> >> UFS-centric and it was mistakenly made generic is
>> >> >> >> vn_start_write()
>> >> >> >> and
>> >> >> >> sibillings. I guess it was introduced just to cater UFS snapshot
>> >> >> >> creation and then it poisoned other consumers.
>> >> >> >
>> >> >> > vn_start_write() has nothing to do with filesystem code at all.
>> >> >> > It is purely VFS layer operation, which shall not be called from
>> >> >> > fs
>> >> >> > code at all. vn_start_secondary_write() is sometimes useful for
>> >> >> > the
>> >> >> > filesystem itself.
>> >> >> >
>> >> >> > Suspension (not snapshotting) is very useful and allows to avoid
>> >> >> > some
>> >> >> > nasty issues with unmounts, remounts or guaranteed syncing of the
>> >> >> > filesystem. The fact that only UFS utilizes this functionality
>> >> >> > just
>> >> >> > shows that other filesystem implementors do not care about this
>> >> >> > correctness, or that other filesystems are not maintained.
>> >> >>
>> >> >> I'm sure that when I looked into it only UFS suspension was being
>> >> >> touched by it and it was introduced back in the days when
>> >> >> snapshotting
>> >> >> was sanitized.
>> >> >>
>> >> >> So what are the races it is supposed to fix and other filesystems
>> >> >> don't care about?
>> >> >
>> >> > You cannot reliably sync the filesystem when other writers are
>> >> > active.
>> >> > So, for instance, loop over vnodes fsyncing them in unmount code can
>> >> > never
>> >> > terminate. The same is true for remounts rw->ro.
>> >> >
>> >> > One of the possible solution there is to suspend writers. If unmount
>> >> > is
>> >> > successfull, writer will get a failure from vn_start_write() call,
>> >> > while
>> >> > it will proceed normal if unmount is terminated or not started at
>> >> > all.
>> >>
>> >> I don't think we implement that right now, IIRC, but it is an
>> >> interesting
>> >> idea.
>> >
>> > What don't we implement right now ? Take a look at r183074 (Sep 2008).
>>
>> Ah sorry, I looked into it before 2008 effectively (and that also
>> reminds me why I stopped working on removing that primitive from VFS
>> and make it UFS specific one) :)
>>
>> However why we cannot make a fix like that in domount()/dounmount()
>> directly for every R/W filesystem?
> At least, the filesystem needs to implement the VFS_SUSP_CLEAN VFS op.
> The purpose of the operation is to clean up after suspension, e.g.
> in the UFS case, VFS_SUSP_CLEAN removes unlinked files which reference
> count went to 0 during suspension, as well as process delayed atime
> updating.
>
> Another issue that I see is handling of filesystems that offload i/o to
> several threads. The unmount thread is given special rights to perform
> i/o while filesystem is suspended, but VFS cannot know about other threads
> that shall be permitted to perform writes.
>
> At least those are two issues that appeared during applying the suspension
> to UFS unmount and which I remember.
>
> With all this complications, suspension is provided in a form of library
> for use by filesystem implementors, and not as a mandatory feature of VFS.

It makes sense, thanks for explaining the issues you found while
implementing this trick on UFS.

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG  Fri Mar  2 17:28:58 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 460BD1065670
	for <freebsd-arch@freebsd.org>; Fri,  2 Mar 2012 17:28:58 +0000 (UTC)
	(envelope-from johnandsara2@cox.net)
Received: from eastrmfepi108.cox.net (eastrmfepi108.cox.net [68.230.241.204])
	by mx1.freebsd.org (Postfix) with ESMTP id D4C068FC12
	for <freebsd-arch@freebsd.org>; Fri,  2 Mar 2012 17:28:57 +0000 (UTC)
Received: from eastrmimpo210.cox.net ([68.230.241.225])
	by eastrmfepo101.cox.net
	(InterMail vM.8.01.04.00 201-2260-137-20101110) with ESMTP id
	<20120302170314.VUUA18243.eastrmfepo101.cox.net@eastrmimpo210.cox.net>
	for <freebsd-arch@freebsd.org>; Fri, 2 Mar 2012 12:03:14 -0500
Received: from [192.168.3.22] ([70.177.172.35])
	by eastrmimpo210.cox.net with bizsmtp
	id gh3D1i0030mAvba02h3DbX; Fri, 02 Mar 2012 12:03:13 -0500
X-CT-Class: Clean
X-CT-Score: 0.00
X-CT-RefID: str=0001.0A02020A.4F50FD51.01D4,ss=1,re=0.000,fgs=0
X-CT-Spam: 0
X-Authority-Analysis: v=1.1 cv=SwD/Y8GpRdONdm5z1I4vXlgMxpglwSfl+jzqXqLOMWM=
	c=1 sm=1 a=f5xKl4ys9bwA:10 a=j9h4hM69ZBMA:10 a=G8Uczd0VNMoA:10
	a=Wajolswj7cQA:10 a=8nJEP1OIZ-IA:10 a=alU6Bxxa4qBWIf+k8j/ISQ==:17
	a=FP58Ms26AAAA:8 a=efqDOYLTNkgstY12pRcA:9 a=oRuVXIjJWgrHfGDOSYoA:7
	a=wPNLvfGTeEIA:10 a=alU6Bxxa4qBWIf+k8j/ISQ==:117
X-CM-Score: 0.00
Authentication-Results: cox.net; none
Message-ID: <4F50FD4D.9000106@cox.net>
Date: Fri, 02 Mar 2012 12:03:09 -0500
From: "John D. Hendrickson and Sara Darnell" <johnandsara2@cox.net>
User-Agent: Thunderbird 2.0.0.24 (X11/20100228)
MIME-Version: 1.0
To: freebsd-arch@freebsd.org
References: <4E18ABB1.4010304@cox.net> <20110709194639.GA4914@elie>
	<4E18EE60.7010402@cox.net> <20110710151354.GA25475@r500-debian>
In-Reply-To: <20110710151354.GA25475@r500-debian>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: dep-trace  v.  tsort  (mac ports depends support)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: johnandsara2@cox.net
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 02 Mar 2012 17:28:58 -0000

Hi,

BSD and Apple still need tsort(1) for portage, I believe.

Topological sorting isn't quite right for packaging.

Please see:  http://sourceforge.net/projects/dep-trace

It is a "drop-in" replacement (operates like /bin/tsort) but is right for pkg depends.

(i.e., for portage: you need to download the source, a particular compile order may be
required, and you sometimes get a "missing" or "loop in depends" message when attempting
to compile and install a package)
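
As a rough illustration of the ordering problem described above, a dependency
list can be ordered with a plain topological sort, and a loop in the
dependencies surfaces as a cycle error. This is only a sketch using Python's
standard graphlib module; it is not the algorithm used by dep-trace or
tsort(1), and the function name and package names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def build_order(depends):
    """Return package names so that each one comes after its dependencies.

    `depends` maps a package name to the list of names it depends on.
    Raises ValueError describing the cycle if the dependencies loop.
    """
    try:
        # static_order() yields every node after all of its predecessors.
        return list(TopologicalSorter(depends).static_order())
    except CycleError as err:
        # err.args[1] holds the nodes that form the detected cycle.
        cycle = " -> ".join(err.args[1])
        raise ValueError(f"loop in depends: {cycle}") from err


# Hypothetical packages: "app" needs "libfoo", which needs "libc".
order = build_order({"app": ["libfoo"], "libfoo": ["libc"], "libc": []})
print(order)
```

With a correct input graph the dependencies come out before the packages
that need them; with a circular graph the caller gets an explicit error
rather than a guessed ordering.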

I'm a Debian user, but I wish I had a BSD machine :) So I do not know a lot of BSD maintainer /
mailing list specifics.  Please give me a handicap there!

Thanks and thanks again,

	John

p.s.

(dep-trace itself has no dependencies (it is a single /bin program), has improvements, and is
"more hackable" than tsort when it comes to coding new ordering rules against lists, which I
feel is not as easy in tsort with its "loop detected attempting to recover" handling.)



From owner-freebsd-arch@FreeBSD.ORG  Fri Mar  2 17:38:51 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@FreeBSD.ORG
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id B46EE106566B
	for <arch@FreeBSD.ORG>; Fri,  2 Mar 2012 17:38:51 +0000 (UTC)
	(envelope-from das@FreeBSD.ORG)
Received: from zim.MIT.EDU (ZIM.MIT.EDU [18.95.3.101])
	by mx1.freebsd.org (Postfix) with ESMTP id 734898FC17
	for <arch@FreeBSD.ORG>; Fri,  2 Mar 2012 17:38:51 +0000 (UTC)
Received: from zim.MIT.EDU (localhost [127.0.0.1])
	by zim.MIT.EDU (8.14.5/8.14.2) with ESMTP id q22HGl6G030007;
	Fri, 2 Mar 2012 12:16:47 -0500 (EST) (envelope-from das@FreeBSD.ORG)
Received: (from das@localhost)
	by zim.MIT.EDU (8.14.5/8.14.2/Submit) id q22HGlXU030006;
	Fri, 2 Mar 2012 12:16:47 -0500 (EST) (envelope-from das@FreeBSD.ORG)
Date: Fri, 2 Mar 2012 12:16:47 -0500
From: David Schultz <das@FreeBSD.ORG>
To: Dag-Erling =?iso-8859-1?Q?Sm=F8rgrav?= <des@des.no>
Message-ID: <20120302171647.GA29850@zim.MIT.EDU>
Mail-Followup-To: Dag-Erling =?iso-8859-1?Q?Sm=F8rgrav?= <des@des.no>,
	Garrett Wollman <wollman@hergotha.csail.mit.edu>, arch@freebsd.org
References: <4F3C2D2D.5000402@FreeBSD.org> <4F3E78BA.4060203@FreeBSD.org>
	<864nupcuvl.fsf@ds4.des.no> <4F3E7B5A.20103@FreeBSD.org>
	<86zkchbff6.fsf@ds4.des.no> <4F3EADB5.7060008@FreeBSD.org>
	<20120223170918.GA79013@zim.MIT.EDU>
	<201202231822.q1NIMQOd020804@hergotha.csail.mit.edu>
	<201202231926.q1NJQPFa021654@hergotha.csail.mit.edu>
	<86d3958cqi.fsf@ds4.des.no>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <86d3958cqi.fsf@ds4.des.no>
Cc: arch@FreeBSD.ORG, Garrett Wollman <wollman@hergotha.csail.mit.edu>
Subject: Re: bsd/citrus iconv
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 02 Mar 2012 17:38:51 -0000

On Thu, Feb 23, 2012, Dag-Erling Smørgrav wrote:
> Garrett Wollman <wollman@hergotha.csail.mit.edu> writes:
> > You missed the bit on the next page:
> >
> > 	It is unspecified whether the libraries libc.a, libm.a,
> > 	librt.a, libpthread.a, libl.a, liby.a, or libxnet exist as
> > 	regular files. The implementation may accept as -l operands
> > 	names of objects that do not exist as regular files.
> 
> That's entirely academic unless you want to modify gcc and clang to
> automatically pull in libiconv.  The point is that if the iconv
> extension is implemented, it must be available without requiring
> additional -l options.

If the linker included libiconv automatically, would it be
possible to switch iconv implementations without recompiling, by
using libmap.conf?  Or is the ABI (e.g., type of iconv_t)
incompatible?  If the ABI is different, then we might as well
stick iconv in libc using weak symbols.

> It all boils down to this: do we aspire to SUS conformance?

I think it actually boils down to what the practical benefit is.
Does it create a compatibility nightmare for apps to have to use
the -liconv flag?  Do other platforms require it?  IIRC, we've
been patching ports to include the flag for years.

From owner-freebsd-arch@FreeBSD.ORG  Fri Mar  2 18:31:41 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 2BF2C106566B
	for <freebsd-arch@freebsd.org>; Fri,  2 Mar 2012 18:31:41 +0000 (UTC)
	(envelope-from ed@hoeg.nl)
Received: from mx0.hoeg.nl (mx0.hoeg.nl [IPv6:2a01:4f8:101:5343::aa])
	by mx1.freebsd.org (Postfix) with ESMTP id B86FB8FC14
	for <freebsd-arch@freebsd.org>; Fri,  2 Mar 2012 18:31:40 +0000 (UTC)
Received: by mx0.hoeg.nl (Postfix, from userid 1000)
	id EE4AB2A28CEE; Fri,  2 Mar 2012 19:31:38 +0100 (CET)
Date: Fri, 2 Mar 2012 19:31:38 +0100
From: Ed Schouten <ed@80386.nl>
To: "John D. Hendrickson and Sara Darnell" <johnandsara2@cox.net>
Message-ID: <20120302183138.GC32748@hoeg.nl>
References: <4E18ABB1.4010304@cox.net> <20110709194639.GA4914@elie>
	<4E18EE60.7010402@cox.net> <20110710151354.GA25475@r500-debian>
	<4F50FD4D.9000106@cox.net>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="cN+O50sc7gZAK+8F"
Content-Disposition: inline
In-Reply-To: <4F50FD4D.9000106@cox.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-arch@freebsd.org
Subject: Re: dep-trace  v.  tsort  (mac ports depends support)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 02 Mar 2012 18:31:41 -0000


--cN+O50sc7gZAK+8F
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi John,

* John D. Hendrickson and Sara Darnell <johnandsara2@cox.net>, 20120302 18:03:
> BSD and Apple needs tsort(1) for portage still I believe.
>
> Topological sorting isn't quite right packaging.
>
> [...]
>
> (ie, for portage: you need to dl source, order of compile may be
> required, sometimes gets missing message or "loop in depends" message
> when attempting to compile and install pkg)

But wait. Isn't this because of mis-use of tsort(1) by portage?

tsort(1) can give you any ordering you like, as long as you make sure
your input graph is correct.

-- 
 Ed Schouten <ed@80386.nl>
 WWW: http://80386.nl/

--cN+O50sc7gZAK+8F
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iQIcBAEBAgAGBQJPURIKAAoJEG5e2P40kaK7pTIQAKKfXXfZPI6k8LKV/TBW09f9
Kzv4wWbbSkNn/1kQ/1VYIBKGIKkuIP6kTjpZ1DrlpfrTTt99iVj329rNrrwrZZ6W
AUKnLkA7ddy4/sqRcMCeV0m8Z1QkCprgeVFuQ+Fr9RYIaVEhJkuE0FXTEJuctZ0k
Ol30KCrPVeCoNY89iarXv3/DKOKmWbF7MAVXtTXt2ucL8oWWAu/nLsXd5MFQi4HF
vjOi/D8nVaP4p/2fssPEBBT2U37LVwKL9uVcHm6FhJByJeA4GvcOgJTGUEpLvlSU
STn7BRV9NFBsd07Kaid8csPO7HTfCLYMpZuy6wsfLHjX/ghjNLs6DF8n/A9x6WhA
MK8DXyGGcUY8V2YTZ0EmWvurryi/RfcFsWOAj3/BtKUw158lnIn293FFGiitdegt
ZEI1Y7P7Ap2H3tihEQ5JLjk6xaAHPWvSaWb4oISd48V9kkLC0TJH+sDuyww6/CCD
ZR0ZnrWhY/4ptkcNuIDb+xxtJinqw2lFAut6I4HP0SYb0ehQmhYzLDsOz9vxKGUL
+Fgh4fdZHiiEF2KNY0iY62plaGMJkqncpb4ecRUXShibuMBeEqu4EifofSwFuOK7
4pwo1whiS/SFBgzF5vacERVn7WUOXir0ipt7Qu7zD4ZrlaaaQmV6VnL4NkdY9KRu
dtwgAuIADMuljFJk8PtC
=x52f
-----END PGP SIGNATURE-----

--cN+O50sc7gZAK+8F--

From owner-freebsd-arch@FreeBSD.ORG  Fri Mar  2 22:25:30 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2B1641065678
	for <freebsd-arch@freebsd.org>; Fri,  2 Mar 2012 22:25:30 +0000 (UTC)
	(envelope-from jilles@stack.nl)
Received: from mx1.stack.nl (relay04.stack.nl [IPv6:2001:610:1108:5010::107])
	by mx1.freebsd.org (Postfix) with ESMTP id 78A5C8FC26
	for <freebsd-arch@freebsd.org>; Fri,  2 Mar 2012 22:25:29 +0000 (UTC)
Received: from snail.stack.nl (snail.stack.nl [IPv6:2001:610:1108:5010::131])
	by mx1.stack.nl (Postfix) with ESMTP id 350B41DD633;
	Fri,  2 Mar 2012 23:25:27 +0100 (CET)
Received: by snail.stack.nl (Postfix, from userid 1677)
	id 1B3C728470; Fri,  2 Mar 2012 23:25:27 +0100 (CET)
Date: Fri, 2 Mar 2012 23:25:27 +0100
From: Jilles Tjoelker <jilles@stack.nl>
To: Sergey Kandaurov <pluknet@gmail.com>
Message-ID: <20120302222526.GB6416@stack.nl>
References: <4F4DC876.3010809@delphij.net>
	<CAE-mSOJU=hm8+-AC_oQmx+h2grv7PGaH7kNYKoT3GMePDPXsYg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAE-mSOJU=hm8+-AC_oQmx+h2grv7PGaH7kNYKoT3GMePDPXsYg@mail.gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: d@delphij.net, freebsd-arch@freebsd.org
Subject: Re: RFC: futimens(2) and utimensat(2)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 02 Mar 2012 22:25:30 -0000

On Wed, Feb 29, 2012 at 02:21:23PM +0300, Sergey Kandaurov wrote:
> On 29 February 2012 10:40, Xin Li <delphij@delphij.net> wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA256

> > These are required by IEEE Std 1003.1-2008.  Patchset at:

> > http://people.freebsd.org/~delphij/for_review/utimens.diff

> First, thank you very much for doing this.

> The ERRORS section for utimes(2) is still not updated (it does not exist).
> Funny but that was the most difficult part to implement these
> syscalls a year ago with the great help from jilles@.
> He could further comment on your patchset.

> Otherwise looks good and pretty similar to my work, though
> I didn't use a "const" modifier in my version for both functions
> and syscall definitions in syscall.master for some reasons.

> Further, I wrote a test to see how well the implementation detects
> EACCES/EPERM with different UTIME_OMIT/UTIME_NOW values passed. It shall
> pass all tests as shown in the table (stolen from somewhere at austingroup):

>   [a]    [b]      [c]
>  times  file     file
>  arg.    UID      is
>  NULL   owner   writable        Result
>  !NULL  !owner  !writable
> 
>  N      o          w            success
>  N      o          !w           success
>  N      !o         w            success
>  N      !o         !w           EACCES [1]
>  !N     o          w            success
>  !N     o          !w           success
>  !N     !o         w            EPERM [2]
>  !N     !o         !w           EPERM [3]

> Here NULL also covers cases when:
> - both fields are UTIME_NOW
> - both fields are UTIME_OMIT.

If both fields are UTIME_NOW, this shall be the same as a NULL pointer.
If both fields are UTIME_OMIT, the timestamps remain unchanged; no
permission check shall be performed for the file itself but may be
performed for the path prefix (an earlier patch from pluknet returned
success immediately).

Otherwise, the above is correct.

Note that if one field is UTIME_NOW and the other is UTIME_OMIT, there
is no special case: the caller must be owner or root.
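
The rules above can be summarized as a small decision function. This is only
a sketch of the table and the corrections in this message, not the kernel
implementation: the names Times and utimens_result are made up for
illustration, and permission checks on the path prefix are not modeled.

```python
from enum import Enum


class Times(Enum):
    NULL = "times pointer is NULL"
    BOTH_NOW = "both fields are UTIME_NOW"
    BOTH_OMIT = "both fields are UTIME_OMIT"
    OTHER = "explicit timestamps, or one UTIME_NOW and one UTIME_OMIT"


def utimens_result(times, is_owner, is_writable, is_root=False):
    """Return "success", "EACCES" or "EPERM" for the file itself."""
    if times is Times.BOTH_OMIT:
        # Timestamps stay unchanged; no permission check on the file.
        return "success"
    if is_owner or is_root:
        return "success"
    if times in (Times.NULL, Times.BOTH_NOW):
        # Setting "now" only needs write access to the file.
        return "success" if is_writable else "EACCES"
    # Anything else requires the caller to be owner or root.
    return "EPERM"


for times in Times:
    for owner in (True, False):
        for writable in (True, False):
            print(times.name, owner, writable,
                  utimens_result(times, owner, writable))
```

Treating the mixed UTIME_NOW/UTIME_OMIT case as OTHER reflects the note that
it gets no special handling.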

-- 
Jilles Tjoelker

From owner-freebsd-arch@FreeBSD.ORG  Sat Mar  3 06:24:44 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 5845E1065674;
	Sat,  3 Mar 2012 06:24:44 +0000 (UTC)
	(envelope-from tim@kientzle.com)
Received: from monday.kientzle.com (99-115-135-74.uvs.sntcca.sbcglobal.net
	[99.115.135.74])
	by mx1.freebsd.org (Postfix) with ESMTP id 3064C8FC0C;
	Sat,  3 Mar 2012 06:24:44 +0000 (UTC)
Received: (from root@localhost)
	by monday.kientzle.com (8.14.4/8.14.4) id q235PWPE069176;
	Sat, 3 Mar 2012 05:25:32 GMT (envelope-from tim@kientzle.com)
Received: from [192.168.2.119] (CiscoE3000 [192.168.1.65])
	by kientzle.com with SMTP id najztczxenqiu28mc823p34z56;
	Sat, 03 Mar 2012 05:25:32 +0000 (UTC)
	(envelope-from tim@kientzle.com)
Mime-Version: 1.0 (Apple Message framework v1257)
Content-Type: text/plain; charset=windows-1252
From: Tim Kientzle <tim@kientzle.com>
In-Reply-To: <20120302171647.GA29850@zim.MIT.EDU>
Date: Fri, 2 Mar 2012 21:25:32 -0800
Content-Transfer-Encoding: quoted-printable
Message-Id: <B1CA3621-0760-4A2D-BF52-F12DA2164810@kientzle.com>
References: <4F3C2D2D.5000402@FreeBSD.org> <4F3E78BA.4060203@FreeBSD.org>
	<864nupcuvl.fsf@ds4.des.no> <4F3E7B5A.20103@FreeBSD.org>
	<86zkchbff6.fsf@ds4.des.no> <4F3EADB5.7060008@FreeBSD.org>
	<20120223170918.GA79013@zim.MIT.EDU>
	<201202231822.q1NIMQOd020804@hergotha.csail.mit.edu>
	<201202231926.q1NJQPFa021654@hergotha.csail.mit.edu>
	<86d3958cqi.fsf@ds4.des.no> <20120302171647.GA29850@zim.MIT.EDU>
To: David Schultz <das@freebsd.org>
X-Mailer: Apple Mail (2.1257)
Cc: =?iso-8859-1?Q?Dag-Erling_Sm=F8rgrav?= <des@des.no>,
	Garrett Wollman <wollman@hergotha.csail.mit.edu>, arch@freebsd.org
Subject: Re: bsd/citrus iconv
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Mar 2012 06:24:44 -0000


On Mar 2, 2012, at 9:16 AM, David Schultz wrote:

> On Thu, Feb 23, 2012, Dag-Erling Smørgrav wrote:
>> Garrett Wollman <wollman@hergotha.csail.mit.edu> writes:
>>> You missed the bit on the next page:
>>>
>>> 	It is unspecified whether the libraries libc.a, libm.a,
>>> 	librt.a, libpthread.a, libl.a, liby.a, or libxnet exist as
>>> 	regular files. The implementation may accept as -l operands
>>> 	names of objects that do not exist as regular files.
>>
>> That's entirely academic unless you want to modify gcc and clang to
>> automatically pull in libiconv.  The point is that if the iconv
>> extension is implemented, it must be available without requiring
>> additional -l options.
>
> If the linker included libiconv automatically, would it be
> possible to switch iconv implementations without recompiling, by
> using libmap.conf?  Or is the ABI (e.g., type of iconv_t)
> incompatible?  ...

Very incompatible.  The functions actually have
different names in the library.

So switching implementations via library mapping is not
going to work.

Tim


From owner-freebsd-arch@FreeBSD.ORG  Sat Mar  3 12:48:12 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id B2F8A106564A;
	Sat,  3 Mar 2012 12:48:12 +0000 (UTC)
	(envelope-from rmh.aybabtu@gmail.com)
Received: from mail-wi0-f182.google.com (mail-wi0-f182.google.com
	[209.85.212.182])
	by mx1.freebsd.org (Postfix) with ESMTP id D17548FC17;
	Sat,  3 Mar 2012 12:48:11 +0000 (UTC)
Received: by wibhn6 with SMTP id hn6so1448630wib.13
	for <multiple recipients>; Sat, 03 Mar 2012 04:48:11 -0800 (PST)
Received-SPF: pass (google.com: domain of rmh.aybabtu@gmail.com designates
	10.180.99.100 as permitted sender) client-ip=10.180.99.100; 
Authentication-Results: mr.google.com;
	spf=pass (google.com: domain of rmh.aybabtu@gmail.com
	designates 10.180.99.100 as permitted sender)
	smtp.mail=rmh.aybabtu@gmail.com;
	dkim=pass header.i=rmh.aybabtu@gmail.com
Received: from mr.google.com ([10.180.99.100])
	by 10.180.99.100 with SMTP id ep4mr3839297wib.7.1330778891004 (num_hops
	= 1); Sat, 03 Mar 2012 04:48:11 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=sender:date:from:to:cc:subject:message-id:references:mime-version
	:content-type:content-disposition:in-reply-to:user-agent;
	bh=hah0jQlc7/2D5VHQmuISp5eIqbGpXACkxnud/Wcy1u4=;
	b=F4Z6R2hIvCENnlUaKswTirjNOyUXjJmITpooBMt3xUNAJUdtaExLkpOYfTsee0a08E
	ZxoWD7E/hS75fmiESEe7p4Z5mNQFo2a8w+AiYXMRLjx2L/IDlOLIYHA+5UZV487scE4y
	MgekF9x4ED15Z8Xm54la/KXC9zFIidfOS6JvEy6k+/8bf73Ll+c/LbJ8iKhhLfdS2w51
	p+qTXtqhOo08UNBZ5OndBL6hATIeloc4R87rGUqlFwMVHc++PSBXRhsAO0Qx1aQJWe8p
	u1Z7E3s53JsxZUz2D6xH723xRT431V5Y5fIQGl2kaRrod38tXRQicWvd/QZ/RhDaNsdN
	zO+A==
Received: by 10.180.99.100 with SMTP id ep4mr3026961wib.7.1330778890878;
	Sat, 03 Mar 2012 04:48:10 -0800 (PST)
Received: from thorin (7.Red-81-38-33.dynamicIP.rima-tde.net. [81.38.33.7])
	by mx.google.com with ESMTPS id gp8sm9306580wib.5.2012.03.03.04.48.07
	(version=TLSv1/SSLv3 cipher=OTHER);
	Sat, 03 Mar 2012 04:48:09 -0800 (PST)
Sender: Robert Millan <rmh.aybabtu@gmail.com>
Received: from rmh by thorin with local (Exim 4.72)
	(envelope-from <rmh@thorin>)
	id 1S3oNZ-0001IX-WC; Sat, 03 Mar 2012 13:48:06 +0100
Date: Sat, 3 Mar 2012 13:48:05 +0100
From: Robert Millan <rmh@freebsd.org>
To: Hans Petter Selasky <hselasky@c2i.net>
Message-ID: <20120303124805.GA4725@thorin>
References: <CAOfDtXNDXV-hM5t56XKj6-m-Bc=SSZsmB7JnEXsoDGdF2DEuqw@mail.gmail.com>
	<201202181720.27135.hselasky@c2i.net>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
	protocol="application/pgp-signature"; boundary="ZfOjI3PrQbgiZnxM"
Content-Disposition: inline
In-Reply-To: <201202181720.27135.hselasky@c2i.net>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: Kostik Belousov <kostikbel@gmail.com>, Adrian Chadd <adrian@freebsd.org>,
	freebsd-usb@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: Exclude USB drivers from main kernel image?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Mar 2012 12:48:12 -0000


--ZfOjI3PrQbgiZnxM
Content-Type: multipart/mixed; boundary="EeQfGwPcQSOJBaQU"
Content-Disposition: inline


--EeQfGwPcQSOJBaQU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Feb 18, 2012 at 05:20:27PM +0100, Hans Petter Selasky wrote:
> The /etc/devd/usb.conf is regularly updated, though not automatically.
> It should auto-load most kinds of devices. The only additional case that
> comes to mind is that the USB serial console will not be active until
> devd has executed, if that is enabled.

If early USB serial output is desired, it can be enabled by loading the
module from the bootloader. Is that an acceptable trade-off?

> Your patch looks OK. Adding ARCH @
>
> Instead of commenting out, I would just remove those lines.

Here's a new patch that removes the lines instead of commenting them out.

Consistent with that, it also removes a few lines which were already
commented out, using the same criteria.

Also, it disables a few more USB drivers. Due to an oversight my previous
patch didn't disable all drivers that devd can handle.

Patch is tested with "make universe" on HEAD.

-- 
Robert Millan

--EeQfGwPcQSOJBaQU
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="usb.diff"
Content-Transfer-Encoding: quoted-printable

Index: sys/amd64/conf/GENERIC
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/amd64/conf/GENERIC	(revision 232404)
+++ sys/amd64/conf/GENERIC	(working copy)
@@ -302,39 +302,8 @@
 device		ehci		# EHCI PCI->USB interface (USB 2.0)
 device		xhci		# XHCI PCI->USB interface (USB 3.0)
 device		usb		# USB Bus (required)
-#device		udbp		# USB Double Bulk Pipe devices (needs netgraph)
-device		uhid		# "Human Interface Devices"
 device		ukbd		# Keyboard
-device		ulpt		# Printer
 device		umass		# Disks/Mass storage - Requires scbus and da
-device		ums		# Mouse
-device		urio		# Diamond Rio 500 MP3 player
-# USB Serial devices
-device		u3g		# USB-based 3G modems (Option, Huawei, Sierra)
-device		uark		# Technologies ARK3116 based serial adapters
-device		ubsa		# Belkin F5U103 and compatible serial adapters
-device		uftdi		# For FTDI usb serial adapters
-device		uipaq		# Some WinCE based devices
-device		uplcom		# Prolific PL-2303 serial adapters
-device		uslcom		# SI Labs CP2101/CP2102 serial adapters
-device		uvisor		# Visor and Palm devices
-device		uvscom		# USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-device		aue		# ADMtek USB Ethernet
-device		axe		# ASIX Electronics USB Ethernet
-device		cdce		# Generic USB over Ethernet
-device		cue		# CATC USB Ethernet
-device		kue		# Kawasaki LSI USB Ethernet
-device		rue		# RealTek RTL8150 USB Ethernet
-device		udav		# Davicom DM9601E USB
-# USB Wireless
-device		rum		# Ralink Technology RT2501USB wireless NICs
-device		run		# Ralink Technology RT2700/RT2800/RT3000 NICs.
-device		uath		# Atheros AR5523 wireless NICs
-device		upgt		# Conexant/Intersil PrismGT wireless NICs.
-device		ural		# Ralink Technology RT2500USB wireless NICs
-device		urtw		# Realtek RTL8187B/L wireless NICs
-device		zyd		# ZyDAS zd1211/zd1211b wireless NICs
=20
 # FireWire support
 device		firewire	# FireWire bus code
@@ -350,7 +319,6 @@
 device		snd_es137x	# Ensoniq AudioPCI ES137x
 device		snd_hda		# Intel High Definition Audio
 device		snd_ich		# Intel, NVidia and other ICH AC'97 Audio
-device		snd_uaudio	# USB Audio
 device		snd_via8233	# VIA VT8233x Audio
=20
 # MMC/SD
Index: sys/arm/conf/KB920X
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/arm/conf/KB920X	(revision 232404)
+++ sys/arm/conf/KB920X	(working copy)
@@ -99,34 +99,8 @@
 options 	USB_DEBUG	# enable debug msgs
 device		ohci		# OHCI localbus->USB interface
 device		usb		# USB Bus (required)
-#device		udbp		# USB Double Bulk Pipe devices
-device		uhid		# "Human Interface Devices"
-device		ulpt		# Printer
 device		umass		# Disks/Mass storage - Requires scbus and da
-device		urio		# Diamond Rio 500 MP3 player
-# USB Serial devices
-device		uark		# Technologies ARK3116 based serial adapters
-device		ubsa		# Belkin F5U103 and compatible serial adapters
-device		uftdi		# For FTDI usb serial adapters
-device		uipaq		# Some WinCE based devices
-device		uplcom		# Prolific PL-2303 serial adapters
-device		uslcom		# SI Labs CP2101/CP2102 serial adapters
-device		uvisor		# Visor and Palm devices
-device		uvscom		# USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-device		miibus
-device		aue		# ADMtek USB Ethernet
-device		axe		# ASIX Electronics USB Ethernet
-device		cdce		# Generic USB over Ethernet
-device		cue		# CATC USB Ethernet
-device		kue		# Kawasaki LSI USB Ethernet
-device		rue		# RealTek RTL8150 USB Ethernet
-device		udav		# Davicom DM9601E USB
-# USB Wireless
-device		rum		# Ralink Technology RT2501USB wireless NICs
-device		uath		# Atheros AR5523 wireless NICs
-device		ural		# Ralink Technology RT2500USB wireless NICs
-device		zyd		# ZyDAS zd1211/zd1211b wireless NICs
+device		miibus		# Required for USB Ethernet
 # SCSI peripherals
 device		scbus		# SCSI bus (required for SCSI)
 device		da		# Direct Access (disks)
Index: sys/arm/conf/QILA9G20
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/arm/conf/QILA9G20	(revision 232404)
+++ sys/arm/conf/QILA9G20	(working copy)
@@ -124,26 +124,8 @@
 device		ohci		# OHCI localbus->USB interface
 device		usb		# USB Bus (required)
 device		umass		# Disks/Mass storage - Requires scbus and da
-device		uhid		# "Human Interface Devices"
-#device		ulpt		# Printer
-#device		udbp		# USB Double Bulk Pipe devices
+device		miibus		# Required for USB Ethernet
=20
-# USB Ethernet, requires miibus
-device		miibus
-#device		aue		# ADMtek USB Ethernet
-#device		axe		# ASIX Electronics USB Ethernet
-#device		cdce		# Generic USB over Ethernet
-#device		cue		# CATC USB Ethernet
-#device		kue		# Kawasaki LSI USB Ethernet
-#device		rue		# RealTek RTL8150 USB Ethernet
-device		udav		# Davicom DM9601E USB
-
-# USB Wireless
-#device		rum		# Ralink Technology RT2501USB wireless NICs
-#device		uath		# Atheros AR5523 wireless NICs
-#device		ural		# Ralink Technology RT2500USB wireless NICs
-#device		zyd		# ZyDAS zd1211/zd1211b wireless NICs
-
 # Wireless NIC cards
 #device		wlan		# 802.11 support
 #device		wlan_wep	# 802.11 WEP support
Index: sys/arm/conf/HL200
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/arm/conf/HL200	(revision 232404)
+++ sys/arm/conf/HL200	(working copy)
@@ -98,35 +98,9 @@
 options 	USB_DEBUG	# enable debug msgs
 device		ohci		# OHCI localbus->USB interface
 device		usb		# USB Bus (required)
-#device		udbp		# USB Double Bulk Pipe devices
-device		uhid		# "Human Interface Devices"
-device		ulpt		# Printer
 device		umass		# Disks/Mass storage - Requires scbus and da
-device		urio		# Diamond Rio 500 MP3 player
-# USB Serial devices
-device		uark		# Technologies ARK3116 based serial adapters
-device		ubsa		# Belkin F5U103 and compatible serial adapters
 #device		ubser		# not yet converted.
-device		uftdi		# For FTDI usb serial adapters
-device		uipaq		# Some WinCE based devices
-device		uplcom		# Prolific PL-2303 serial adapters
-device		uslcom		# SI Labs CP2101/CP2102 serial adapters
-device		uvisor		# Visor and Palm devices
-device		uvscom		# USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-device		miibus
-device		aue		# ADMtek USB Ethernet
-device		axe		# ASIX Electronics USB Ethernet
-device		cdce		# Generic USB over Ethernet
-device		cue		# CATC USB Ethernet
-device		kue		# Kawasaki LSI USB Ethernet
-device		rue		# RealTek RTL8150 USB Ethernet
-device		udav		# Davicom DM9601E USB
-# USB Wireless
-device		rum		# Ralink Technology RT2501USB wireless NICs
-device		uath		# Atheros AR5523 wireless NICs
-device		ural		# Ralink Technology RT2500USB wireless NICs
-device		zyd		# ZyDAS zd1211/zd1211b wireless NICs
+device		miibus		# Required for USB Ethernet
 # SCSI peripherals
 device		scbus		# SCSI bus (required for SCSI)
 device		da		# Direct Access (disks)
Index: sys/arm/conf/HL201
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/arm/conf/HL201	(revision 232404)
+++ sys/arm/conf/HL201	(working copy)
@@ -99,25 +99,8 @@
 # USB support
 #device		ohci		# OHCI localbus->USB interface
 device		usb		# USB Bus (required)
-#device		udbp		# USB Double Bulk Pipe devices
-device		uhid		# "Human Interface Devices"
-#device		ulpt		# Printer
 device		umass		# Disks/Mass storage - Requires scbus and da
-
-# USB Ethernet, requires miibus
-device		miibus
-#device		aue		# ADMtek USB Ethernet
-#device		axe		# ASIX Electronics USB Ethernet
-#device		cdce		# Generic USB over Ethernet
-#device		cue		# CATC USB Ethernet
-#device		kue		# Kawasaki LSI USB Ethernet
-#device		rue		# RealTek RTL8150 USB Ethernet
-device		udav		# Davicom DM9601E USB
-# USB Wireless
-#device		rum		# Ralink Technology RT2501USB wireless NICs
-#device		uath		# Atheros AR5523 wireless NICs
-#device		ural		# Ralink Technology RT2500USB wireless NICs
-#device		zyd		# ZyDAS zd1211/zd1211b wireless NICs
+device		miibus		# Required for USB Ethernet
 # SCSI peripherals
 device		scbus		# SCSI bus (required for SCSI)
 device		da		# Direct Access (disks)
Index: sys/arm/conf/SAM9G20EK
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/arm/conf/SAM9G20EK	(revision 232404)
+++ sys/arm/conf/SAM9G20EK	(working copy)
@@ -124,26 +124,8 @@
 device		ohci		# OHCI localbus->USB interface
 device		usb		# USB Bus (required)
 device		umass		# Disks/Mass storage - Requires scbus and da
-device		uhid		# "Human Interface Devices"
-#device		ulpt		# Printer
-#device		udbp		# USB Double Bulk Pipe devices
+device		miibus		# Required for USB Ethernet
=20
-# USB Ethernet, requires miibus
-device		miibus
-#device		aue		# ADMtek USB Ethernet
-#device		axe		# ASIX Electronics USB Ethernet
-#device		cdce		# Generic USB over Ethernet
-#device		cue		# CATC USB Ethernet
-#device		kue		# Kawasaki LSI USB Ethernet
-#device		rue		# RealTek RTL8150 USB Ethernet
-device		udav		# Davicom DM9601E USB
-
-# USB Wireless
-#device		rum		# Ralink Technology RT2501USB wireless NICs
-#device		uath		# Atheros AR5523 wireless NICs
-#device		ural		# Ralink Technology RT2500USB wireless NICs
-#device		zyd		# ZyDAS zd1211/zd1211b wireless NICs
-
 # Wireless NIC cards
 #device		wlan		# 802.11 support
 #device		wlan_wep	# 802.11 WEP support
Index: sys/i386/conf/XBOX
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/i386/conf/XBOX	(revision 232404)
+++ sys/i386/conf/XBOX	(working copy)
@@ -80,20 +80,10 @@
 #device		uhci		# UHCI PCI->USB interface
 device		ohci		# OHCI PCI->USB interface
 device		usb		# USB Bus (required)
-device		uhid		# "Human Interface Devices"
 device		ukbd		# Keyboard
-device		ulpt		# Printer
 device		umass		# Disks/Mass storage - Requires scbus and da
-device		ums		# Mouse
-device		urio		# Diamond Rio 500 MP3 player
=20
 device		miibus
-device		aue		# ADMtek USB Ethernet
-device		axe		# ASIX Electronics USB Ethernet
-device		cdce		# Generic USB over Ethernet
-device		cue		# CATC USB Ethernet
-device		kue		# Kawasaki LSI USB Ethernet
-device		rue		# RealTek RTL8150 USB Ethernet
=20
 device		sound
 device		snd_ich		# nForce audio
Index: sys/i386/conf/GENERIC
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/i386/conf/GENERIC	(revision 232404)
+++ sys/i386/conf/GENERIC	(working copy)
@@ -315,39 +315,8 @@
 device		ehci		# EHCI PCI->USB interface (USB 2.0)
 device		xhci		# XHCI PCI->USB interface (USB 3.0)
 device		usb		# USB Bus (required)
-#device		udbp		# USB Double Bulk Pipe devices (needs netgraph)
-device		uhid		# "Human Interface Devices"
 device		ukbd		# Keyboard
-device		ulpt		# Printer
 device		umass		# Disks/Mass storage - Requires scbus and da
-device		ums		# Mouse
-device		urio		# Diamond Rio 500 MP3 player
-# USB Serial devices
-device		u3g		# USB-based 3G modems (Option, Huawei, Sierra)
-device		uark		# Technologies ARK3116 based serial adapters
-device		ubsa		# Belkin F5U103 and compatible serial adapters
-device		uftdi		# For FTDI usb serial adapters
-device		uipaq		# Some WinCE based devices
-device		uplcom		# Prolific PL-2303 serial adapters
-device		uslcom		# SI Labs CP2101/CP2102 serial adapters
-device		uvisor		# Visor and Palm devices
-device		uvscom		# USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-device		aue		# ADMtek USB Ethernet
-device		axe		# ASIX Electronics USB Ethernet
-device		cdce		# Generic USB over Ethernet
-device		cue		# CATC USB Ethernet
-device		kue		# Kawasaki LSI USB Ethernet
-device		rue		# RealTek RTL8150 USB Ethernet
-device		udav		# Davicom DM9601E USB
-# USB Wireless
-device		rum		# Ralink Technology RT2501USB wireless NICs
-device		run		# Ralink Technology RT2700/RT2800/RT3000 NICs.
-device		uath		# Atheros AR5523 wireless NICs
-device		upgt		# Conexant/Intersil PrismGT wireless NICs.
-device		ural		# Ralink Technology RT2500USB wireless NICs
-device		urtw		# Realtek RTL8187B/L wireless NICs
-device		zyd		# ZyDAS zd1211/zd1211b wireless NICs
=20
 # FireWire support
 device		firewire	# FireWire bus code
@@ -363,7 +332,6 @@
 device		snd_es137x	# Ensoniq AudioPCI ES137x
 device		snd_hda		# Intel High Definition Audio
 device		snd_ich		# Intel, NVidia and other ICH AC'97 Audio
-device		snd_uaudio	# USB Audio
 device		snd_via8233	# VIA VT8233x Audio
=20
 # MMC/SD
Index: sys/ia64/conf/GENERIC
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/ia64/conf/GENERIC	(revision 232404)
+++ sys/ia64/conf/GENERIC	(working copy)
@@ -127,11 +127,8 @@
 device		ehci		# EHCI host controller
 device		ohci		# OHCI PCI->USB interface
 device		uhci		# UHCI PCI->USB interface
-device		uhid		# Human Interface Devices
 device		ukbd		# Keyboard
-device		ulpt		# Printer
 device		umass		# Disks/Mass storage (need scbus & da)
-device		ums		# Mouse
=20
 # PCI Ethernet NICs.
 device		de		# DEC/Intel DC21x4x (``Tulip'')
@@ -162,25 +159,6 @@
 device		vge		# VIA VT612x gigabit Ethernet
 device		xl		# 3Com 3c90x ("Boomerang", "Cyclone")
=20
-# USB Ethernet
-device		aue		# ADMtek USB Ethernet
-device		axe		# ASIX Electronics USB Ethernet
-device		cdce		# Generic USB over Ethernet
-device		cue		# CATC USB Ethernet
-device		kue		# Kawasaki LSI USB Ethernet
-device		rue		# RealTek RTL8150 USB Ethernet
-device		udav		# Davicom DM9601E USB
-
-# USB Serial
-device		uark		# Technologies ARK3116 based serial adapters
-device		ubsa		# Belkin F5U103 and compatible serial adapters
-device		uftdi		# For FTDI usb serial adapters
-device		uipaq		# Some WinCE based devices
-device		uplcom		# Prolific PL-2303 serial adapters
-device		uslcom		# SI Labs CP2101/CP2102 serial adapters
-device		uvisor		# Visor and Palm devices
-device		uvscom		# USB serial support for DDI pocket's PHS
-
 # Wireless NIC cards.
 # The wlan(4) module assumes this, so just define it so it
 # at least correctly loads.
Index: sys/mips/conf/XLRN32
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/mips/conf/XLRN32	(revision 232404)
+++ sys/mips/conf/XLRN32	(working copy)
@@ -104,9 +104,7 @@
 device          ehci            # EHCI PCI->USB interface (USB 2.0)
 device          usb             # USB Bus (required)
 options 	USB_DEBUG	# enable debug msgs
-#device         udbp            # USB Double Bulk Pipe devices
 #device          ugen            # Generic
-#device          uhid            # "Human Interface Devices"
 device          umass           # Disks/Mass storage - Requires scbus and da
=20
 #device		cfi
Index: sys/mips/conf/XLR64
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/mips/conf/XLR64	(revision 232404)
+++ sys/mips/conf/XLR64	(working copy)
@@ -103,7 +103,6 @@
 device		ehci		# EHCI PCI->USB interface (USB 2.0)
 device		usb		# USB Bus (required)
 options 	USB_DEBUG		# enable debug msgs
-#device		uhid		# "Human Interface Devices"
 device		umass		# Disks/Mass storage - Requires scbus and da
=20
 #device		cfi
Index: sys/mips/conf/std.XLP
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/mips/conf/std.XLP	(revision 232404)
+++ sys/mips/conf/std.XLP	(working copy)
@@ -81,7 +81,6 @@
 device		ehci			# EHCI PCI->USB interface (USB 2.0)
 #options 	USB_DEBUG		# enable debug msgs
 #device		ugen			# Generic
-#device		uhid			# "Human Interface Devices"
 device		umass			# Requires scbus and da
=20
 options 	FDT
Index: sys/mips/conf/XLR
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/mips/conf/XLR	(revision 232404)
+++ sys/mips/conf/XLR	(working copy)
@@ -128,7 +128,6 @@
 device		ehci		# EHCI PCI->USB interface (USB 2.0)
 device		usb		# USB Bus (required)
 #options 	USB_DEBUG	# enable debug msgs
-#device		uhid		# "Human Interface Devices"
 device		umass		# Disks/Mass storage - Requires scbus and da
=20
 #device		cfi
Index: sys/mips/conf/OCTEON1
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/mips/conf/OCTEON1	(revision 232404)
+++ sys/mips/conf/OCTEON1	(working copy)
@@ -267,32 +267,4 @@
 device		ohci		# OHCI PCI->USB interface
 device		ehci		# EHCI PCI->USB interface (USB 2.0)
 device		usb		# USB Bus (required)
-#device		udbp		# USB Double Bulk Pipe devices
-device		uhid		# "Human Interface Devices"
-device		ulpt		# Printer
 device		umass		# Disks/Mass storage - Requires scbus and da
-device		ums		# Mouse
-device		urio		# Diamond Rio 500 MP3 player
-# USB Serial devices
-device		u3g		# USB-based 3G modems (Option, Huawei, Sierra)
-device		uark		# Technologies ARK3116 based serial adapters
-device		ubsa		# Belkin F5U103 and compatible serial adapters
-device		uftdi		# For FTDI usb serial adapters
-device		uipaq		# Some WinCE based devices
-device		uplcom		# Prolific PL-2303 serial adapters
-device		uslcom		# SI Labs CP2101/CP2102 serial adapters
-device		uvisor		# Visor and Palm devices
-device		uvscom		# USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-device		aue		# ADMtek USB Ethernet
-device		axe		# ASIX Electronics USB Ethernet
-device		cdce		# Generic USB over Ethernet
-device		cue		# CATC USB Ethernet
-device		kue		# Kawasaki LSI USB Ethernet
-device		rue		# RealTek RTL8150 USB Ethernet
-device		udav		# Davicom DM9601E USB
-# USB Wireless
-device		rum		# Ralink Technology RT2501USB wireless NICs
-device		uath		# Atheros AR5523 wireless NICs
-device		ural		# Ralink Technology RT2500USB wireless NICs
-device		zyd		# ZyDAS zd1211/zd1211b wireless NICs
Index: sys/pc98/conf/GENERIC
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/pc98/conf/GENERIC	(revision 232404)
+++ sys/pc98/conf/GENERIC	(working copy)
@@ -239,36 +239,9 @@
 #device		ohci		# OHCI PCI->USB interface
 #device		ehci		# EHCI PCI->USB interface (USB 2.0)
 #device		usb		# USB Bus (required)
-#device		udbp		# USB Double Bulk Pipe devices (needs netgraph)
-#device		uhid		# "Human Interface Devices"
 #device		ukbd		# Keyboard
-#device		ulpt		# Printer
 #device		umass		# Disks/Mass storage - Requires scbus and da
-#device		ums		# Mouse
-#device		urio		# Diamond Rio 500 MP3 player
-# USB Serial devices
-#device		uark		# Technologies ARK3116 based serial adapters
-#device		ubsa		# Belkin F5U103 and compatible serial adapters
 #device		ubser		# BWCT console serial adapters
-#device		uftdi		# For FTDI usb serial adapters
-#device		uipaq		# Some WinCE based devices
-#device		uplcom		# Prolific PL-2303 serial adapters
-#device		uslcom		# SI Labs CP2101/CP2102 serial adapters
-#device		uvisor		# Visor and Palm devices
-#device		uvscom		# USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-#device		aue		# ADMtek USB Ethernet
-#device		axe		# ASIX Electronics USB Ethernet
-#device		cdce		# Generic USB over Ethernet
-#device		cue		# CATC USB Ethernet
-#device		kue		# Kawasaki LSI USB Ethernet
-#device		rue		# RealTek RTL8150 USB Ethernet
-#device		udav		# Davicom DM9601E USB
-# USB Wireless
-#device		rum		# Ralink Technology RT2501USB wireless NICs
-#device		uath		# Atheros AR5523 wireless NICs
-#device		ural		# Ralink Technology RT2500USB wireless NICs
-#device		zyd		# ZyDAS zd1211/zd1211b wireless NICs
=20
 # FireWire support
 #device		firewire	# FireWire bus code
@@ -280,4 +253,3 @@
 #device		snd_mss		# Microsoft Sound System
 #device		"snd_sb16"	# Sound Blaster 16
 #device		snd_sbc		# Sound Blaster
-#device		snd_uaudio	# USB Audio
Index: sys/powerpc/conf/GENERIC64
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/powerpc/conf/GENERIC64	(revision 232404)
+++ sys/powerpc/conf/GENERIC64	(working copy)
@@ -156,19 +156,9 @@
 device		ohci		# OHCI PCI->USB interface
 device		ehci		# EHCI PCI->USB interface
 device		usb		# USB Bus (required)
-device		uhid		# "Human Interface Devices"
 device		ukbd		# Keyboard
 options 	KBD_INSTALL_CDEV # install a CDEV entry in /dev
-device		ulpt		# Printer
 device		umass		# Disks/Mass storage - Requires scbus and da0
-device		ums		# Mouse
-device		urio		# Diamond Rio 500 MP3 player
-# USB Ethernet
-device		aue		# ADMtek USB Ethernet
-device		axe		# ASIX Electronics USB Ethernet
-device		cdce		# Generic USB over Ethernet
-device		cue		# CATC USB Ethernet
-device		kue		# Kawasaki LSI USB Ethernet
=20
 # Wireless NIC cards
 options         IEEE80211_SUPPORT_MESH
@@ -197,5 +187,4 @@
 # Sound support
 device		sound		# Generic sound driver (required)
 device		snd_ai2s	# Apple I2S audio
-device		snd_uaudio	# USB Audio
=20
Index: sys/powerpc/conf/GENERIC
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/powerpc/conf/GENERIC	(revision 232404)
+++ sys/powerpc/conf/GENERIC	(working copy)
@@ -159,20 +159,9 @@
 device		ohci		# OHCI PCI->USB interface
 device		ehci		# EHCI PCI->USB interface
 device		usb		# USB Bus (required)
-device		uhid		# "Human Interface Devices"
 device		ukbd		# Keyboard
 options 	KBD_INSTALL_CDEV # install a CDEV entry in /dev
-device		ulpt		# Printer
 device		umass		# Disks/Mass storage - Requires scbus and da0
-device		ums		# Mouse
-device		atp		# Apple USB touchpad
-device		urio		# Diamond Rio 500 MP3 player
-# USB Ethernet
-device		aue		# ADMtek USB Ethernet
-device		axe		# ASIX Electronics USB Ethernet
-device		cdce		# Generic USB over Ethernet
-device		cue		# CATC USB Ethernet
-device		kue		# Kawasaki LSI USB Ethernet
=20
 # Wireless NIC cards
 options		IEEE80211_SUPPORT_MESH
@@ -205,5 +194,4 @@
 device		sound		# Generic sound driver (required)
 device		snd_ai2s	# Apple I2S audio
 device		snd_davbus	# Apple DAVBUS audio
-device		snd_uaudio	# USB Audio
=20
Index: sys/sparc64/conf/GENERIC
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- sys/sparc64/conf/GENERIC	(revision 232404)
+++ sys/sparc64/conf/GENERIC	(working copy)
@@ -235,35 +235,8 @@
 device		ohci		# OHCI PCI->USB interface
 device		ehci		# EHCI PCI->USB interface (USB 2.0)
 device		usb		# USB Bus (required)
-#device		udbp		# USB Double Bulk Pipe devices (needs netgraph)
-device		uhid		# "Human Interface Devices"
 device		ukbd		# Keyboard
-device		ulpt		# Printer
 device		umass		# Disks/Mass storage - Requires scbus and da
-device		ums		# Mouse
-device		urio		# Diamond Rio 500 MP3 player
-# USB Serial devices
-device		uark		# Technologies ARK3116 based serial adapters
-device		ubsa		# Belkin F5U103 and compatible serial adapters
-device		uftdi		# For FTDI usb serial adapters
-device		uipaq		# Some WinCE based devices
-device		uplcom		# Prolific PL-2303 serial adapters
-device		uslcom		# SI Labs CP2101/CP2102 serial adapters
-device		uvisor		# Visor and Palm devices
-device		uvscom		# USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-device		aue		# ADMtek USB Ethernet
-device		axe		# ASIX Electronics USB Ethernet
-device		cdce		# Generic USB over Ethernet
-device		cue		# CATC USB Ethernet
-device		kue		# Kawasaki LSI USB Ethernet
-device		rue		# RealTek RTL8150 USB Ethernet
-device		udav		# Davicom DM9601E USB
-# USB Wireless
-device		rum		# Ralink Technology RT2501USB wireless NICs
-device		uath		# Atheros AR5523 wireless NICs
-device		ural		# Ralink Technology RT2500USB wireless NICs
-device		zyd		# ZyDAS zd1211/zd1211b wireless NICs
=20
 # FireWire support
 device		firewire	# FireWire bus code
@@ -279,4 +252,3 @@
 device		snd_audiocs	# Crystal Semiconductor CS4231
 device		snd_es137x	# Ensoniq AudioPCI ES137x
 device		snd_t4dwave	# Acer Labs M5451
-device		snd_uaudio	# USB Audio

--EeQfGwPcQSOJBaQU--

--ZfOjI3PrQbgiZnxM
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/kFreeBSD)

iQIcBAEBCAAGBQJPUhMEAAoJELd1onhloKnORW0P/RLDg4ie4j4H0zdSTVSp2VdZ
0rKiqr4Umme0zP4weE7who3d+TN96VSpb0PVLHapr/7/24PspdBZL5fpKRe6ewe1
036TXs5E6LzxBJEUDDoY6Jh1hQaDwvftLi1LTSXF4FIluzs01ySXeoFx7eDHKtyg
zpevczl5D5Bi97lpBdLQWHgKk0S+0afcw4CA1CfFGSuTkslioMw+HSwC1pp8fGKY
vzINI8PWGjEN5z8oGjT+6RktTot8TpRVb2Yhe8V0T5N4AJHMTg0kKEya+wWNiLd/
f8Ur4r8mQPCXma4Etb0NNpMXzCWXaHmI6V9HT60TCuF+PN8pyYakaesJI1k5hYW0
tJf9h32QAtfl2CTtMRJ4/ZfSFBOtJCVpMd3okwm0b4nKLmNsmZ8KpvowtN+7lfQe
+DxPQalBBwSEAbbAF1aSdvLQ7GfnUTxlWCZZgDVlVnOBUXmb2ar04BiW+8ZiXHui
7TcbaQK9wC293U6hePhCUlkW+OzgtKVz39J+DCH3DBQSqGyG9I6NIAiwG9xIunwV
941M521a/8SLZvK1+d4vKxwsb9j14z8Vpd3XyYPDg/8fCsGISiPqRGeMqVUu47VC
N4eMO2Qanv1rYE5l0ChC/kcnB/rRbitM/+CG2d9+XA0M3gIf8zRCGdIwPpuvBvuk
FHJ8GL02AfLYso/rsX9F
=Ek1B
-----END PGP SIGNATURE-----

--ZfOjI3PrQbgiZnxM--

From owner-freebsd-arch@FreeBSD.ORG  Sat Mar  3 14:17:50 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id F0670106564A
	for <freebsd-arch@freebsd.org>; Sat,  3 Mar 2012 14:17:50 +0000 (UTC)
	(envelope-from johnandsara2@cox.net)
Received: from eastrmfepo101.cox.net (eastrmfepo101.cox.net [68.230.241.213])
	by mx1.freebsd.org (Postfix) with ESMTP id 8A6D98FC08
	for <freebsd-arch@freebsd.org>; Sat,  3 Mar 2012 14:17:49 +0000 (UTC)
Received: from eastrmimpo110.cox.net ([68.230.241.223])
	by eastrmfepo101.cox.net
	(InterMail vM.8.01.04.00 201-2260-137-20101110) with ESMTP id
	<20120303141738.KEKX18243.eastrmfepo101.cox.net@eastrmimpo110.cox.net>;
	Sat, 3 Mar 2012 09:17:38 -0500
Received: from [192.168.3.22] ([70.177.172.35])
	by eastrmimpo110.cox.net with bizsmtp
	id h2Hd1i00W0mAvba022HeRe; Sat, 03 Mar 2012 09:17:38 -0500
X-CT-Class: Clean
X-CT-Score: 0.00
X-CT-RefID: str=0001.0A020203.4F522802.0081,ss=1,re=0.000,fgs=0
X-CT-Spam: 0
X-Authority-Analysis: v=1.1 cv=4+d3365FwXO39Q6CIaohezzFfUymJ8jBUV6iqnnMg0E=
	c=1 sm=1 a=f5xKl4ys9bwA:10 a=AeehsHawFTcA:10 a=G8Uczd0VNMoA:10
	a=Wajolswj7cQA:10 a=8nJEP1OIZ-IA:10 a=alU6Bxxa4qBWIf+k8j/ISQ==:17
	a=kviXuzpPAAAA:8 a=mMHHQ8NI51iLL4NWVPUA:9 a=wPNLvfGTeEIA:10
	a=4vB-4DCPJfMA:10 a=alU6Bxxa4qBWIf+k8j/ISQ==:117
X-CM-Score: 0.00
Authentication-Results: cox.net; none
Message-ID: <4F5227FE.3080708@cox.net>
Date: Sat, 03 Mar 2012 09:17:34 -0500
From: "John D. Hendrickson and Sara Darnell" <johnandsara2@cox.net>
User-Agent: Thunderbird 2.0.0.24 (X11/20100228)
MIME-Version: 1.0
To: Ed Schouten <ed@80386.nl>
References: <4E18ABB1.4010304@cox.net> <20110709194639.GA4914@elie>
	<4E18EE60.7010402@cox.net> <20110710151354.GA25475@r500-debian>
	<4F50FD4D.9000106@cox.net> <20120302183138.GC32748@hoeg.nl>
In-Reply-To: <20120302183138.GC32748@hoeg.nl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-arch@freebsd.org
Subject: Re: dep-trace  v.  tsort  (mac ports depends support)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: johnandsara2@cox.net
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Mar 2012 14:17:51 -0000

Hi, and thanks for looking!

Yes and no (no), I thought of that.

Who knows how to order the depends so that tsort can output them in a non-topological order?  Who has
time to plot program compile-order (or pkg) depends as SVG, like airports and airplanes, and draw
arrows between them?

Another issue with pre-positioning each sublist in the port files: you are SOL if any ordering is
lost before tsort gets them.

Another issue: one port does not know the full sublist of the other. (There are probably more; I'll
stop there.)

Have Fun!

-- John

Ed Schouten wrote:
> Hi John,
> 
> * John D. Hendrickson and Sara Darnell <johnandsara2@cox.net>, 20120302 18:03:
>> BSD and Apple needs tsort(1) for portage still I believe.
>>
>> Topological sorting isn't quite right packaging.
>>
>> [...]
>>
>> (ie, for portage: you need to dl source, order of compile may be
>> required, sometimes gets missing message or "loop in depends" message
>> when attempting to compile and install pkg)
> 
> But wait. Isn't this because of mis-use of tsort(1) by portage?
> 
> tsort(1) can give you any ordering you like, as long as you make sure
> your input graph is correct.
> 
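
To make the exchange above concrete, here is a minimal sketch (in Python
rather than tsort(1)) of ordering dependencies while controlling which of
the valid orders comes out, and of reporting a "loop in depends". It
assumes the input is a list of (prerequisite, dependent) pairs in the same
spirit as tsort(1) input. The names order_ports and priority are
hypothetical, chosen for illustration; this is not how the ports tree or
MacPorts actually implements it.

```python
import heapq
from collections import defaultdict


def order_ports(pairs, priority=None):
    """Return ports in dependency order; raise ValueError on a cycle.

    pairs    -- iterable of (prerequisite, dependent) tuples
    priority -- hypothetical tie-break key: among ports that are ready
                to build, the one with the smallest key comes first
    """
    if priority is None:
        priority = lambda name: name  # default: alphabetical tie-break

    dependents = defaultdict(set)  # prerequisite -> ports that need it
    pending = {}                   # port -> count of unbuilt prerequisites
    for before, after in pairs:
        pending.setdefault(before, 0)
        pending.setdefault(after, 0)
        # A pair naming the same port twice only declares the port.
        if before != after and after not in dependents[before]:
            dependents[before].add(after)
            pending[after] += 1

    # Kahn's algorithm, with a heap so the tie-break is deterministic.
    ready = [(priority(p), p) for p, n in pending.items() if n == 0]
    heapq.heapify(ready)
    ordered = []
    while ready:
        _, port = heapq.heappop(ready)
        ordered.append(port)
        for nxt in dependents[port]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                heapq.heappush(ready, (priority(nxt), nxt))

    if len(ordered) != len(pending):
        # Ports on a cycle, plus anything that depends on one.
        stuck = sorted(p for p, n in pending.items() if n > 0)
        raise ValueError("loop in depends: " + ", ".join(stuck))
    return ordered


if __name__ == "__main__":
    deps = [("zlib", "libpng"), ("libpng", "gd"),
            ("zlib", "gd"), ("gd", "php")]
    print(order_ports(deps))
```

The output is always a topological order. The priority key only decides
between ports that are equally ready, which is the part a plain tsort(1)
pipeline leaves unspecified.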


From owner-freebsd-arch@FreeBSD.ORG  Sat Mar  3 15:12:40 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 8CCCB106564A
	for <freebsd-arch@freebsd.org>; Sat,  3 Mar 2012 15:12:40 +0000 (UTC)
	(envelope-from ed@hoeg.nl)
Received: from mx0.hoeg.nl (mx0.hoeg.nl [IPv6:2a01:4f8:101:5343::aa])
	by mx1.freebsd.org (Postfix) with ESMTP id 24BAA8FC08
	for <freebsd-arch@freebsd.org>; Sat,  3 Mar 2012 15:12:40 +0000 (UTC)
Received: by mx0.hoeg.nl (Postfix, from userid 1000)
	id 5C4272A28CCF; Sat,  3 Mar 2012 16:12:39 +0100 (CET)
Date: Sat, 3 Mar 2012 16:12:39 +0100
From: Ed Schouten <ed@80386.nl>
To: "John D. Hendrickson and Sara Darnell" <johnandsara2@cox.net>
Message-ID: <20120303151239.GF32748@hoeg.nl>
References: <4E18ABB1.4010304@cox.net> <20110709194639.GA4914@elie>
	<4E18EE60.7010402@cox.net> <20110710151354.GA25475@r500-debian>
	<4F50FD4D.9000106@cox.net> <20120302183138.GC32748@hoeg.nl>
	<4F5227FE.3080708@cox.net>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="Rln2GmQ7CFmDhc9B"
Content-Disposition: inline
In-Reply-To: <4F5227FE.3080708@cox.net>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-arch@freebsd.org
Subject: Re: dep-trace  v.  tsort  (mac ports depends support)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Mar 2012 15:12:40 -0000


--Rln2GmQ7CFmDhc9B
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi John,

* John D. Hendrickson and Sara Darnell <johnandsara2@cox.net>, 20120303 15:17:
> Who knows to order depends so tsort can order them in non-topological
> order as output?  Who has time to SVG plot program compile order (or
> pkg) depends like airports and airplanes and draw arrows between
> them?
>
> Another issue of pre-positioning each sublists in port files : you
> are SOL if there is any loss of order before tsort gets them.
>
> Another issue (one port not knowing the full sublist of the other)
> (there are probably more I'll stop there).

But the point is that if applications are looking for such things, they
are typically implemented in a higher-level programming language that
allows them to implement such features themselves, instead of relying on
a 1980s command line tool.

--=20
 Ed Schouten <ed@80386.nl>
 WWW: http://80386.nl/

--Rln2GmQ7CFmDhc9B
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iQIcBAEBAgAGBQJPUjTnAAoJEG5e2P40kaK7OvcP/13zepoBcf0y2NuGVZZsG/dB
nOcuBe+iN4NbOLc5NpQXpa1Jo25gXTlrHBx5QuYy12gRIlDHTOAGZNMhUZq0zUgX
BQM88POAHP5xUDV0d9icq4PKgeOfykoEOflXNezuefhc5ImCEASGd2fY2R1nw3T4
iwuj8jNV97wj4bZn0RIvByUBcSMVLOZnyRq+6SMxUUJ+IDlFOVtnsiwGuJH5qoI3
8ZezSMIw5QAUX+2xuDNXInQlRrYxLk/B6xcKTZWq0nqj2cy6uwRBTqtXdeiZ7KVT
u0jpZuvSSFf3oZRSa51Puqx0VVHGY9OVDHQigncCVc7wY62XwoBJ8nO3FG9wTndP
92yAvq7Fkf0UgQo1gpaROVtVioPRhS6h4g6o4cW0qtrs4XsygEcne3OOuywZiL+R
TLbah2ByZNS05RT8XkKQVs1Jn+6I+fDn0LNfobSvfZdtL6BCL2iJUgqgMPzRvDMR
uzAvhtMQhFXhG2BN0GFoQjLD18M8mC5xChs/hluAsmBHM5u5+Qqtu9c7YlqDlpzl
YArTAXqiGKW/5p1jrBFAVNpe3H9tZ2de10RI817dVfmXFEdn3oCSPP2Qn7Ih5FIh
90HfmC1JQ4oJSl4ITNXJDiV0kBlf1hwKq7Dr3kh0zt7eqWGupmdlueQeXat2JbzF
JQoJyCk5r76mqZ+P5Nb/
=+bRr
-----END PGP SIGNATURE-----

--Rln2GmQ7CFmDhc9B--

From owner-freebsd-arch@FreeBSD.ORG  Sat Mar  3 18:04:03 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 54C79106582B;
	Sat,  3 Mar 2012 18:04:00 +0000 (UTC)
	(envelope-from hselasky@c2i.net)
Received: from swip.net (mailfe05.c2i.net [212.247.154.130])
	by mx1.freebsd.org (Postfix) with ESMTP id 4B0808FC13;
	Sat,  3 Mar 2012 18:03:58 +0000 (UTC)
X-T2-Spam-Status: No, hits=-0.2 required=5.0 tests=ALL_TRUSTED,
	BAYES_50
Received: from [176.74.212.201] (account mc467741@c2i.net HELO
	laptop002.hselasky.homeunix.org)
	by mailfe05.swip.net (CommuniGate Pro SMTP 5.4.2)
	with ESMTPA id 244285303; Sat, 03 Mar 2012 19:03:51 +0100
From: Hans Petter Selasky <hselasky@c2i.net>
To: freebsd-arch@freebsd.org
Date: Sat, 3 Mar 2012 19:02:10 +0100
User-Agent: KMail/1.13.5 (FreeBSD/8.3-PRERELEASE; KDE/4.4.5; amd64; ; )
References: <CAOfDtXNDXV-hM5t56XKj6-m-Bc=SSZsmB7JnEXsoDGdF2DEuqw@mail.gmail.com>
	<201202181720.27135.hselasky@c2i.net>
	<20120303124805.GA4725@thorin>
In-Reply-To: <20120303124805.GA4725@thorin>
X-Face: 'mmZ:T{)),Oru^0c+/}w'`gU1$ubmG?lp!=R4Wy\ELYo2)@'UZ24N@d2+AyewRX}mAm; Yp
	|U[@, _z/([?1bCfM{_"B<.J>mICJCHAzzGHI{y7{%JVz%R~yJHIji`y>Y}k1C4TfysrsUI
	-%GU9V5]iUZF&nRn9mJ'?&>O
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-15"
Content-Transfer-Encoding: 7bit
Message-Id: <201203031902.11035.hselasky@c2i.net>
Cc: Kostik Belousov <kostikbel@gmail.com>, Adrian Chadd <adrian@freebsd.org>,
	freebsd-usb@freebsd.org, Robert Millan <rmh@freebsd.org>
Subject: Re: Exclude USB drivers from main kernel image?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Mar 2012 18:04:03 -0000

On Saturday 03 March 2012 13:48:05 Robert Millan wrote:
> On Sat, Feb 18, 2012 at 05:20:27PM +0100, Hans Petter Selasky wrote:
> > The /etc/devd/usb.conf is regularly updated, though not automatically. It
> > should auto-load most kind of devices. Only additional case that comes to
> > mind is that USB serial console will not be active until devd has
> > executed, if that is enabled.
> 
> If early USB serial output is desired, it can be enabled by enabling the
> module in bootloader. Is that an acceptable trade-off?
> 
> > Your patch looks OK. Adding ARCH @
> > 
> > Instead of commenting out, I would just remove those lines.
> 
> Here's a new patch that removes the lines instead of commenting them out.
> 
> Consistently with that, it also removes a few lines which were already
> commented out, using the same criteria.
> 
> Also, it disables a few more USB drivers. Due to an oversight my previous
> patch didn't disable all drivers that devd can handle.
> 
> Patch is tested with "make universe" on HEAD.

Hi,

Your patch looks good.

Are there any objections to committing the patch attached to the previous e-mail?

--HPS

From owner-freebsd-arch@FreeBSD.ORG  Sat Mar  3 18:42:12 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.ORG
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id D7900106564A;
	Sat,  3 Mar 2012 18:42:12 +0000 (UTC) (envelope-from imp@bsdimp.com)
Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85])
	by mx1.freebsd.org (Postfix) with ESMTP id 628A08FC08;
	Sat,  3 Mar 2012 18:42:09 +0000 (UTC)
Received: from 63.imp.bsdimp.com (63.imp.bsdimp.com [10.0.0.63])
	(authenticated bits=0)
	by harmony.bsdimp.com (8.14.4/8.14.3) with ESMTP id q23Ibnj7081759
	(version=TLSv1/SSLv3 cipher=DHE-DSS-AES128-SHA bits=128 verify=NO);
	Sat, 3 Mar 2012 11:37:50 -0700 (MST) (envelope-from imp@bsdimp.com)
Mime-Version: 1.0 (Apple Message framework v1084)
Content-Type: text/plain; charset=us-ascii
From: Warner Losh <imp@bsdimp.com>
In-Reply-To: <201203031902.11035.hselasky@c2i.net>
Date: Sat, 3 Mar 2012 11:37:49 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <DE8376D4-84EE-4B53-8491-4B2203F08274@bsdimp.com>
References: <CAOfDtXNDXV-hM5t56XKj6-m-Bc=SSZsmB7JnEXsoDGdF2DEuqw@mail.gmail.com>
	<201202181720.27135.hselasky@c2i.net>
	<20120303124805.GA4725@thorin>
	<201203031902.11035.hselasky@c2i.net>
To: Hans Petter Selasky <hselasky@c2i.net>
X-Mailer: Apple Mail (2.1084)
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(harmony.bsdimp.com [10.0.0.6]);
	Sat, 03 Mar 2012 11:37:50 -0700 (MST)
Cc: Kostik Belousov <kostikbel@gmail.com>, Adrian Chadd <adrian@FreeBSD.ORG>,
	Robert Millan <rmh@FreeBSD.ORG>, freebsd-usb@FreeBSD.ORG,
	freebsd-arch@FreeBSD.ORG
Subject: Re: Exclude USB drivers from main kernel image?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Mar 2012 18:42:12 -0000


On Mar 3, 2012, at 11:02 AM, Hans Petter Selasky wrote:

> On Saturday 03 March 2012 13:48:05 Robert Millan wrote:
>> On Sat, Feb 18, 2012 at 05:20:27PM +0100, Hans Petter Selasky wrote:
>>> The /etc/devd/usb.conf is regularly updated, though not automatically. It
>>> should auto-load most kind of devices. Only additional case that comes to
>>> mind is that USB serial console will not be active until devd has
>>> executed, if that is enabled.
>>
>> If early USB serial output is desired, it can be enabled by enabling the
>> module in bootloader. Is that an acceptable trade-off?
>>
>>> Your patch looks OK. Adding ARCH @
>>>
>>> Instead of commenting out, I would just remove those lines.
>>
>> Here's a new patch that removes the lines instead of commenting them out.
>>
>> Consistently with that, it also removes a few lines which were already
>> commented out, using the same criteria.
>>
>> Also, it disables a few more USB drivers. Due to an oversight my previous
>> patch didn't disable all drivers that devd can handle.
>>
>> Patch is tested with "make universe" on HEAD.
>
> Hi,
>
> Your patch looks good.
>
> Are there any objections committing the patch attached to the previous e-mail?

Do all the platforms that had the devices removed work?  Have they all
been tested?

Warner