From owner-freebsd-arch@FreeBSD.ORG Sun Feb 26 14:02:56 2012
Date: Sun, 26 Feb 2012 15:02:54 +0100
From: Attilio Rao
To: Konstantin Belousov
Cc: arch@freebsd.org, Florian Smeets, Pawel Jakub Dawidek
Subject: Re: Prefaulting for i/o buffers
In-Reply-To: <20120225210339.GM55074@deviant.kiev.zoral.com.ua>
References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225210339.GM55074@deviant.kiev.zoral.com.ua>
List-Id: Discussion related to FreeBSD architecture

On 25 February 2012 22:03, Konstantin Belousov wrote:
> On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
>> On 25 February 2012 16:13, Pawel Jakub Dawidek wrote:
>> > On Sat, Feb 25, 2012 at 01:01:32PM +0000, Attilio Rao wrote:
>> >> On 03 February 2012 19:37, Konstantin Belousov wrote:
>> >> > FreeBSD I/O infrastructure has a well-known issue with deadlock caused
>> >> > by vnode lock order reversal when buffers supplied to read(2) or
>> >> > write(2) syscalls are backed by an mmaped file.
>> >> >
>> >> > I previously published the patches to convert the i/o path to use VMIO,
>> >> > based on the Jeff Roberson proposal, see
>> >> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
>> >> > deadlock. Since that work is very intrusive and did not get any
>> >> > follow-up, it stalled.
>> >> >
>> >> > Below is a very lightweight patch whose only goal is to fix the deadlock
>> >> > in the least intrusive way. This is possible after FreeBSD got the
>> >> > vm_fault_quick_hold_pages(9) and vm_fault_disable_pagefaults(9) KPIs.
>> >> > http://people.freebsd.org/~kib/misc/vm1.3.patch
>> >>
>> >> Hi,
>> >> I was reviewing:
>> >> http://people.freebsd.org/~kib/misc/vm1.11.patch
>> >>
>> >> and I think it is great. It is simple enough and I don't have further
>> >> comments on it.
> Thank you.
>
> This spoiled an announcement I intended to send this weekend :)
>
>> >>
>> >> However, as a side note, I was wondering whether we could one day get
>> >> to the point of integrating rangelocks into the vnode lockmgr directly.
>> >> It would be a huge patch, likely rewriting the locking of several vnode
>> >> members, but I think it would be worth it in terms of cleanliness
>> >> of the interface and less overhead. Also, it would be interesting to
>> >> consider merging the rangelock implementation with the ZFS one, at some point.
>> >
>> > My personal opinion about rangelocks and many other VFS features we
>> > currently have is that they are a good idea in theory, but in practice
>> > they tend to overcomplicate VFS.
>> >
>> > I'm of the opinion that we should move as much stuff as we can to
>> > individual file systems. We try to implement everything in VFS itself in
>> > the hope that this will simplify the file systems we have. It then turns
>> > out only one file system is really using this stuff (most of the time it
>> > is UFS) and this is a PITA for all the other file systems as well as for
>> > maintaining VFS. VFS became so complicated over the years that there are
>> > maybe a few people who can understand it, and every single change to VFS
>> > carries a huge risk of breaking some unrelated parts.
>>
>> I think this is questionable for the following reasons:
>> - If the problem is that filesystem writers have trouble
>> understanding the necessary locking, we should really provide cleaner
>> and more complete documentation. One would think the same of our VM
>> subsystem, but at least in that case there are plenty of comments that
>> help in understanding how to deal with vm_object and vm_page locking
>> during their lifetimes.
>> - Our primitives may be more complicated than the
>> 'all-in-the-filesystem' ones, but at least they offer a complete and
>> centralized view of the resources we have allocated in the whole
>> system and they allow building better policies about how to manage
>> them. One problem I see here is that those policies are not fully
>> implemented or tuned, or have simply become outdated, removing one of
>> the biggest benefits that we get by making vnodes so generic.
>>
>> About the things I mentioned myself:
>> - As long as the same path now has both range-locking and vnode
>> locking, I don't see it as a good idea to keep both separated forever.
>> Merging them seems to me an important evolution (not only helping to
>> shrink the number of primitives themselves but also introducing
>> less overhead and likely revamped scalability for vnodes), but I think
>> this needs a deep investigation.
> The proper direction to move there is to designate the vnode lock for
> the vnode structure protection, and have the range lock protect the
> i/o atomicity. This is somewhat done in the proposed patch (since
> now the vnode lock does not protect the i/o operation, but only chunked
> i/o transactions inside the operation).
>
> Jeff's idea of using the page cache as the source of i/o data (implemented
> in the VM6 patchset) pushes the idea much further. E.g., the write
> typically does not obtain the write vnode lock (but sometimes it has to,
> in order to extend the vnode).
>
> Probably, I will revive VM6 after this change has landed.

About that, I guess we should be careful.
The first thing would be having a very scalable VM subsystem, and recent benchmarks have shown that this is not yet the case (Florian, CC'ed, can share some pmc/LOCK_PROFILE analysis on pgsql that, even with the vmcontention patch, shows a lot of contention on vm_object, the pmap lock and vm_page_queue_lock. We have some plans for each of them; we can discuss them in a separate thread if you prefer). This is just to say that we may need more work in underlying areas to bring VM6 to the point where it will really make a difference.

Thanks,
Attilio

--
Peace can only be achieved by understanding - A.
Einstein

From owner-freebsd-arch@FreeBSD.ORG Sun Feb 26 14:04:21 2012
Date: Sun, 26 Feb 2012 15:04:19 +0100
From: Attilio Rao
To: Pawel Jakub Dawidek
Cc: Konstantin Belousov, arch@freebsd.org
Subject: Re: Prefaulting for i/o buffers
In-Reply-To: <20120225194630.GI1344@garage.freebsd.pl>

On 25 February 2012 20:46, Pawel Jakub Dawidek wrote:
> On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
>> On 25 February 2012 16:13, Pawel Jakub Dawidek wrote:
>> > My personal opinion about rangelocks and many other VFS features we
>> > currently have is that they are a good idea in theory, but in practice
>> > they tend to overcomplicate VFS.
>> >
>> > I'm of the opinion that we should move as much stuff as we can to
>> > individual file systems. We try to implement everything in VFS itself in
>> > the hope that this will simplify the file systems we have. It then turns
>> > out only one file system is really using this stuff (most of the time it
>> > is UFS) and this is a PITA for all the other file systems as well as for
>> > maintaining VFS. VFS became so complicated over the years that there are
>> > maybe a few people who can understand it, and every single change to VFS
>> > carries a huge risk of breaking some unrelated parts.
>>
>> I think this is questionable for the following reasons:
>> - If the problem is that filesystem writers have trouble
>> understanding the necessary locking, we should really provide cleaner
>> and more complete documentation. One would think the same of our VM
>> subsystem, but at least in that case there are plenty of comments that
>> help in understanding how to deal with vm_object and vm_page locking
>> during their lifetimes.
>
> Documentation is not the answer here. If the code is so complex, it is
> harder to learn no matter how good the documentation is; it makes fewer
> people willing to learn it in the first place and it makes the code more
> buggy, because there are more edge/special cases you can forget about.
>
>> - Our primitives may be more complicated than the
>> 'all-in-the-filesystem' ones, but at least they offer a complete and
>> centralized view of the resources we have allocated in the whole
>> system and they allow building better policies about how to manage
>> them. One problem I see here is that those policies are not fully
>> implemented or tuned, or have simply become outdated, removing one of
>> the biggest benefits that we get by making vnodes so generic.
>
> Again, this is only a nice theory that is far from reality.
> You will never be able to have control over all the resources allocated
> by file systems.
>
>> About the things I mentioned myself:
>> - As long as the same path now has both range-locking and vnode
>> locking, I don't see it as a good idea to keep both separated forever.
>> Merging them seems to me an important evolution (not only helping to
>> shrink the number of primitives themselves but also introducing
>> less overhead and likely revamped scalability for vnodes), but I think
>> this needs a deep investigation.
>> - About ZFS rangelocks absorbing the VFS ones, I think this is a minor
>> point, but still, if you think it can be done efficiently and without
>> losing performance I don't see why not do that. You already wrote
>> rangelocks for ZFS, so you have gained a lot of experience in this
>> area and can comment on fallout, etc., but I don't see a good reason
>> not to do that, unless it is just too difficult. This is not about
>> generalizing a new mechanism; it is about using a general mechanism in
>> a specific implementation, if possible.
>
> I did not implement rangelocking for ZFS. It came with ZFS when I ported
> it. As long as we want to merge changes from upstream (which is now
> IllumOS), we don't want to make huge changes just for the sake of proving
> that this is a general-purpose mechanism used by more than one file system.
>
> Attilio, don't get me wrong. In 99% of cases it is good to make code more
> general and more universal and reusable, but we can't ignore reality.
>
> There are reasons why file systems like XFS, ReiserFS and others were
> never fully ported. I'm not saying VFS complexity was the only reason,
> but I'm sure it was one of them.
>
> Our VFS is very UFS-centric. We make so many assumptions that sound
> fine only for UFS. I saw plenty of those while working on ZFS, like:
>
> - "Every file system needs cache. Let's make it general, so that all file
>   systems can use it!" Well, for VFS each file system is a separate
>   entity, which is not the case for ZFS. ZFS can cache only once a block
>   that is used by one file system, 10 clones and 100 snapshots,
>   which are all separate mount points from the VFS perspective.
>   The same block would be cached 111 times by the buffer cache.
>
> - "rmdir(2) on a mountpoint is a bad idea, let's deny it at the VFS level."
>   It is a bad idea, indeed, but in ZFS it is a nice way to remove a
>   snapshot by rmdiring the .zfs/snapshot/ directory.
>
> - No one implemented rangelocking in VFS, so no file system can use it,
>   even if the given file system has all the code to do it.
>
> etc.
>
> I'm also sure it would be way easier for Jeff to make VFS MP-safe if it
> were less complex.
>
> When looking at the big picture, it would be nice to have all this
> general stuff like rangelocking, quota, buffer cache, etc. as some kind
> of libraries for file systems to use and not something that is
> mandatory. If I develop a file system for FreeBSD only and I don't want
> to reinvent the wheel, I can use those libraries. If I port a file system
> to FreeBSD or develop a file system that doesn't really need those
> libraries, I'm not forced to use them.
>
> All this might make a good working group subject at the BSDCan devsummit.
> We could cross swords there :)

Do you think you will be able to chair such a group?
I'm not sure I will be able to make it to BSDCan, but it would be valuable if you or someone else interested could get the ball rolling on these topics.

Thanks,
Attilio

--
Peace can only be achieved by understanding - A. Einstein

From owner-freebsd-arch@FreeBSD.ORG Sun Feb 26 14:13:39 2012
Date: Sun, 26 Feb 2012 16:13:34 +0200
From: Konstantin Belousov
To: Attilio Rao
Cc: arch@freebsd.org, Florian Smeets, Pawel Jakub Dawidek
Subject: Re: Prefaulting for i/o buffers
Message-ID: <20120226141334.GU55074@deviant.kiev.zoral.com.ua>

On Sun, Feb 26, 2012 at 03:02:54PM +0100, Attilio Rao wrote:
> On 25 February 2012 22:03, Konstantin Belousov wrote:
> > Probably, I will revive VM6 after this change has landed.
>
> About that, I guess we should be careful.
> The first thing would be having a very scalable VM subsystem, and
> recent benchmarks have shown that this is not yet the case (Florian,
> CC'ed, can share some pmc/LOCK_PROFILE analysis on pgsql that, even
> with the vmcontention patch, shows a lot of contention on vm_object,
> the pmap lock and vm_page_queue_lock. We have some plans for each of
> them; we can discuss them in a separate thread if you prefer). This is
> just to say that we may need more work in underlying areas to bring
> VM6 to the point where it will really make a difference.

The benchmarks that were done at that time demonstrated that VM6 does not cause regressions for e.g. buildworld time, and shows marginal improvements, around 10%, for some postgresql loads.

The main benefit of VM6 on UFS is that writers no longer block readers for separate i/o ranges. Also, due to the vm_page flags locking improvements, I suspect the VM6 backpressure code might be simplified and give an even larger benefit right now.

Anyway, I do not think that VM6 can be put into HEAD quickly, and I want to finish with VM1/prefaulting right now.
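
The point made above, that writers need not block readers working on separate i/o ranges, can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not the FreeBSD rangelock code nor anything from the VM1 or VM6 patches, the names ToyRangeLock, Grant and try_lock are hypothetical and invented for this illustration, and the lock is non-blocking (it returns None where a real implementation would put the thread to sleep).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grant:
    """One granted byte range [start, end), held in read or write mode."""
    start: int
    end: int
    write: bool


@dataclass
class ToyRangeLock:
    """Hypothetical, non-blocking model of a per-file range lock."""
    granted: list = field(default_factory=list)

    def try_lock(self, start, end, write):
        """Return a Grant if [start, end) conflicts with no holder, else None."""
        if start >= end:
            raise ValueError("empty or inverted range")
        for g in self.granted:
            overlaps = start < g.end and g.start < end
            # Two readers may share bytes; any overlap involving a writer conflicts.
            if overlaps and (write or g.write):
                return None  # a real lock would sleep here instead
        grant = Grant(start, end, write)
        self.granted.append(grant)
        return grant

    def unlock(self, grant):
        """Release a previously granted range."""
        self.granted.remove(grant)


lock = ToyRangeLock()
w = lock.try_lock(0, 4096, write=True)            # writer on the first 4 KiB
r_far = lock.try_lock(8192, 12288, write=False)   # reader on a disjoint range
r_near = lock.try_lock(2048, 6144, write=False)   # reader overlapping the writer
lock.unlock(w)
r_retry = lock.try_lock(2048, 6144, write=False)  # same reader after the writer left
```

In this model the disjoint reader is granted while the writer is active, and only the overlapping reader has to wait. A single exclusive lock held over the whole file for the duration of the write would have refused both readers, which is the behaviour the thread attributes to holding the vnode lock across the entire i/o operation.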

From owner-freebsd-arch@FreeBSD.ORG Sun Feb 26 14:16:15 2012
Date: Sun, 26 Feb 2012 14:16:12 +0000
From: Attilio Rao
To: Konstantin Belousov
Cc: arch@freebsd.org, Florian Smeets, Pawel Jakub Dawidek
Subject: Re: Prefaulting for i/o buffers
In-Reply-To: <20120226141334.GU55074@deviant.kiev.zoral.com.ua>

On 26 February 2012 14:13, Konstantin Belousov wrote:
> On Sun, Feb 26, 2012 at 03:02:54PM +0100, Attilio Rao wrote:
>> About that, I guess we should be careful.
>> The first thing would be having a very scalable VM subsystem, and
>> recent benchmarks have shown that this is not yet the case (Florian,
>> CC'ed, can share some pmc/LOCK_PROFILE analysis on pgsql that, even
>> with the vmcontention patch, shows a lot of contention on vm_object,
>> the pmap lock and vm_page_queue_lock. We have some plans for each of
>> them; we can discuss them in a separate thread if you prefer). This is
>> just to say that we may need more work in underlying areas to bring
>> VM6 to the point where it will really make a difference.
>
> The benchmarks that were done at that time demonstrated that VM6 does not
> cause regressions for e.g. buildworld time, and shows marginal
> improvements, around 10%, for some postgresql loads.
>
> The main benefit of VM6 on UFS is that writers no longer block readers
> for separate i/o ranges. Also, due to the vm_page flags locking
> improvements, I suspect the VM6 backpressure code might be simplified and
> give an even larger benefit right now.
>
> Anyway, I do not think that VM6 can be put into HEAD quickly, and I want
> to finish with VM1/prefaulting right now.

I was speaking about a different benchmark.
Florian did a lock_profile/hwpmc analysis on stock + the vmcontention patch to verify where the biggest bottlenecks are. Of course, it turns out that the most contended locks are all the ones involved in VM, which is not surprising at all.

He can share numbers and insight, I guess.

Attilio

--
Peace can only be achieved by understanding - A.
Einstein From owner-freebsd-arch@FreeBSD.ORG Sun Feb 26 14:22:07 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 941C81065670; Sun, 26 Feb 2012 14:22:07 +0000 (UTC) (envelope-from flo@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 0D1138FC0C; Sun, 26 Feb 2012 14:22:07 +0000 (UTC) Received: from nibbler-osx.fritz.box (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.5/8.14.5) with ESMTP id q1QEM4dd009659; Sun, 26 Feb 2012 14:22:05 GMT (envelope-from flo@FreeBSD.org) Message-ID: <4F4A400C.1030606@FreeBSD.org> Date: Sun, 26 Feb 2012 15:22:04 +0100 From: Florian Smeets User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20120216 Thunderbird/11.0 MIME-Version: 1.0 To: Attilio Rao References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225210339.GM55074@deviant.kiev.zoral.com.ua> <20120226141334.GU55074@deviant.kiev.zoral.com.ua> In-Reply-To: X-Enigmail-Version: 1.4a1pre Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enigBB70FF77484EE4C06FD5CE12" Cc: Konstantin Belousov , arch@FreeBSD.org, Pawel Jakub Dawidek Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Feb 2012 14:22:07 -0000 This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enigBB70FF77484EE4C06FD5CE12 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On 26.02.12 15:16, Attilio Rao wrote: > Il 26 febbraio 2012 14:13, Konstantin Belousov ha= scritto: >> On Sun, Feb 26, 2012 at 03:02:54PM 
+0100, Attilio Rao wrote:
>>> Il 25 febbraio 2012 22:03, Konstantin Belousov ha scritto:
>>>> On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
>>>>> Il 25 febbraio 2012 16:13, Pawel Jakub Dawidek ha scritto:
>>>>>> On Sat, Feb 25, 2012 at 01:01:32PM +0000, Attilio Rao wrote:
>>>>>>> Il 03 febbraio 2012 19:37, Konstantin Belousov ha scritto:
>>>>>>>> FreeBSD I/O infrastructure has well known issue with deadlock caused
>>>>>>>> by vnode lock order reversal when buffers supplied to read(2) or
>>>>>>>> write(2) syscalls are backed by mmaped file.
>>>>>>>>
>>>>>>>> I previously published the patches to convert i/o path to use VMIO,
>>>>>>>> based on the Jeff Roberson proposal, see
>>>>>>>> http://wiki.freebsd.org/VM6. As a side effect, the VM6 fixed the
>>>>>>>> deadlock. Since that work is very intrusive and did not got any
>>>>>>>> follow-up, it get stalled.
>>>>>>>>
>>>>>>>> Below is very lightweight patch which only goal is to fix deadlock in
>>>>>>>> the least intrusive way. This is possible after FreeBSD got the
>>>>>>>> vm_fault_quick_hold_pages(9) and vm_fault_disable_pagefaults(9) KPIs.
>>>>>>>> http://people.freebsd.org/~kib/misc/vm1.3.patch
>>>>>>>
>>>>>>> Hi,
>>>>>>> I was reviewing:
>>>>>>> http://people.freebsd.org/~kib/misc/vm1.11.patch
>>>>>>>
>>>>>>> and I think it is great. It is simple enough and I don't have further
>>>>>>> comments on it.
>>>> Thank you.
>>>>
>>>> This spoiled an announce I intended to send this weekend :)
>>>>
>>>>>>>
>>>>>>> However, as a side note, I was thinking if we could get one day at the
>>>>>>> point to integrate rangelocks into vnodes lockmgr directly.
>>>>>>> It would be a huge patch, rewrtiting the locking of several members of
>>>>>>> vnodes likely, but I think it would be worth it in terms of cleaness
>>>>>>> of the interface and less overhead. Also, it would be interesting to
>>>>>>> consider merging rangelock implementation in ZFS' one, at some point.
>>>>>>
>>>>>> I personal opinion about rangelocks and many other VFS features we
>>>>>> currently have is that it is good idea in theory, but in practise it
>>>>>> tends to overcomplicate VFS.
>>>>>>
>>>>>> I'm in opinion that we should move as much stuff as we can to individual
>>>>>> file systems. We try to implement everything in VFS itself in hope that
>>>>>> this will simplify file systems we have. It then turns out only one file
>>>>>> system is really using this stuff (most of the time it is UFS) and this
>>>>>> is PITA for all the other file systems as well as maintaining VFS. VFS
>>>>>> became so complicated over the years that there are maybe few people
>>>>>> that can understand it, and every single change to VFS is a huge risk of
>>>>>> potentially breaking some unrelated parts.
>>>>>
>>>>> I think this is questionable due to the following assets:
>>>>> - If the problem is filesystems writers having trouble in
>>>>> understanding the necessary locking we should really provide cleaner
>>>>> and more complete documentation. One would think the same with our VM
>>>>> subsystem, but at least in that case there is plenty of comments that
>>>>> help understanding how to deal with vm_object, vm_pages locking during
>>>>> their lifelines.
>>>>> - Our primitives may be more complicated than the
>>>>> 'all-in-the-filesystem' one, but at least they offer a complete and
>>>>> centralized view over the resources we have allocated in the whole
>>>>> system and they allow building better policies about how to manage
>>>>> them. One problem I see here, is that those policies are not fully
>>>>> implemented, tuned or just got outdated, removing one of the highest
>>>>> beneficial that we have by making vnodes so generic
>>>>>
>>>>> About the thing I mentioned myself:
>>>>> - As long as the same path now has both range-locking and vnode
>>>>> locking I don't see as a good idea to keep both separated forever.
>>>>> Merging them seems to me an important evolution (not only helping
>>>>> shrinking the number of primitives themselves but also introducing
>>>>> less overhead and likely rewamped scalability for vnodes (but I think
>>>>> this needs a deep investigation).
>>>> The proper direction to move there is to designate the vnode lock for
>>>> the vnode structure protection, and have the range lock protect the
>>>> i/o atomicity. This is somewhat done in the proposed patch (since
>>>> now vnode lock does not protect the i/o operation, but only chunked
>>>> i/o transactions inside the operation).
>>>>
>>>> The Jeff idea of using page cache as the source of i/o data (implemented
>>>> in the VM6 patchset) pushes the idea much further. E.g., the write
>>>> does not obtain the write vnode lock typically (but sometimes it had,
>>>> to extend the vnode).
>>>>
>>>> Probably, I will revive VM6 after this change is landed.
>>>
>>> About that I guess we might be careful.
>>> The first thing would be having a very scalable VM subsystem and
>>> recent benchmarks have shown that this is not yet the case (Florian,
>>> CC'ed, can share some pmc/LOCK_PROFILE analysis on pgsql that, also
>>> with the vmcontention patch, shows a lot on contention on vm_object,
>>> pmap lock and vm_page_queue_lock. We have some plans for every of
>>> them, we will discuss on a separate thread if you prefer). This is
>>> just to say, that we may need more work in underground areas to bring
>>> VM6 to the point it will really make a difference.
>>
>> The benchmarks that were done at that time demonstrated that VM6 do not
>> cause regressions for e.g. buildworld time, and have a margin improvements,
>> around 10%, for some postgresql loads.
>>
>> Main benefit of the VM6 on UFS is that writers no longer block readers
>> for separate i/o ranges. Also, due to vm_page flags locking improvements,
>> I suspect the VM6 backpressure code might be simplified and give even
>> larger benefit right now.
>>
>> Anyway, I do not think that VM6 can be put into HEAD quickly, and I want
>> to finish with VM1/prefaulting right now.
>
> I was speaking about a different benchmark.
> Florian made a lock_profile/hwpmc analysis on stock + vmcontention
> patch for verifying where the biggest bottlenecks are.
> Of course, it turns out that the most contended locks are all the ones
> involved in VM, which is not surprising at all.
>
> He can share numbers and insight I guess.

All I did until now is run PostgreSQL with 128 client threads with
lock_profiling [1] and hwpmc [2]. I haven't spent any time analyzing
this, yet.

[1] http://people.freebsd.org/~flo/vmc-lock-profiling-postgres-128-20120208.txt
[2] http://people.freebsd.org/~flo/vmc-hwpmc-gprof-postgres-128-20120208.txt

--------------enigBB70FF77484EE4C06FD5CE12 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- iEYEARECAAYFAk9KQAwACgkQapo8P8lCvwl0mgCg2+4H30fWR7qt3g6iIxlYN28W iNIAn2b6unvHqHukMX+Tdp8rtgn/4TP2 =jfVO -----END PGP SIGNATURE----- --------------enigBB70FF77484EE4C06FD5CE12-- From owner-freebsd-arch@FreeBSD.ORG Sun Feb 26 14:52:22 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 8926C1065672; Sun, 26 Feb 2012 14:52:22 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 01CA58FC08; Sun, 26 Feb 2012 14:52:21 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q1QEqHsn071070; Sun, 26 Feb 2012 16:52:17 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with
ESMTP id q1QEqHNc027413; Sun, 26 Feb 2012 16:52:17 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q1QEqH7W027412; Sun, 26 Feb 2012 16:52:17 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Sun, 26 Feb 2012 16:52:17 +0200 From: Konstantin Belousov To: arch@freebsd.org Message-ID: <20120226145217.GV55074@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="UjTPyfxZRWWsQMwo" Content-Disposition: inline User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: Subject: Prefaulting for i/o buffers: v2.0 X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 26 Feb 2012 14:52:22 -0000 --UjTPyfxZRWWsQMwo Content-Type: text/plain; charset=us-ascii Content-Disposition: inline

I started a new thread since I do not want this message to interfere
with some side discussions caused by my initial letter.

I continued development and refinement of the original patch. The latest
version is available at http://people.freebsd.org/~kib/misc/vm1.11.patch.

Since the first announcement, intermediate versions of the patch were
reviewed by attilio, mdf and pjd. A version of the patch was tested and
benchmarked by flo; the results apparently showed no difference for a
postgresql benchmark.

I fixed several shameful bugs; in particular, buildworld now finishes
successfully on the patched kernel with the working directory on both
UFS and newnfs mounts. Apparently, bsdar(1) provides an excellent
functional test for the patch.

The most significant difference from previous variants is that the use
of prefaulting is now opt-in. I discovered that a typical filesystem
does not handle uiomove() errors gracefully. Only UFS and newnfs are
switched to use prefaulting. The newnfs client was changed to properly
handle uiomove() failures and to not cause user data loss on EFAULT
(this is also applicable to the stock svn sources). The corresponding
changes were reviewed by rmacklem.

My own feeling is that vm1.11.patch is ready to be committed. This is a
notification to allow more people to take a look and provide pre-commit
opinions.

Thanks.

--UjTPyfxZRWWsQMwo Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk9KRyEACgkQC3+MBN1Mb4gGIQCgpj2DAyn+TCL0qAeENcoSahWs wG0Anj/7RrFNMnQsngTAokRw27yduMk3 =sknu -----END PGP SIGNATURE----- --UjTPyfxZRWWsQMwo-- From owner-freebsd-arch@FreeBSD.ORG Wed Feb 29 06:41:08 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C5D2C1065679 for ; Wed, 29 Feb 2012 06:41:08 +0000 (UTC) (envelope-from delphij@delphij.net) Received: from anubis.delphij.net (anubis.delphij.net [IPv6:2001:470:1:117::25]) by mx1.freebsd.org (Postfix) with ESMTP id AB8A28FC17 for ; Wed, 29 Feb 2012 06:41:08 +0000 (UTC) Received: from delta.delphij.net (unknown [IPv6:2001:470:83bf:0:221:5cff:fe6a:37bb]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (No client certificate requested) by anubis.delphij.net (Postfix) with ESMTPSA id 14F3AF9A5; Tue, 28 Feb 2012 22:41:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=delphij.net;
s=anubis; t=1330497668; bh=tKYlgOcAgDQ6XcYBEJ6T/VLDqQIabAuSUz1m1UAFerA=; h=Message-ID:Date:From:Reply-To:MIME-Version:To:CC:Subject: Content-Type:Content-Transfer-Encoding; b=nCRw5QIXTTzPt2NWSKibm7+SPUlmzzyOUex9v3e8tI2FJ7PsIw1gmL6Kzc8mkOuzY ZVPykFN4skpC4lzjxW0JWZMl1kqwgfIUCRHyPqje4KITGrTGyWbAoii/9LxduKoK60 91cV8PELmMHEYMCugqRG3FK8L26yN6pexPTOkiGw= Message-ID: <4F4DC876.3010809@delphij.net> Date: Tue, 28 Feb 2012 22:40:54 -0800 From: Xin Li Organization: The FreeBSD Project MIME-Version: 1.0 To: freebsd-arch@freebsd.org X-Enigmail-Version: 1.3.5 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: d@delphij.net Subject: RFC: futimens(2) and utimensat(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: d@delphij.net List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Feb 2012 06:41:08 -0000 -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 Hi, These are required by IEEE Std 1003.1-2008. Patchset at: http://people.freebsd.org/~delphij/for_review/utimens.diff Cheers, - -- Xin LI https://www.delphij.net/ FreeBSD - The Power to Serve! 
Live free or die -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (FreeBSD) iQEcBAEBCAAGBQJPTch1AAoJEG80Jeu8UPuz75sH/3nyv6Lgdfa5MoF335u6H9zS kjyBroOlV3pJ/2V2d+77fw/qt5PmMG+jPwVwCrQ55+ZuntG9wvrT+UNnY67lzV55 /otFzF8a6onvpe8HSX7JJOh6neeN8njQzfJDClbDPFJKKm778Qfebjes1s0zk1tp JOvCf8bstXy02s0833sRW3HsfOh19f2KEPmKo2PXwgSrTGsLOWQqS7heFhszY5Hi woRkxs9RYRzs1i3MzkBSDYB+KTOV6H+SUBln6w/HudHMBjvdvlUxpEpHjOzqbhax bDE4QDljWY+3WK71Y48zEoEWO1P+jrbyciceIAWNF4RKmjSMeHMbnnTCFZFe+ZE= =FUtH -----END PGP SIGNATURE----- From owner-freebsd-arch@FreeBSD.ORG Wed Feb 29 11:51:23 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B33111065673; Wed, 29 Feb 2012 11:51:23 +0000 (UTC) (envelope-from pluknet@gmail.com) Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com [209.85.215.54]) by mx1.freebsd.org (Postfix) with ESMTP id 02C0D8FC0A; Wed, 29 Feb 2012 11:51:22 +0000 (UTC) Received: by lagv3 with SMTP id v3so490337lag.13 for ; Wed, 29 Feb 2012 03:51:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; bh=bntZeHk/+b/QEeAFwvT25WYIoly9nbR/TtiQT3Zr+Ek=; b=ZnJsDcD4bH2/5eBEGDtCIdAb+Ey33uUeZ5NSRjMOK1x8/+YzUeOFx9RQFdBzoqpHEC bieC7ZFA1wGu+xtT+MR8/khf+pvQ5sh99wWcfKbv+Tl2IsMBHrLL0TmKrLbIwycifH13 S0cyg8K+IAEdr84siHX4kC/veIEhF0QCRw/Q8= MIME-Version: 1.0 Received: by 10.152.135.148 with SMTP id ps20mr14686919lab.20.1330514483732; Wed, 29 Feb 2012 03:21:23 -0800 (PST) Received: by 10.152.108.204 with HTTP; Wed, 29 Feb 2012 03:21:23 -0800 (PST) In-Reply-To: <4F4DC876.3010809@delphij.net> References: <4F4DC876.3010809@delphij.net> Date: Wed, 29 Feb 2012 14:21:23 +0300 Message-ID: From: Sergey Kandaurov To: d@delphij.net Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: 
Jilles Tjoelker , freebsd-arch@freebsd.org Subject: Re: RFC: futimens(2) and utimensat(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Feb 2012 11:51:23 -0000

On 29 February 2012 10:40, Xin Li wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Hi,
>
> These are required by IEEE Std 1003.1-2008.  Patchset at:
>
> http://people.freebsd.org/~delphij/for_review/utimens.diff
>

First, thank you very much for doing this.

The ERRORS section of utimes(2) is still not updated (it is missing).
Funnily enough, that was the most difficult part when I implemented
these syscalls a year ago with great help from jilles@.
He could comment further on your patchset.

Otherwise it looks good and is pretty similar to my work, though for
some reason I didn't use a "const" modifier in my version, either for
the functions or for the syscall definitions in syscalls.master.

I also wrote a test to see how well the implementation detects
EACCES/EPERM with different UTIME_OMIT/UTIME_NOW values passed.
It should pass all tests as shown in the table (taken from somewhere
in austingroup):

   [a]       [b]       [c]
   times     file      file
   arg.      UID is
   NULL      owner     writable     Result
   !NULL     !owner    !writable

   N         o         w            success
   N         o         !w           success
   N         !o        w            success
   N         !o        !w           EACCES [1]
   !N        o         w            success
   !N        o         !w           success
   !N        !o        w            EPERM [2]
   !N        !o        !w           EPERM [3]

Here NULL also covers cases when:
- both fields are UTIME_NOW
- both fields are UTIME_OMIT.
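
The table above can be read as a small decision function. The following is only a minimal sketch of it in C, assuming an unprivileged caller; `expected_utimens_result` and its parameter names are hypothetical and appear in neither patch.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Expected outcome of a timestamp update, following the table above.
 * times_null is true when the times argument is NULL or falls under one
 * of the cases the table treats like NULL.
 * Returns 0 for success, otherwise the expected errno value.
 * Hypothetical helper for illustration only; not part of either patch.
 */
static int
expected_utimens_result(bool times_null, bool is_owner, bool is_writable)
{
	/* The file owner may always update the timestamps. */
	if (is_owner)
		return (0);
	/* Explicit times from a non-owner are refused with EPERM. */
	if (!times_null)
		return (EPERM);
	/* "Set to now" from a non-owner needs write access. */
	return (is_writable ? 0 : EACCES);
}
```

A test harness can compare the errno returned by each futimens()/utimensat() call with the value this function predicts for the same combination.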
OK, let's see how it does:

 1) Given: UTIME_NOW UTIME_NOW  o  w    gives: success  expected: success
 2) Given: UTIME_NOW UTIME_NOW  o  !w   gives: success  expected: success
 3) Given: UTIME_NOW UTIME_NOW  !o w    gives: EPERM    expected: success
 4) Given: UTIME_NOW UTIME_NOW  !o !w   gives: EPERM    expected: EACCES
 5) Given: (NULL)    (NULL)     o  w    gives: success  expected: success
 6) Given: (NULL)    (NULL)     o  !w   gives: success  expected: success
 7) Given: (NULL)    (NULL)     !o w    gives: success  expected: success
 8) Given: (NULL)    (NULL)     !o !w   gives: EACCES   expected: EACCES
 9) Given: (number)  (number)   o  w    gives: success  expected: success
10) Given: (number)  (number)   o  !w   gives: success  expected: success
11) Given: (number)  (number)   !o w    gives: EPERM    expected: EPERM
12) Given: (number)  (number)   !o !w   gives: EPERM    expected: EPERM

So your version does not treat both fields being UTIME_NOW as the
special case in which the caller must be granted the same privileges
as if a NULL pointer had been passed (tests 3 and 4 above).
My version handles this.

Your version uses two calls to vfs_timestamp() in different condition
branches. It could be done just once.
My version of getutimens() is more complicated but it handles the
case with both UTIME_NOW.

This is the older version, as last discussed with jilles.
It lacks the man page update and the compat32 parts (both have been
done since then, except for the missing ERRORS section in utimes(2);
e.g. my compat32 version is just like yours :)).
I started to commit my version (see r227447) but stopped because of
the missing ERRORS section, my lack of English to rewrite the
utimes(2) man page, and the overly complicated and wrong ERRORS
section in the existing utimes(2).

http://plukky.net/~pluknet/patches/utimes.2008.3.diff

It is pretty similar to yours, except that I did getutimens() a bit
differently. I had to introduce that complication to pass all the tests.
Take note of the private flags UTIMENS_NULL and UTIMENS_EXIT.
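
The special-case handling described above can also be exercised outside the kernel. What follows is a rough userland sketch, assuming a system whose <sys/stat.h> already defines UTIME_NOW and UTIME_OMIT; `classify_times`, `SKETCH_NULL` and `SKETCH_EXIT` are made-up names that only mirror the idea of the private flags, and the sketch does not fill in or copy the timestamps the way the kernel code has to.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/stat.h>	/* UTIME_NOW, UTIME_OMIT */
#include <time.h>

/* Illustrative flag names; assumptions of this sketch only. */
#define SKETCH_NULL	0x1	/* apply the permission rules of a NULL argument */
#define SKETCH_EXIT	0x2	/* both times omitted, nothing to update */

/*
 * Classify a times[2] argument.  Returns 0 and sets *flags, or returns
 * EINVAL when a tv_nsec value is neither special nor in range.
 */
static int
classify_times(const struct timespec *ts, int *flags)
{
	int i;

	*flags = 0;
	if (ts == NULL) {
		*flags |= SKETCH_NULL;
		return (0);
	}
	if (ts[0].tv_nsec == UTIME_OMIT && ts[1].tv_nsec == UTIME_OMIT)
		*flags |= SKETCH_EXIT;
	if (ts[0].tv_nsec == UTIME_NOW && ts[1].tv_nsec == UTIME_NOW)
		*flags |= SKETCH_NULL;
	for (i = 0; i < 2; i++) {
		/* The two special values bypass the range check. */
		if (ts[i].tv_nsec == UTIME_OMIT || ts[i].tv_nsec == UTIME_NOW)
			continue;
		if (ts[i].tv_nsec < 0 || ts[i].tv_nsec >= 1000000000L)
			return (EINVAL);
	}
	return (0);
}
```

Keeping the classification separate from the permission check is what lets a single helper serve both a descriptor-based and a path-based entry point.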
Index: sys/kern/vfs_syscalls.c
===================================================================
--- sys/kern/vfs_syscalls.c	(revision 220831)
+++ sys/kern/vfs_syscalls.c	(working copy)
@@ -94,6 +94,8 @@
 
 static int chroot_refuse_vdir_fds(struct filedesc *fdp);
 static int getutimes(const struct timeval *, enum uio_seg, struct timespec *);
+static int getutimens(const struct timespec *, enum uio_seg,
+    struct timespec *, int *);
 static int setfown(struct thread *td, struct vnode *, uid_t, gid_t);
 static int setfmode(struct thread *td, struct vnode *, int);
 static int setfflags(struct thread *td, struct vnode *, int);
@@ -3162,9 +3164,61 @@
 }
 
 /*
- * Common implementation code for utimes(), lutimes(), and futimes().
+ * Common implementation code for futimens(), utimensat().
  */
+#define UTIMENS_NULL	0x1
+#define UTIMENS_EXIT	0x2
 static int
+getutimens(usrtsp, tspseg, tsp, retflags)
+	const struct timespec *usrtsp;
+	enum uio_seg tspseg;
+	struct timespec *tsp;
+	int *retflags;
+{
+	int error;
+	struct timespec tsnow;
+
+	vfs_timestamp(&tsnow);
+	*retflags = 0;
+	if (usrtsp == NULL) {
+		tsp[0] = tsnow;
+		tsp[1] = tsnow;
+		*retflags |= UTIMENS_NULL;
+		return (0);
+	}
+	if (tspseg == UIO_SYSSPACE) {
+		tsp[0] = usrtsp[0];
+		tsp[1] = usrtsp[1];
+	} else if ((error = copyin(usrtsp, tsp, sizeof(*tsp) * 2)) != 0)
+			return (error);
+
+	if (tsp[0].tv_nsec == UTIME_OMIT && tsp[1].tv_nsec == UTIME_OMIT)
+		*retflags |= UTIMENS_EXIT;
+	if (tsp[0].tv_nsec == UTIME_NOW && tsp[1].tv_nsec == UTIME_NOW)
+		*retflags |= UTIMENS_NULL;
+
+	if (tsp[0].tv_nsec == UTIME_OMIT)
+		tsp[0].tv_sec = VNOVAL;
+	else if (tsp[0].tv_nsec == UTIME_NOW)
+		tsp[0] = tsnow;
+	else if (tsp[0].tv_nsec < 0 || tsp[0].tv_nsec >= 1000000000L)
+		return (EINVAL);
+
+	if (tsp[1].tv_nsec == UTIME_OMIT)
+		tsp[1].tv_sec = VNOVAL;
+	else if (tsp[1].tv_nsec == UTIME_NOW)
+		tsp[1] = tsnow;
+	else if (tsp[1].tv_nsec < 0 || tsp[1].tv_nsec >= 1000000000L)
+		return (EINVAL);
+
+	return (0);
+}
+
+/*
+ * Common implementation code for utimes(), lutimes(), futimes(), futimens(),
+ * and utimensat().
+ */
+static int
 setutimes(td, vp, ts, numtimes, nullflag)
 	struct thread *td;
 	struct vnode *vp;
@@ -3362,6 +3416,94 @@
 	return (error);
 }
 
+#ifndef _SYS_SYSPROTO_H_
+struct futimens_args {
+	int fd;
+	struct timespec *times;
+};
+#endif
+int
+futimens(struct thread *td, struct futimens_args *uap)
+{
+
+	return (kern_futimens(td, uap->fd, uap->times, UIO_USERSPACE));
+}
+
+int
+kern_futimens(struct thread *td, int fd, struct timespec *tptr,
+    enum uio_seg tptrseg)
+{
+	struct timespec ts[2];
+	struct file *fp;
+	int error, flags, vfslocked;
+
+	AUDIT_ARG_FD(fd);
+	if ((error = getutimens(tptr, tptrseg, ts, &flags)) != 0)
+		return (error);
+	if (flags & UTIMENS_EXIT)
+		return (0);
+	if ((error = getvnode(td->td_proc->p_fd, fd, &fp)) != 0)
+		return (error);
+	vfslocked = VFS_LOCK_GIANT(fp->f_vnode->v_mount);
+#ifdef AUDIT
+	vn_lock(fp->f_vnode, LK_SHARED | LK_RETRY);
+	AUDIT_ARG_VNODE1(fp->f_vnode);
+	VOP_UNLOCK(fp->f_vnode, 0);
+#endif
+	error = setutimes(td, fp->f_vnode, ts, 2, flags & UTIMENS_NULL);
+	VFS_UNLOCK_GIANT(vfslocked);
+	fdrop(fp, td);
+	return (error);
+}
+
+#ifndef _SYS_SYSPROTO_H_
+struct utimensat_args {
+	int fd;
+	const char *path;
+	const struct timespec *times;
+	int flag;
+};
+#endif
+int
+utimensat(struct thread *td, struct utimensat_args *uap)
+{
+
+	return (kern_utimensat(td, uap->fd, uap->path, UIO_USERSPACE,
+	    uap->times, UIO_USERSPACE, uap->flag));
+}
+
+int
+kern_utimensat(struct thread *td, int fd, char *path, enum uio_seg pathseg,
+    struct timespec *tptr, enum uio_seg tptrseg, int flag)
+{
+	struct nameidata nd;
+	struct timespec ts[2];
+	int error, flags, vfslocked;
+
+	if (flag & ~AT_SYMLINK_NOFOLLOW)
+		return (EINVAL);
+
+	if ((error = getutimens(tptr, tptrseg, ts, &flags)) != 0)
+		return (error);
+	NDINIT_AT(&nd, LOOKUP, ((flag & AT_SYMLINK_NOFOLLOW) ? NOFOLLOW :
+	    FOLLOW) | MPSAFE | AUDITVNODE1, pathseg, path, fd, td);
+	if ((error = namei(&nd)) != 0)
+		return (error);
+	/*
+	 * We are allowed to call namei() regardless of 2xUTIME_OMIT.
+	 * POSIX states:
+	 * "If both tv_nsec fields are UTIME_OMIT... EACCESS may be detected."
+	 * "Search permission is denied by a component of the path prefix."
+	 */
+	vfslocked = NDHASGIANT(&nd);
+	NDFREE(&nd, NDF_ONLY_PNBUF);
+	if ((flags & UTIMENS_EXIT) == 0)
+		error = setutimes(td, nd.ni_vp, ts, 2, flags & UTIMENS_NULL);
+	vrele(nd.ni_vp);
+	VFS_UNLOCK_GIANT(vfslocked);
+	return (error);
+}
+
 /*
  * Truncate a file given its path name.
  */

-- 
wbr,
pluknet

From owner-freebsd-arch@FreeBSD.ORG Wed Feb 29 12:04:47 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 32A4C106566B for ; Wed, 29 Feb 2012 12:04:47 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 9521A8FC1B for ; Wed, 29 Feb 2012 12:04:46 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q1TC4VLB084001; Wed, 29 Feb 2012 14:04:31 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q1TC4Vrd094904; Wed, 29 Feb 2012 14:04:31 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q1TC4V88094903; Wed, 29 Feb 2012 14:04:31 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 29 Feb
2012 14:04:31 +0200 From: Konstantin Belousov To: d@delphij.net Message-ID: <20120229120431.GX55074@deviant.kiev.zoral.com.ua> References: <4F4DC876.3010809@delphij.net> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="3BXhlsXkTW/kybY4" Content-Disposition: inline In-Reply-To: <4F4DC876.3010809@delphij.net> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: freebsd-arch@freebsd.org Subject: Re: RFC: futimens(2) and utimensat(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Feb 2012 12:04:47 -0000 --3BXhlsXkTW/kybY4 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable

On Tue, Feb 28, 2012 at 10:40:54PM -0800, Xin Li wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Hi,
>
> These are required by IEEE Std 1003.1-2008. Patchset at:
>
> http://people.freebsd.org/~delphij/for_review/utimens.diff

The patch looks fine; I have only some stylistic comments.

You misordered the functions in both Symbol.map and the man page.

The kern_utimensat() definition would benefit from making the second line
of the function shorter than 80 columns.

I suggest using a local struct vnode *vp variable instead of
dereferencing fp->f_vnode on each line.

Put the error and vfslocked declarations in kern_futimens() on the same
line.

I do not see a need for _SYS_SYSPROTO_H_ in new syscalls. We always
do have sysproto.h.

And omitting the generated files from the patch would make it easier to read.
--3BXhlsXkTW/kybY4 Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk9OFE8ACgkQC3+MBN1Mb4hykQCg27zu75p6Z/Uj4K7YlW6E6V0C NNoAn3RtY6vmZj1K61oVfOQuM6c5trM3 =O1m5 -----END PGP SIGNATURE----- --3BXhlsXkTW/kybY4-- From owner-freebsd-arch@FreeBSD.ORG Wed Feb 29 13:08:43 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 26661106566C; Wed, 29 Feb 2012 13:08:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail09.syd.optusnet.com.au (mail09.syd.optusnet.com.au [211.29.132.190]) by mx1.freebsd.org (Postfix) with ESMTP id AFCF98FC18; Wed, 29 Feb 2012 13:08:41 +0000 (UTC) Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au (c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136]) by mail09.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q1TD8SGV017994 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 1 Mar 2012 00:08:29 +1100 Date: Thu, 1 Mar 2012 00:08:28 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Sergey Kandaurov In-Reply-To: Message-ID: <20120229232250.G3812@besplex.bde.org> References: <4F4DC876.3010809@delphij.net> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="0-1303333959-1330520908=:3812" Cc: Jilles Tjoelker , d@delphij.net, freebsd-arch@FreeBSD.org Subject: Re: RFC: futimens(2) and utimensat(2) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 29 Feb 2012 13:08:43 -0000 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. 
--0-1303333959-1330520908=:3812 Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Wed, 29 Feb 2012, Sergey Kandaurov wrote: > On 29 February 2012 10:40, Xin Li wrote: >> >> These are required by IEEE Std 1003.1-2008. =A0Patchset at: >> >> http://people.freebsd.org/~delphij/for_review/utimens.diff I didn't look at this because it wasn't in the mail :-). > This is the older version last time discussed with jilles. > It misses man page update and compat32 parts (both were > done since then except missing ERROR section in utimes(2). > e.g. my compat32 version is just as yours :)). > I started to commit my version (you can see r227447) but > failed due to missing ERROR section, my lack of english to > rewrite utimes(2) man page, and too complicated and wrong > ERROR section in the existing utimes(2). > > http://plukky.net/~pluknet/patches/utimes.2008.3.diff > > It is pretty similar to your except I done getutimens() a bit different. > I had to introduce such complication to pass all tests. > Take note on private flags UTIMENS_NULL and UTIMENS_EXIT. > > Index: sys/kern/vfs_syscalls.c > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > --- sys/kern/vfs_syscalls.c=09(revision 220831) > +++ sys/kern/vfs_syscalls.c=09(working copy) > ... > static int > +getutimens(usrtsp, tspseg, tsp, retflags) > +=09const struct timespec *usrtsp; > +=09enum uio_seg tspseg; > +=09struct timespec *tsp; > +=09int *retflags; Should probably not use K&R function definitions in new code. > +{ > +=09int error; > +=09struct timespec tsnow; Structs should be sorted before scalars (and pointers). > + > +=09vfs_timestamp(&tsnow); Not used in all paths. > +=09*retflags =3D 0; Not used in all paths? 
> +	if (usrtsp == NULL) {
> +		tsp[0] = tsnow;
> +		tsp[1] = tsnow;
> +		*retflags |= UTIMENS_NULL;
> +		return (0);
> +	}
> +	if (tspseg == UIO_SYSSPACE) {
> +		tsp[0] = usrtsp[0];
> +		tsp[1] = usrtsp[1];
> +	} else if ((error = copyin(usrtsp, tsp, sizeof(*tsp) * 2)) != 0)
> +			return (error);

Indentation.

> +

Extra blank line.  Many more of these below.

> +	if (tsp[0].tv_nsec == UTIME_OMIT && tsp[1].tv_nsec == UTIME_OMIT)
> +		*retflags |= UTIMENS_EXIT;
> +	if (tsp[0].tv_nsec == UTIME_NOW && tsp[1].tv_nsec == UTIME_NOW)
> +		*retflags |= UTIMENS_NULL;
> +
> +	if (tsp[0].tv_nsec == UTIME_OMIT)
> +		tsp[0].tv_sec = VNOVAL;

tsp[0].tv_nsec is not initialized (except it is UTIME_OMIT, which might
be the same as VNOVAL).  (The patch seems to be missing the header part
that defines UTIME_OMIT.)  Most setattr vnops are sloppy about checking
both tv_sec and tv_nsec, but VATTR_NULL() sets both to VNOVAL for
setattrs that don't request a time change.  More care is actually
required in the opposite direction -- getattr defaults va_birthtime's
tv_sec.tv_nsec to -1.0, so that when a getattr doesn't understand
birthtime it comes back unchanged as -1.0, which gives the error value
(time_t)-1.  All attributes for getattr should be defaulted like this so
that all file systems don't have to know about them, but only
va_birthtime, va_fsid and va_rdev are (all the others default to stack
garbage).
> +	else if (tsp[0].tv_nsec == UTIME_NOW)
> +		tsp[0] = tsnow;
> +	else if (tsp[0].tv_nsec < 0 || tsp[0].tv_nsec >= 1000000000L)
> +		return (EINVAL);
> +
> +	if (tsp[1].tv_nsec == UTIME_OMIT)
> +		tsp[1].tv_sec = VNOVAL;
> +	else if (tsp[1].tv_nsec == UTIME_NOW)
> +		tsp[1] = tsnow;
> +	else if (tsp[1].tv_nsec < 0 || tsp[1].tv_nsec >= 1000000000L)
> +		return (EINVAL);

Is it possible to extend this API to support birthtimes (and with more
security control, ctimes)?  Encoding more in tv_nsec should do it.
Certain magic values in tsp[1].tv_nsec would indicate that there are
more than 2 entries in tsp[].  An extra copyin is needed to read the
extra entries (after reading tsp[1] to see if there are more).  Better
add this before the ABI solidifies.  This would have worked for utimes()
too, with magic in tsp[1].tv_usec, but this seems unnecessary now.

Bruce
--0-1303333959-1330520908=:3812--

From owner-freebsd-arch@FreeBSD.ORG Wed Feb 29 19:41:20 2012
Date: Wed, 29 Feb 2012 20:40:42 +0100
From: Luigi Rizzo
To: arch@freebsd.org
Message-ID: <20120229194042.GA10921@onelab2.iet.unipi.it>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="EVF5PPMfhYS0aIcm"
Subject: select/poll/usleep precision on FreeBSD vs Linux vs OSX

--EVF5PPMfhYS0aIcm
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

I have always been annoyed by the fact that FreeBSD rounds timeouts
in select/usleep/poll in very conservative ways, so i decided to
try how other systems behave in this respect. Attached is a simple
program that you should be able to compile and run on various OS
and see what happens.

Here are the results (HZ=1000 on the system under test, and FreeBSD
has the same behaviour since at least 4.11):

        |             Actual timeout
        |         select         |  poll  | usleep |
timeout |  FBSD  | Linux |  OSX  |  FBSD  |  FBSD  |
usec    |   9.0  |  Vbox |  10.6 |   9.0  |   9.0  |
--------+--------+-------+-------+--------+--------+
      1 |   2000 |    99 |     6 |      0 |   2000 |
     10 |   2000 |   109 |    15 |      0 |   2000 |
     50 |   2000 |   149 |    66 |      0 |   2000 |
    100 |   2000 |   196 |   133 |      0 |   2000 |
    500 |   2000 |   597 |   617 |      0 |   2000 |
   1000 |   2000 |  1103 |  1136 |   2000 |   2000 |
   1001 |   3000 |  1103 |  1136 |   2000 |   3000 | <---
   1500 |   3000 |  1608 |  1631 |   2000 |   3000 | <---
   2000 |   3000 |  2096 |  2127 |   3000 |   3000 |
   2001 |   4000 |       |       |   3000 |   4000 | <---
   3001 |   5000 |       |       |   4000 |   5000 | <---

Note how the rounding (poll has the timeout in milliseconds) affects
the actual timeouts when you are past multiples of 1/HZ.

I know that until we have some hi-res interrupt source there is no
hope to have better than 1/HZ granularity. However we are doing
much worse by adding up to 2 extra ticks. This makes apps less
responsive than they could be, and gives us no way to
"yield until the next tick".

So what I would like to do is add a sysctl (disabled by
default) that enables a better approximation of the desired delay.

I see in the kernel that all three syscalls loop around a blocking
function (tsleep or seltdwait), and do check the "actual" elapsed
time by calling getmicrouptime() or getnanouptime() around the
sleeping function.
So the actual timeout passed to tsleep does
not really matter (as long as it is greater than 0).

The only concern is that getmicrouptime()/getnanouptime() are documented
as "less precise, but faster to obtain". The question is how precise is
"less precise": do we have some way to get an upper bound for the
precision of the timers used in get*time(), so we can use that value
in the equation instead of the extra 1/HZ that tvtohz() puts in
after computing floor(timeout*HZ)?

For reference, below is the core of usleep and select/poll
(from kern_time.c and sys_generic.c)

usleep:
	getnanouptime(now)
	end = now + timeout;
	for (;;) {
		getnanouptime(now);
		delta = end - now;
		if (delta <= 0)
			break;
		tsleep(..., tvtohz(delta) )
	}

select/poll:
	itimerfix(timeout) // force at least 1/HZ
	getmicrouptime(now)
	end = now + timeout;
	for (;;) {
		delta = end - now;
		seltdwait(..., tvtohz(delta) )
		getmicrouptime(now);
		if (some_fd_is_ready() || now >= end)
			break;
	}

---
cheers
luigi

--EVF5PPMfhYS0aIcm--

From owner-freebsd-arch@FreeBSD.ORG Wed Feb 29 20:55:20 2012
Message-ID: <4F4E8AE4.6080705@FreeBSD.org>
Date: Wed, 29 Feb 2012 22:30:28 +0200
From: Alexander Motin
To: Luigi Rizzo
Cc: freebsd-arch@FreeBSD.org
Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX

On 29.02.2012 21:40, Luigi Rizzo wrote:
> I have always been annoyed by the fact that FreeBSD rounds timeouts
> in select/usleep/poll in very conservative ways, so i decided to
> try how other systems behave in this respect. Attached is a simple
> program that you should be able to compile and run on various OS
> and see what happens.
>
> Here are the results (HZ=1000 on the system under test, and FreeBSD
> has the same behaviour since at least 4.11):
>
>         |             Actual timeout
>         |         select         |  poll  | usleep |
> timeout |  FBSD  | Linux |  OSX  |  FBSD  |  FBSD  |
> usec    |   9.0  |  Vbox |  10.6 |   9.0  |   9.0  |
> --------+--------+-------+-------+--------+--------+
>       1 |   2000 |    99 |     6 |      0 |   2000 |
>      10 |   2000 |   109 |    15 |      0 |   2000 |
>      50 |   2000 |   149 |    66 |      0 |   2000 |
>     100 |   2000 |   196 |   133 |      0 |   2000 |
>     500 |   2000 |   597 |   617 |      0 |   2000 |
>    1000 |   2000 |  1103 |  1136 |   2000 |   2000 |
>    1001 |   3000 |  1103 |  1136 |   2000 |   3000 | <---
>    1500 |   3000 |  1608 |  1631 |   2000 |   3000 | <---
>    2000 |   3000 |  2096 |  2127 |   3000 |   3000 |
>    2001 |   4000 |       |       |   3000 |   4000 | <---
>    3001 |   5000 |       |       |   4000 |   5000 | <---
>
>
> Note how the rounding (poll has the timeout in milliseconds) affects
> the actual timeouts when you are past multiples of 1/HZ.
>
> I know that until we have some hi-res interrupt source there is no
> hope to have better than 1/HZ granularity. However we are doing
> much worse by adding up to 2 extra ticks. This makes apps less
> responsive than they could be, and gives us no way to
> "yield until the next tick".
>
> So what I would like to do is add a sysctl (disabled by
> default) that enables a better approximation of the desired delay.
>
> I see in the kernel that all three syscalls loop around a blocking
> function (tsleep or seltdwait), and do check the "actual" elapsed
> time by calling getmicrouptime() or getnanouptime() around the
> sleeping function . So the actual timeout passed to tsleep does
> not really matter (as long as it is greater than 0 ).
>
> The only concern is that getmicrouptime()/getnanouptime() are documented
> as "less precise, but faster to obtain". The question is how precise is
> "less precise": do we have some way to get an upper bound for the
> precision of the timers used in get*time(), so we can use that value
> in the equation instead of the extra 1/HZ that tvtohz() puts in
> after computing floor(timeout*HZ) ?

"less precise" there means they are updated on hardclock() invocation
every 1/HZ.
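
For readers of the archive: the FreeBSD select and usleep columns in the
quoted table are consistent with rounding the request up to whole ticks
and then adding one more tick. The following is a minimal sketch of that
model only. It is not the kernel's tvtohz(), and model_ticks() and
model_sleep_us() are made-up names used for illustration.

```c
#include <stdint.h>

/*
 * Simplified model (an assumption, not kernel code): round the requested
 * timeout up to a whole number of ticks, then add one extra tick.
 */
static int64_t
model_ticks(int64_t timeout_us, int64_t hz)
{
    int64_t tick_us = 1000000 / hz;     /* length of one tick in usec */
    int64_t ticks = (timeout_us + tick_us - 1) / tick_us;   /* round up */

    return (ticks + 1);                 /* the extra tick */
}

/* Sleep length, in usec, that the model predicts for a request. */
static int64_t
model_sleep_us(int64_t timeout_us, int64_t hz)
{
    return (model_ticks(timeout_us, hz) * (1000000 / hz));
}
```

With hz = 1000 this model gives 2000 us for requests of 1 to 1000 us,
3000 us for 1001 to 2000 us, and so on, which is what the FBSD select and
usleep columns report.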

> For reference, below is the core of usleep and select/poll
> (from kern_time.c and sys_generic.c)
>
> usleep:
> 	getnanouptime(now)
> 	end = now + timeout;
> 	for (;;) {
> 		getnanouptime(now);
> 		delta = end - now;
> 		if (delta <= 0)
> 			break;
> 		tsleep(..., tvtohz(delta) )
> 	}
>
> select/poll:
> 	itimerfix(timeout) // force at least 1/HZ
> 	getmicrouptime(now)
> 	end = now + timeout;
> 	for (;;) {
> 		delta = end - now;
> 		seltdwait(..., tvtohz(delta) )
> 		getmicrouptime(now);
> 		if (some_fd_is_ready() || now >= end)
> 			break;
> 	}
>

-- 
Alexander Motin

From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 00:33:52 2012
Date: Thu, 1 Mar 2012 11:33:46 +1100 (EST)
From: Bruce Evans
To: Luigi Rizzo
Cc: arch@FreeBSD.org
In-Reply-To: <20120229194042.GA10921@onelab2.iet.unipi.it>
Message-ID: <20120301071145.O879@besplex.bde.org>
References: <20120229194042.GA10921@onelab2.iet.unipi.it>
Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX

On Wed, 29 Feb 2012, Luigi Rizzo wrote:

> I have always been annoyed by the fact that FreeBSD rounds timeouts
> in select/usleep/poll in very conservative ways, so i decided to
> try how other systems behave in this respect. Attached is a simple
> program that you should be able to compile and run on various OS
> and see what happens.

Many are broken, indeed.  The simple program isn't attached.
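
Since the test program did not reach the list, here is a minimal sketch
of how such a measurement could be made. It is not Luigi's program.
measure_select_us() and elapsed_us() are hypothetical helpers, and the
choice of CLOCK_MONOTONIC for the measurement is an assumption.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/select.h>
#include <sys/time.h>
#include <time.h>

/* Difference b - a in microseconds. */
static int64_t
elapsed_us(const struct timespec *a, const struct timespec *b)
{
    int64_t ns = (int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL +
        (b->tv_nsec - a->tv_nsec);

    return (ns / 1000);
}

/*
 * Call select() with no descriptors and the given timeout, and return
 * how long the call actually took, in microseconds.
 */
static int64_t
measure_select_us(long timeout_us)
{
    struct timespec t0, t1;
    struct timeval tv;

    tv.tv_sec = timeout_us / 1000000;
    tv.tv_usec = timeout_us % 1000000;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    (void)select(0, NULL, NULL, NULL, &tv);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (elapsed_us(&t0, &t1));
}
```

Calling measure_select_us() for each request in the table (1, 10, 50, ...)
and printing the result would give one column of it. A single call is
noisy, so a real test would repeat each measurement and average.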

> Here are the results (HZ=1000 on the system under test, and FreeBSD
> has the same behaviour since at least 4.11):
>
>         |             Actual timeout
>         |         select         |  poll  | usleep |
> timeout |  FBSD  | Linux |  OSX  |  FBSD  |  FBSD  |
> usec    |   9.0  |  Vbox |  10.6 |   9.0  |   9.0  |
> --------+--------+-------+-------+--------+--------+
>       1 |   2000 |    99 |     6 |      0 |   2000 |

Try HZ = 20 (possible, at the user's option, even with an i8254 timer)
or lower (possible, at the user's option, with better timers).  FreeBSD
should then get timeouts of up to 2/HZ = 100000 us.  Applications must
deal with this range somehow (maybe by telling the user to configure
HZ better).  In ttcp, I found the timeouts unusable and resorted to an
option to busy-wait.  The rate-limiting timeouts in tools/netrate don't
work at all for small HZ, and barely work for large HZ, since timeouts
are similarly unusable.

It is possible to easily improve this to a maximum of only 1/HZ = 50000 us,
at some cost to efficiency, by waking up 1 tick early, checking if the
timeout has expired, and sleeping for another tick if it hasn't.  Waking
up early is needed anyway for long timeouts, in case the timeout
interrupts are running a little slower than their estimated frequency --
being off by just 1 part per million will accumulate to an error of
86400 us after 1 day, and FreeBSD cluster machines used to be off by
about 10%, giving an error of 2.4 hours for your appointment next day.
The error will often be 100's of parts per million, giving an error of
10's of seconds per day.  To handle the 10% error, timeouts must wake up
10% early.
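
The drift figures above are plain arithmetic: an error of 1 part per
million is 1 us per second of elapsed time. A small sketch that
reproduces them, with drift_us() as a made-up helper name:

```c
#include <stdint.h>

/*
 * Accumulated timing error, in microseconds, after elapsed_s seconds
 * when the clock is off by error_ppm parts per million.
 * One ppm of one second is exactly one microsecond.
 */
static int64_t
drift_us(int64_t elapsed_s, int64_t error_ppm)
{
    return (elapsed_s * error_ppm);
}
```

One day at 1 ppm gives drift_us(86400, 1) = 86400 us. One day at 10%
(100000 ppm) gives 8640000000 us, which is 8640 seconds or 2.4 hours.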

select(), poll(), and nanosleep() all check that the timeout has expired
when they wake up, but they don't set it to wake up early, and their
check for whether it has expired is even fuzzier than the timeout
granularity (it uses the broken-as-designed getnanouptime() API to get
the current time; timecounters are only updated every
$(sysctl kern.timecounter.tick) ticks, by default to limit their update
frequency to about 1000 Hz when HZ is configured to be much larger than
1000).  This ensures an extra, unnecessary inaccuracy of up to 1000 us
whenever the timeout wakes up a little early according to the retarded
clock used to measure the current time.  I think the error is fail-safe
-- it may extend the timeout by as much as 1000 usec.

>      10 |   2000 |   109 |    15 |      0 |   2000 |
>      50 |   2000 |   149 |    66 |      0 |   2000 |
>     100 |   2000 |   196 |   133 |      0 |   2000 |
>     500 |   2000 |   597 |   617 |      0 |   2000 |

You must have synced with timer interrupts to get the above.  Timeouts
in the current FreeBSD implementation should average the actual timeout
rounded up to a multiple of 1/HZ seconds, plus 0.5/HZ seconds, and thus
average 1.5/HZ = 1500 us for short timeouts.

Someone apparently broke poll() on FreeBSD :-(.

Linux and OSX must be using busy-waiting or expensive timer reprogramming
for short timeouts to work.  Linux-2.6.10 has 3491 references to
udelay().  This seems to correspond to FreeBSD's DELAY().  The Linux
nanosleep() code is too complicated for me to easily see what it is doing
for short timeouts, but I noticed that it isn't missing clock_nanosleep()
like FreeBSD does, and presumably has a collaterally non-broken
nanosleep().  (POSIX requires nanosleep() to sleep in real time, to be
bug for bug compatible with old sleep(), but this is often not what is
wanted.  So POSIX invented clock_nanosleep() so as to be able to sleep
on the monotonic clock and on any other clock of interest.  FreeBSD
doesn't know anything about this, and only has nanosleep(), which sleeps
on a wrong clock (the monotonic one).)
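
For illustration, on a system that does provide the POSIX
clock_nanosleep() interface, choosing the clock explicitly might look
like the sketch below. sleep_monotonic_us() is a hypothetical wrapper,
and the use of an absolute deadline with TIMER_ABSTIME is one possible
choice, not something prescribed in this thread.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <time.h>

/*
 * Sleep for timeout_us microseconds measured on CLOCK_MONOTONIC.
 * An absolute deadline is used so that a restart after EINTR does not
 * lengthen the total sleep.  Returns 0 or an errno value.
 */
static int
sleep_monotonic_us(long timeout_us)
{
    struct timespec deadline;
    int error;

    if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
        return (errno);
    deadline.tv_sec += timeout_us / 1000000;
    deadline.tv_nsec += (timeout_us % 1000000) * 1000;
    if (deadline.tv_nsec >= 1000000000L) {      /* carry into seconds */
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }
    do {
        /* clock_nanosleep() returns the error number directly. */
        error = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
            &deadline, NULL);
    } while (error == EINTR);
    return (error);
}
```

Passing CLOCK_REALTIME instead would make the same wrapper follow steps
of the wall clock, which is the distinction drawn above.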

>    1000 |   2000 |  1103 |  1136 |   2000 |   2000 |
>    1001 |   3000 |  1103 |  1136 |   2000 |   3000 | <---
>    1500 |   3000 |  1608 |  1631 |   2000 |   3000 | <---
>    2000 |   3000 |  2096 |  2127 |   3000 |   3000 |
>    2001 |   4000 |       |       |   3000 |   4000 | <---
>    3001 |   5000 |       |       |   4000 |   5000 | <---
>
> Note how the rounding (poll has the timeout in milliseconds) affects
> the actual timeouts when you are past multiples of 1/HZ.

Also, timeouts that are just before a multiple of 1/HZ may be turned
into 1/HZ + 1000 usec by the inaccuracy of getnanouptime().  E.g., if
the requested timeout is 999, HZ is 1000, and tc_tick is 1, 999 should
be turned into 2 ticks (average 1500 usec).  Then, if the timeout for
the second tick is a little early according to the retarded clock, the
total timeout will be extended by another tick, to 3 ticks.  So all
timeouts may be extended by up to 2 ticks, instead of only ones just
larger than a multiple of 1/HZ being extended by the full 2 ticks.
This can be made much worse by setting tc_tick to a large value.

> I know that until we have some hi-res interrupt source there is no
> hope to have better than 1/HZ granularity. However we are doing
> much worse by adding up to 2 extra ticks. This makes apps less
> responsive than they could be, and gives us no way to
> "yield until the next tick".
>
> So what I would like to do is add a sysctl (disabled by
> default) that enables a better approximation of the desired delay.

It is possible to get a timeout every tick in userland using periodic
itimers.  Maybe not for the initial timeout, but after that the timeouts
repeat with the specified period (rounded up to the next tick boundary,
but not up to the next + 1).

> I see in the kernel that all three syscalls loop around a blocking
> function (tsleep or seltdwait), and do check the "actual" elapsed
> time by calling getmicrouptime() or getnanouptime() around the
> sleeping function . So the actual timeout passed to tsleep does
> not really matter (as long as it is greater than 0 ).
>
> The only concern is that getmicrouptime()/getnanouptime() are documented
> as "less precise, but faster to obtain". The question is how precise is
> "less precise": do we have some way to get an upper bound for the

It is tc_tick/HZ seconds (usually 1/HZ, but if you set HZ to be > 1000
to reduce this problem, then tc_tick will bite you unless you change it
down).  Both may be acceptable on a real-timeish system that wants short
timeouts at the cost of efficiency.  But I don't like timeouts.  When
they are used a lot, they are a form of busy waiting.  This is only
acceptable if you have CPU to burn.

> precision of the timers used in get*time(), so we can use that value
> in the equation instead of the extra 1/HZ that tvtohz() puts in
> after computing floor(timeout*HZ) ?

It's always worse than 1/HZ if you use these broken-as-designed APIs.

> For reference, below is the core of usleep and select/poll
> (from kern_time.c and sys_generic.c)
>
> usleep:
> 	getnanouptime(now)

Should use nanotime().  This also fixes the clock id.  But check whether
this is really required by POSIX.  It means that if someone steps the
clock, then the sleep may be extended or truncated significantly.  You
also need to fix the timeout used here to sleep in real time, so that it
doesn't wake up late by an amount of (negative of the step).  Timeouts
also sleep in monotonic time.  Sleeping in monotonic time seems to be
wrong in more cases than it is correct.  Another broken area is
suspend/resume.  Suspending for an hour stops all timeouts for an hour.
Then on resume, they aren't adjusted, so they occur an hour later.
There is still code in kern_timeout.c under APM_FIXUP_CALLTODO which was
supposed to fix this problem for apm, but it is so poorly maintained
that it never even compiled in any committed version, and there is no
option for it.  Problems in this area were supposed to have been fixed
by using monotonic time more, but they seem to have actually been
increased.

> 	end = now + timeout;

This is quite broken too:
- it doesn't arrange to wake up early.  Subtract 1 tick here for a quick
  fix for short timeouts.  Maybe 10% for timeouts longer than 10 ticks.
- it can overflow, giving undefined behaviour (in practice, just bad
  results later)
- itimerfix() is supposed to be used to prevent such overflows, but this
  is quite broken:
  - note that itimerfix() is quite different from timevalfix().  Although
    its name is spelled with an 'i' and a 'fix', itimerfix() is useful
    generally and doesn't fix anything except for bogusly adjusting
    fractional ticks.  It is also confusing because its name is spelled
    without a `val', since it was intended to be used only for itimers
    which used to use generic times (timevals) back before better
    representations of times existed.  What it mostly does is validity
    checking for timevals.  OTOH, timevalfix() does pure fixing (to
    handle carry after (possibly multiple) additions and subtractions).
  - itimerfix() used to limit tv_sec to 100 million seconds (else
    EINVAL), but someone broke it by removing this check
  - someone didn't update the man pages which document this limit.  It is
    still documented in at least:
    - setitimer(2).  This is its primary API
    - alarm(3)
    - ualarm(3).  This also bogusly documents what a microsecond is
    Grepping for 100000000 and 100.*million also shows many bad
    descriptions of the limit on tv_nsec for APIs that take a timespec
    arg.  This limit is described verbosely as "1000 million".  Of
    course, "1 billion" cannot be used since it is ambiguous, and
    1000000000 should not be used because it is hard to see the number
    of zeros in it, but millions aren't naturally associated with the
    nano prefix.  I like to write such numbers in minimal floating point
    scientific notation (e.g., 1e9) or as powers of 10 (e.g., 10**9).
    1e9 is better because it is shorter and doesn't need an ambiguous
    '^' operator or a less common Fortran '**' operator for
    exponentiation.  Not sure if this is best for man pages.
    select() and kqueue() used to have the same limit, but it was never
    documented for them.  Not sure about kqueue.
  - this limit isn't permitted by POSIX.
  - here we have timespecs.  itimespecfix() exists too (although
    timespecfix() doesn't).  itimespecfix() has the same semantics as
    itimerfix() (except of course it obviously acts on timespecs while
    itimerfix() unobviously acts on timevals).  itimespecfix() never had
    the limit on tv_sec.  But itimespecfix() has the same bogus rounding
    up of fractional ticks as itimerfix().  This is only done for
    fractional ticks below 1.  This should be unnecessary provided
    everywhere else is careful to round up and usually to add 1.
    timespec*fix() never did the adding 1 part.  I think they are just
    defending against sloppy conversions that produce 0 ticks from small
    but nonzero timeouts.  A timeout of 0 ticks means to sleep forever.
    But relevant higher levels also defend against this, by silently
    changing 0 ticks to 1 tick.
- back to bugs at the level of nanosleep().  It can't use itimerfix()
  since it deals with timespecs.  It should call something like
  itimespecfix() (except that should be named timespeccheck()...).  But
  IIRC, itimespecfix() didn't exist when nanosleep() was implemented.
  Also, itimespecfix() has wrong semantics for use here.  Its bogus
  rounding up is exactly what you don't want.  I think it has other
  slightly mismatched semantics (perhaps a difference in error numbers),
  but the others are easy to fix up.  So nanosleep() rolled its own
  checking, and got it wrong.

The overflow seems easy to fix as a side effect of waking up early:
- for preposterously long timeouts, wake up after 100 million seconds
  or similar, instead of 10% early.  This delays the problem.  100
  million seconds is a little over 3 years, so it won't expire in
  practice, and no one would care if it did.  But this method (and the
  old limit) breaks down about 3 years before the time_t's roll over.
  So in 2036, you need to limit the timeout to only 2 years instead of
  3, if you are still using 31-bit (sic) time_t's with a useless sign
  bit then.  And in 2105, you need to limit the timeout to only 2 years
  instead of 3, if you are still using 32-bit unsigned time_t's then.

Be careful with overflow even with this fix.  Applications probing for
kernel bugs will try using maximal tv_sec.  Since POSIX doesn't allow
rejecting these like the old 100 million second limit did, we must start
with a long sleep and retain the original preposterous timeout so that
we can return it as the unslept time.  We have to be careful about
overflow when adding the preposterous time to the current time.  Large
time_t's don't do anything to limit this overflow, since the application
can ask for the maximum (2**31-1 or 2**63-1 in practice), and since the
current time is surely >= 1, adding the current time to the preposterous
time surely gives overflow.  Example of a not-unreasonable POSIX
application to probe for bugs in this area:

	set tv.tv_sec and tv.tv_nsec to the maximum possible ("infinity")
	arrange for a signal after 1 second
	nanosleep(&tv, &tv2);
	check that tv2 is about 1 second below the maximum possible

> 	for (;;) {
> 		getnanouptime(now);

Our original `now' and thus `end' are retarded by up to tc_tick/HZ.
This `now' is retarded too.  This complicates the analysis and changes
its results.

> 		delta = end - now;

So `delta' might not even be retarded.  If `end' is normal but `now' is
retarded, then `delta' is advanced.  This case is fail-safe but not what
you want (sleep again).  If `end' is retarded but `now' is normal, then
the retardation in `delta' is maximal (still less than tc_tick/HZ).
This case is fail-unsafe (return up to tc_tick/HZ early).  Other cases
are in between these, with the retardations partially or completely
cancelling.  By making tc_tick large, the fail-unsafe case can be made
to more than overcome the safety margin of about 1/HZ given by always
adding 1.
I only just noticed this detail.

> 		if (delta <= 0)
> 			break;
> 		tsleep(..., tvtohz(delta) )
> 	}
>
> select/poll:
> 	itimerfix(timeout) // force at least 1/HZ

That's "bogusly force".  It doesn't add 1 or do anything if the timeout
is above 1/HZ, but both are done later.

> 	getmicrouptime(now)
> 	end = now + timeout;

Same retardation and overflow bugs.

> 	for (;;) {

Missing showing getmicrouptime(now) here?

> 		delta = end - now;
> 		seltdwait(..., tvtohz(delta) )

tvtohz() does round up and add 1.  Its interaction with the above is
unclear.  I think there is double rounding up in some cases.  For
example, even without the retardation, a delta that wants to be
precisely 1 tick (much more than it should be due to the first rounding
up) may be 1 us over 1 tick due to minor inaccuracies.  Then tvtohz()
will round up again.  nanosleep() has the same problem.  To handle this,
we may need to subtract 1 more from the result of tvtohz():
- always subtract 10%
- subtract 1 to compensate for tvtohz() always adding 1.  See the
  periodic itimer code for this fixup.  Periodic itimers need to be more
  careful for long timeouts too.
- subtract 1 more in case delta is a little too large.  Better look at
  delta and not always do this.
- if the resulting timeout is <= 0, change it to 1.

> 		getmicrouptime(now);
> 		if (some_fd_is_ready() || now >= end)
> 			break;
> 	}

Activity on the fd's is likely to give more wakeups than nanosleep()
gets, since the latter only gets woken up for its own timeout and
signals.  For this and other reasons, large timeouts are probably less
of a problem for select() than for nanosleep().  And for poll(), you
just can't ask for a large timeout (the limit is (2**31-1) milliseconds
~= 24.8 days with 32-bit ints, as is the case on all supported arches).

It remains to explain why the above results show that poll() but not
select() is broken for small timeouts (they are turned into 0 us for
poll() and 2000 us for select()).
Well, the granularity for poll is 1 ms, so this looks like just an
application bug, with timeouts of < 1 ms being rounded down to 0 before
the kernel sees them.  But how do the other OS's see it?  This might be
due to them taking a long time to handle null timeouts, and their times
actually being reported correctly.

I don't believe the times of 0 and 2000 us reported for FreeBSD.  You
can't do anything in 0 us, and 2000 us is too round a number.  These
round numbers might be due to using the broken as designed
CLOCK_MONOTONIC_FAST_N_BROKEN clock ids.  These are collateral with
getnanouptime() etc.  They are even more broken as designed, since
provided you have non-slow timecounter hardware, the time for
clock_gettime() is dominated by syscall overhead, so
CLOCK_MONOTONIC_FAST_N_BROKEN is only a few percent faster than
CLOCK_MONOTONIC.  Typical numbers are:
- 12 (9?) cycles for the hardware part of a TSC timecounter on an old
  Athlon64.  ~250 cycles for the total syscall overhead for an old
  version of FreeBSD UP on almost any x86.  Possible savings from the
  "fast" method: about 5%.  More like 10% due to extra software overhead
  for the timecounter.
- 42 (?) cycles for the hardware part of a TSC timecounter on Phenom+
  (synchronization across CPUs makes it much slower).  Similarly for
  most modern multi-core CPUs.  Intel CPUs were much slower than 12
  cycles even for old single-core ones.  ~350 cycles for the total
  syscall overhead for a current version of FreeBSD SMP on almost any
  x86.  Possible savings from the "fast" method: about 15%.  IIRC, SMP
  only costs 20-30 cycles, with the extra 100 being mainly from extra
  layers.
- 1000-2000 nsec (up to ~8000 cycles) for an ACPI-FAST timecounter.
  These are actually ACPI-SLOW.  Now the hardware overhead dominates.
  HPET is better, but has become common at the same time as the TSC
  became usable for SMP, so it is rarely useful as a timecounter.
- up to 5000 nsec for an i8254 timecounter on a modern CPU.
Getting slow with more bridges between the CPU and the ISA bus.
- up to 30000 nsec for an i8254 timecounter on a 486.

OTOH, getnanouptime() takes about 12 cycles (very fuzzy estimate), so it is 2-3 times faster than nanouptime() using the fastest TSC hardware, and about 5-6 times faster than nanouptime() using slower TSC hardware.

IIRC, itimer code doesn't do these checks of the time after wakeups at all. Not sure what kqueue does.

I haven't really touched nanosleep(), but have some small fixes near the tvtohz() call for select() and poll().

% Index: sys_generic.c
% ===================================================================
% RCS file: /home/ncvs/src/sys/kern/sys_generic.c,v
% retrieving revision 1.131
% diff -u -2 -r1.131 sys_generic.c
% --- sys_generic.c 5 Apr 2004 21:03:35 -0000 1.131
% +++ sys_generic.c 13 Aug 2009 11:21:29 -0000
% @@ -806,9 +797,5 @@
% getmicrouptime(&rtv);
% timevaladd(&atv, &rtv);
% - } else {
% - atv.tv_sec = 0;
% - atv.tv_usec = 0;
% }
% - timo = 0;
% TAILQ_INIT(&td->td_selq);
% mtx_lock(&sellock);
% @@ -824,5 +811,7 @@
% if (error || td->td_retval[0])
% goto done;
% - if (atv.tv_sec || atv.tv_usec) {
% + if (tvp == NULL)
% + timo = 0;
% + else {
% getmicrouptime(&rtv);
% if (timevalcmp(&rtv, &atv, >=))

Unrelated cleanups of initialization.

% @@ -830,13 +819,10 @@
% ttv = atv;
% timevalsub(&ttv, &rtv);
% - timo = ttv.tv_sec > 24 * 60 * 60 ?
% - 24 * 60 * 60 * hz : tvtohz(&ttv);
% + timo = tvtohz(&ttv);

The special case for timeouts of > 1 day defeats the careful overflow handling in tvtohz(). It is supposed to be for avoiding overflow, but tvtohz() avoids it already, while the above causes it whenever hz is large but not preposterously so, so that 24 * 60 * 60 * hz overflows. hz only needs to be 24856 for overflow. 25000 is almost reasonable for excessive polling, and I have tested lapic timer interrupts though not hz at 1MHz.
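
To make the overflow arithmetic concrete, here is a minimal sketch, assuming a 32-bit signed int as on all supported arches. The helper names are hypothetical and are not taken from sys_generic.c; the product is widened to 64 bits only so that the check itself cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

#define SECS_PER_DAY	(24 * 60 * 60)	/* 86400 */

/*
 * Hypothetical helper: nonzero if SECS_PER_DAY * hz would not fit in a
 * 32-bit signed int, i.e. if the "clamp to 1 day" expression overflows.
 */
static int
day_clamp_overflows(int32_t hz)
{
	assert(hz > 0);
	/* Widen first so that this test cannot overflow itself. */
	return ((int64_t)SECS_PER_DAY * hz > INT32_MAX);
}

/* Hypothetical helper: smallest positive hz for which the clamp overflows. */
static int32_t
min_overflowing_hz(void)
{
	return ((int32_t)(INT32_MAX / SECS_PER_DAY) + 1);
}
```

Under these assumptions, day_clamp_overflows(1000) is false and day_clamp_overflows(25000) is true, so hz = 25000 already breaks the clamp.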
However, reduction of the timeout to a value that will wake up 10% early is the first step of fixing the bugs discussed above. Reduction to 1 day accomplishes this for timeouts of >= 1.1 days provided it doesn't overflow.

% }
%
% /*
% - * An event of interest may occur while we do not hold
% - * sellock, so check TDF_SELECT and the number of
% - * collisions and rescan the file descriptors if
% - * necessary.
% + * An event of interest may have occurred while we did not hold
% + * sellock. Check for this and rescan if necessary.
% */
% mtx_lock_spin(&sched_lock);

Unrelated.

% @@ -985,4 +978,5 @@
% if (uap->timeout != INFTIM) {
% atv.tv_sec = uap->timeout / 1000;
% + /* XXX wrong if timeout < 0. */
% atv.tv_usec = (uap->timeout % 1000) * 1000;

Since the '%' operator is broken for negative values in C, this gives a negative tv_usec when the timeout is negative.

% if (itimerfix(&atv)) {

itimerfix() then returns EINVAL, and the syscall fails. But a timeout of < 0 should be equivalent to a timeout of 0, as it is for select() and nanosleep(). This can be implemented either by fixing C, or by fixing the expression, or by just changing negative timeouts to 0.

% @@ -992,9 +986,5 @@
% getmicrouptime(&rtv);
% timevaladd(&atv, &rtv);
% - } else {
% - atv.tv_sec = 0;
% - atv.tv_usec = 0;
% }
% - timo = 0;
% TAILQ_INIT(&td->td_selq);
% mtx_lock(&sellock);
% @@ -1006,9 +996,11 @@
% mtx_unlock(&sellock);
%
% - error = pollscan(td, (struct pollfd *)bits, nfds);
% + error = pollscan(td, bits, nfds);
% mtx_lock(&sellock);
% if (error || td->td_retval[0])
% goto done;
% - if (atv.tv_sec || atv.tv_usec) {
% + if (uap->timeout == INFTIM)
% + timo = 0;
% + else {
% getmicrouptime(&rtv);
% if (timevalcmp(&rtv, &atv, >=))

Unrelated cleanups.

% @@ -1016,12 +1008,8 @@
% ttv = atv;
% timevalsub(&ttv, &rtv);
% - timo = ttv.tv_sec > 24 * 60 * 60 ?
% - 24 * 60 * 60 * hz : tvtohz(&ttv);
% + timo = tvtohz(&ttv);

As for select().
% }
% - /*
% - * An event of interest may occur while we do not hold
% - * sellock, so check TDF_SELECT and the number of collisions
% - * and rescan the file descriptors if necessary.
% - */
% +
% + /* Rescan if necessary, as above. */

Don't repeat comments ad nauseam. There used to be large grammar errors in these comments. -current may have cleaned them up differently.

% mtx_lock_spin(&sched_lock);
% if ((td->td_flags & TDF_SELECT) == 0 || nselcoll != ncoll) {

Bruce

From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 00:44:04 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 61B8F106564A for ; Thu, 1 Mar 2012 00:44:04 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 228918FC0A for ; Thu, 1 Mar 2012 00:44:03 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id 0B3937300A; Thu, 1 Mar 2012 02:02:19 +0100 (CET) Date: Thu, 1 Mar 2012 02:02:19 +0100 From: Luigi Rizzo To: Bruce Evans Message-ID: <20120301010219.GA14508@onelab2.iet.unipi.it> References: <20120229194042.GA10921@onelab2.iet.unipi.it> <20120301071145.O879@besplex.bde.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120301071145.O879@besplex.bde.org> User-Agent: Mutt/1.4.2.3i Cc: arch@FreeBSD.org Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 00:44:04 -0000 On Thu, Mar 01, 2012 at 11:33:46AM +1100, Bruce Evans wrote: > On Wed, 29 Feb 2012, Luigi Rizzo wrote: > > >I have always been annoyed by the fact that FreeBSD rounds timeouts > >in
select/usleep/poll in very conservative ways, so i decided to > >try how other systems behave in this respect. Attached is a simple > >program that you should be able to compile and run on various OS > >and see what happens. > > Many are broken, indeed. > > The simple program isn't attached.

attachment stripped by the mailing list, retrying to put it inline (and comments on a followup email)

----
/*
 * test minimum select time
 *
 * ./prog usec [method [duration]]
 */
#include <sys/time.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum { M_SELECT = 0, M_POLL, M_USLEEP };
static const char *names[] = { "select", "poll", "usleep" };

int main(int argc, char *argv[])
{
	struct timeval ta, tb;
	int usec = 1, total = 0, method = M_SELECT, count = 0;

	if (argc > 1)
		usec = atoi(argv[1]);
	if (usec <= 0)
		usec = 1;
	else if (usec > 500000)
		usec = 500000;
	if (argc > 2) {
		if (!strcmp(argv[2], "poll"))
			method = M_POLL;
		else if (!strcmp(argv[2], "usleep"))
			method = M_USLEEP;
	}
	if (argc > 3)
		total = atoi(argv[3]);
	if (total < 1)
		total = 1;
	else if (total > 10)
		total = 10;
	fprintf(stderr, "testing %s for %dus over %ds\n", names[method], usec, total);
	gettimeofday(&ta, NULL);
	for (;;) {
		if (method == M_SELECT) {
			struct timeval to = { 0, usec };
			select(0, NULL, NULL, NULL, &to);
		} else if (method == M_POLL) {
			poll(NULL, 0, usec/1000);
		} else {
			usleep(usec);
		}
		count++;
		gettimeofday(&tb, NULL);
		timersub(&tb, &ta, &tb);
		if (tb.tv_sec > total)
			break;
	}
	fprintf(stderr, "%dus actually took %dus\n", usec,
	    (int)(tb.tv_sec * 1000000 + tb.tv_usec) / count );
	return 0;
}
-----

From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 01:05:01 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4ACD91065673 for ; Thu, 1 Mar 2012 01:05:01 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 
066B38FC19 for ; Thu, 1 Mar 2012 01:05:00 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id CC6957300A; Thu, 1 Mar 2012 02:23:15 +0100 (CET) Date: Thu, 1 Mar 2012 02:23:15 +0100 From: Luigi Rizzo To: Bruce Evans Message-ID: <20120301012315.GB14508@onelab2.iet.unipi.it> References: <20120229194042.GA10921@onelab2.iet.unipi.it> <20120301071145.O879@besplex.bde.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120301071145.O879@besplex.bde.org> User-Agent: Mutt/1.4.2.3i Cc: arch@FreeBSD.org Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 01:05:01 -0000 On Thu, Mar 01, 2012 at 11:33:46AM +1100, Bruce Evans wrote: > On Wed, 29 Feb 2012, Luigi Rizzo wrote: > > >I have always been annoyed by the fact that FreeBSD rounds timeouts > >in select/usleep/poll in very conservative ways, so i decided to > >try how other systems behave in this respect. Attached is a simple > >program that you should be able to compile and run on various OS > >and see what happens. > > Many are broken, indeed. > > The simple program isn't attached. ... 
> >
> >         |             Actual timeout              |
> >         |         select         |  poll | usleep|
> > timeout |  FBSD | Linux |   OSX  |  FBSD |  FBSD |
> >    usec |   9.0 |  Vbox |  10.6  |   9.0 |   9.0 |
> > --------+-------+-------+--------+-------+-------+
> >       1    2000      99        6       0    2000
> >      10    2000     109       15       0    2000
> >      50    2000     149       66       0    2000
> >     100    2000     196      133       0    2000
> >     500    2000     597      617       0    2000
> >    1000    2000    1103     1136    2000    2000
> >    1001    3000    1103     1136    2000    3000 <---
> >    1500    3000    1608     1631    2000    3000 <---
> >    2000    3000    2096     2127    3000    3000
> >    2001    4000                     3000    4000 <---
> >    3001    5000                     4000    5000 <---
> >
> >Note how the rounding (poll has the timeout in milliseconds) affects
>
> You must have synced with timer interrupts to get the above. Timeouts

yes i have -- the test code does almost nothing after returning from a select, on a system that does some amount of work times could be up to 1000us shorter. Still a huge error on short timeouts.

I should also comment that these are average values on an otherwise idle system -- i will try to post a histogram of the actual values, it might well be that osx and linux have quantized values very different from the average (though this would violate the specs, so i suspect instead that they have some cheap one-shot timers).

For FreeBSD I have also rounded the bsd values (actual averages are -1/+3us over 1sec experiments).

> timeouts at the cost of efficiency. But I don't like timeouts. When
> they are used a lot, they are a form of busy waiting. This is only
> acceptable if you have CPU to burn).

sometimes you have no other way to get a notification.

> It remains to explain why the above results show that poll() but not
> select() is broken for small timeouts (they are turned into 0 us for

no, it is just my application that does the rounding down, as the API only accepts milliseconds.

Thanks for the extensive comments.
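
For illustration only, a minimal sketch of how an application could round the microsecond request up instead of down before handing it to poll(), so that a nonzero request never turns into a zero timeout. The helper name usec_to_poll_ms() is hypothetical and is not part of the test program.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper: convert a timeout in microseconds to the
 * millisecond argument of poll(), rounding up so that 1..1000 us all
 * become 1 ms rather than 0.
 */
static int
usec_to_poll_ms(long usec)
{
	long ms;

	assert(usec >= 0);
	ms = usec / 1000 + (usec % 1000 != 0);	/* ceiling division */
	return (ms > INT_MAX ? INT_MAX : (int)ms);
}
```

With this, a request of 500 us asks the kernel for 1 ms instead of 0, at the cost of sleeping longer than requested.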
cheers luigi From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 01:16:53 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 576021065670 for ; Thu, 1 Mar 2012 01:16:53 +0000 (UTC) (envelope-from bakul@bitblocks.com) Received: from mail.bitblocks.com (ns1.bitblocks.com [173.228.5.8]) by mx1.freebsd.org (Postfix) with ESMTP id 3EC078FC16 for ; Thu, 1 Mar 2012 01:16:53 +0000 (UTC) Received: from bitblocks.com (localhost [127.0.0.1]) by mail.bitblocks.com (Postfix) with ESMTP id CB4AC1CC32; Wed, 29 Feb 2012 16:58:54 -0800 (PST) To: Bruce Evans In-reply-to: Your message of "Thu, 01 Mar 2012 11:33:46 +1100." <20120301071145.O879@besplex.bde.org> References: <20120229194042.GA10921@onelab2.iet.unipi.it> <20120301071145.O879@besplex.bde.org> Comments: In-reply-to Bruce Evans message dated "Thu, 01 Mar 2012 11:33:46 +1100." Date: Wed, 29 Feb 2012 16:58:54 -0800 From: Bakul Shah Message-Id: <20120301005854.CB4AC1CC32@mail.bitblocks.com> Cc: arch@FreeBSD.org Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 01:16:53 -0000 On Thu, 01 Mar 2012 11:33:46 +1100 Bruce Evans wrote: > Linux and OSX must be using busy-waiting or expensive timer > reprogramming for short timeouts to work. Linux-2.6.17 or later have two options: CONFIG_NO_HZ for on demand timer interrupts (to reduce power use on idle systems) and CONFIG_HIGH_RES_TIMERS for as accurate timers as h/w would allow. And yes, timers are reprogrammed (as per a June 23, 2006 kerneltrap.org article). 
From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 01:47:02 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 41DE3106564A for ; Thu, 1 Mar 2012 01:47:02 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id 068428FC08 for ; Thu, 1 Mar 2012 01:47:00 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id E4EF37300A; Thu, 1 Mar 2012 03:05:15 +0100 (CET) Date: Thu, 1 Mar 2012 03:05:15 +0100 From: Luigi Rizzo To: Bruce Evans Message-ID: <20120301020515.GA14996@onelab2.iet.unipi.it> References: <20120229194042.GA10921@onelab2.iet.unipi.it> <20120301071145.O879@besplex.bde.org> <20120301012315.GB14508@onelab2.iet.unipi.it> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120301012315.GB14508@onelab2.iet.unipi.it> User-Agent: Mutt/1.4.2.3i Cc: arch@FreeBSD.org Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 01:47:02 -0000 On Thu, Mar 01, 2012 at 02:23:15AM +0100, Luigi Rizzo wrote: > On Thu, Mar 01, 2012 at 11:33:46AM +1100, Bruce Evans wrote: > > On Wed, 29 Feb 2012, Luigi Rizzo wrote: > > > > >I have always been annoyed by the fact that FreeBSD rounds timeouts > > >in select/usleep/poll in very conservative ways, so i decided to > > >try how other systems behave in this respect. Attached is a simple > > >program that you should be able to compile and run on various OS > > >and see what happens. > > > > Many are broken, indeed. > > > > The simple program isn't attached. > ... 
> > > > > > > | Actual timeout > > > | select | poll | usleep| > > > timeout | FBSD | Linux | OSX | FBSD | FBSD | > > > usec | 9.0 | Vbox | 10.6 | 9.0 | 9.0 | > > > --------+-------+-------+--------+-------+-------+ > > > 1 2000 99 6 0 2000 > > > 10 2000 109 15 0 2000 > > > 50 2000 149 66 0 2000 > > > 100 2000 196 133 0 2000 > > > 500 2000 597 617 0 2000 > > > 1000 2000 1103 1136 2000 2000 > > > 1001 3000 1103 1136 2000 3000 <--- > > > 1500 3000 1608 1631 2000 3000 <--- > > > 2000 3000 2096 2127 3000 3000 > > > 2001 4000 3000 4000 <--- > > > 3001 5000 4000 5000 <--- > > > > > >Note how the rounding (poll has the timeout in milliseconds) affects > > > > You must have synced with timer interrupts to get the above. Timeouts > > yes i have -- the test code does almost nothing after returning from > a select, on a system that does some amount of work times could be > up to 1000us shorter. Still a huge error on short timeouts. > > I should also comment that these are average values on an otherwise > idle system -- i will try to post a histogram of the actual values, Below are the statistics of select() delays on my MacBook for timeouts of 1-10-50-100-500-1000-1001 us Interesting that some of the delays are actually up to 25us shorter than they should, and the average is higher than the requested value (tends to settle to 100-150us for large delays). 
> ministat -n ~/d1 ~/d10 ~/d50 ~/d100 ~/d500 ~/d1000 ~/d1001
x /home/luigi/d1
+ /home/luigi/d10
* /home/luigi/d50
% /home/luigi/d100
# /home/luigi/d500
@ /home/luigi/d1000
O /home/luigi/d1001
        N    Min    Max  Median         Avg     Stddev
x  305202      0    943       7    6.553037      2.134
+  130798      0    862      15   15.290815  2.6807354
*   30265     18   1002      66   66.083562  10.170399
%   14480     75   1072     137   138.12894  29.507796
#    3146    474   1098     656   635.87603  48.670018
@    1750    987   1924    1158   1143.2394  48.220706
O    1748    986   2337    1159   1144.4102  53.547987

cheers
luigi

From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 03:14:18 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 98566106564A for ; Thu, 1 Mar 2012 03:14:18 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail28.syd.optusnet.com.au (mail28.syd.optusnet.com.au [211.29.133.169]) by mx1.freebsd.org (Postfix) with ESMTP id 37B998FC0A for ; Thu, 1 Mar 2012 03:14:17 +0000 (UTC) Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au (c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136]) by mail28.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q213EEKr031705 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 1 Mar 2012 14:14:15 +1100 Date: Thu, 1 Mar 2012 14:14:14 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Luigi Rizzo In-Reply-To: <20120301012315.GB14508@onelab2.iet.unipi.it> Message-ID: <20120301132806.O2255@besplex.bde.org> References: <20120229194042.GA10921@onelab2.iet.unipi.it> <20120301071145.O879@besplex.bde.org> <20120301012315.GB14508@onelab2.iet.unipi.it> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: 
List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 03:14:18 -0000 On Thu, 1 Mar 2012, Luigi Rizzo wrote: > On Thu, Mar 01, 2012 at 11:33:46AM +1100, Bruce Evans wrote: >> On Wed, 29 Feb 2012, Luigi Rizzo wrote: >>> | Actual timeout >>> | select | poll | usleep| >>> timeout | FBSD | Linux | OSX | FBSD | FBSD | >>> usec | 9.0 | Vbox | 10.6 | 9.0 | 9.0 | >>> --------+-------+-------+--------+-------+-------+ >>> 1 2000 99 6 0 2000 >>> 10 2000 109 15 0 2000 >>> 50 2000 149 66 0 2000 >>> 100 2000 196 133 0 2000 >>> 500 2000 597 617 0 2000 >>> 1000 2000 1103 1136 2000 2000 >>> 1001 3000 1103 1136 2000 3000 <--- >>> 1500 3000 1608 1631 2000 3000 <--- >>> 2000 3000 2096 2127 3000 3000 >>> 2001 4000 3000 4000 <--- >>> 3001 5000 4000 5000 <--- >>> >>> Note how the rounding (poll has the timeout in milliseconds) affects >> >> You must have synced with timer interrupts to get the above. Timeouts > > yes i have -- the test code does almost nothing after returning from > a select, on a system that does some amount of work times could be > up to 1000us shorter. Still a huge error on short timeouts. I get the sync but not the rounded timeouts, on my ~5.2 kernel with HZ = 100. The times are typically 19900-19993 for rounding up 1 us to 2 ticks. > I should also comment that these are average values on an otherwise > idle system -- i will try to post a histogram of the actual values, > it might well be that osx and linux have quantized values very > different from the average (though this would violate the specs, > so i suspect instead that they have some cheap one-shot timers). > > For FreeBSD I have also rounded the bsd values (actual averages are -1/+3us > over 1sec experiments). Oh. The jitter is of minor interest, and rounding to usec should show an average of slightly less than the timeout rounded up to ticks (on an unloaded system). Bakul Shah confirmed that Linux now reprograms the timer. It has to, for a tickless kernel. FreeBSD reprograms timers too. 
I think you can set HZ large and only get timeout interrupts at that frequency if there are active timeouts that need them. Timeout granularity is still 1/HZ. Hmm, this may explain why you are getting exact n000's -- every time you ask for a timeout, you get one n000 us later (on a near-idle machine where nothing else is asking for many timeouts), while old kernels give timeouts on perfectly periodic n000(+error) boundaries; now when the syscall is made just after a boundary, the boundary for the timeout is never a full n000 away. There may be a lot of jitter for both, but if the reprogramming of the timer when you ask for a new timeout is too smart, then the jitter will average out to 0, giving perfect n000's. Try running multiple sources of new timeouts. I think a periodic itimer should produce perfectly periodic ones with little overhead. Then other timeouts should not change the periodicity or even reprogram the timer. Reprogramming on demand seems to give unwanted aperiodicity: you ask for a delay of 1 and get 2000. Suppose you actually want 2000, and actually get it relative to the request time. Then the timer must be interrupting aperiodically, with an average period of 2000+(overhead time of say 2) possibly with large jitter. So 500 of these take 1 second plus 1000 us, plus any jitter (the jitter may be negative, but is most likely positive, since when the process setting up the timeouts is preempted and nothing else is setting them up, there may be a large additional delay). I try to avoid this problem in my version of ping. I try to send a packet on every 1 second boundary. Normal ping tries to send one 1 second after the previous one, but it can't do this since it has overheads and gets preempted. With HZ=100 and rounding up and adding 1, the drift is likely to be 20 msec every second or 2%. This is quite a lot. My version tries to schedule a timeout that expires exactly 1 second after the previous packet was sent, not 1 second after the current time. 
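
A minimal sketch of that idea, not the actual ping code: keep an absolute deadline, advance it by the period each time, and sleep only for the difference between the deadline and the current time. The helper names next_deadline() and time_until() are hypothetical.

```c
#include <assert.h>
#include <sys/time.h>

/* Hypothetical helper: advance an absolute deadline by a whole period. */
static struct timeval
next_deadline(struct timeval prev, long period_sec)
{
	assert(period_sec > 0);
	prev.tv_sec += period_sec;
	return (prev);
}

/*
 * Hypothetical helper: time left from `now' until the deadline `next',
 * clamped to zero if the deadline has already passed.
 */
static struct timeval
time_until(struct timeval now, struct timeval next)
{
	struct timeval d;

	d.tv_sec = next.tv_sec - now.tv_sec;
	d.tv_usec = next.tv_usec - now.tv_usec;
	if (d.tv_usec < 0) {		/* borrow from the seconds field */
		d.tv_sec--;
		d.tv_usec += 1000000;
	}
	if (d.tv_sec < 0) {		/* deadline already passed */
		d.tv_sec = 0;
		d.tv_usec = 0;
	}
	return (d);
}
```

Because the deadline is derived from the previous deadline rather than from the wakeup time, overheads and late wakeups shorten the next sleep instead of accumulating as drift.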
It takes a simple subtraction to determine the timeout to reach the next seconds boundary, but determining the times to subtract seems to require an extra gettimeofday() call. I should use a periodic itimer and depend on it actually being periodic. The kernel must do similar things to keep periodic itimers actually periodic after it reprograms timers. There may be a lot of jitter on each reprogramming, but this can be compensated for on average. OTOH, as for skewing clocks, the compensation shouldn't go too fast in either direction. This could get complicated. I don't know what -current actually does. Bruce From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 04:45:30 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 34542106564A for ; Thu, 1 Mar 2012 04:45:30 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail30.syd.optusnet.com.au (mail30.syd.optusnet.com.au [211.29.133.193]) by mx1.freebsd.org (Postfix) with ESMTP id C21BB8FC0C for ; Thu, 1 Mar 2012 04:45:29 +0000 (UTC) Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au (c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136]) by mail30.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q214jB7Z030524 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 1 Mar 2012 15:45:16 +1100 Date: Thu, 1 Mar 2012 15:45:11 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans In-Reply-To: <20120301132806.O2255@besplex.bde.org> Message-ID: <20120301143042.F2406@besplex.bde.org> References: <20120229194042.GA10921@onelab2.iet.unipi.it> <20120301071145.O879@besplex.bde.org> <20120301012315.GB14508@onelab2.iet.unipi.it> <20120301132806.O2255@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@freebsd.org Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX X-BeenThere: 
freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 04:45:30 -0000

On Thu, 1 Mar 2012, Bruce Evans wrote:

> ...
> Bakul Shah confirmed that Linux now reprograms the timer. It has to,
> for a tickless kernel. FreeBSD reprograms timers too. I think you
> can set HZ large and only get timeout interrupts at that frequency if
> there are active timeouts that need them. Timeout granularity is still
> 1/HZ.

I tried this in -current and in a 2008 -current with hz=10000. It worked mediocrely:
- the 2008 version gave lapic cpuN: timer interrupts on all CPUs at frequency of almost exactly 10 kHz. This is the behaviour before FreeBSD reprogrammed timers (except the frequency is often off by as much as 10% due to calibration bugs). There were many anomalies in the results from the test program (like select() adding 199 usec and usleep() adding 999 usec).
- current gives cpu0: timer interrupts at a frequency of almost exactly 10115 Hz, but only when I watch it using systat over the network (10000 is Hz and the other 115 is presumably for reprogramming). The other CPU gets many fewer interrupts. When I stop watching, the rates drop towards 9900 for cpu0 and 120 for cpu1. I hoped that there would be only about 50 timer interrupts on the mostly-idle machine.
- timeout granularity according to the test program was better than expected. In almost all cases, the timeout was xx99 us. E.g., 1 becomes 200 after rounding up and adding 1 tick, and the result is 199 (since there was 1 us of overhead and no jitter). 1000 became 1099 since rounding up didn't increase it. This is almost better than the OtherOS results (since it has no jitter). I can probably easily beat OtherOS by setting hz to 100000. But I think no jitter is too good to be good.

This makes a design bug in poll() very clear.
poll() has a timeout granularity of 1 ms, so you can't even ask for timeouts of less than that. Above 1 ms, the extra 99 or 199 us is good enough, and the default of an extra 999 or 1999 us is not too bad.

A tickless kernel should have the equivalent of HZ = 0 on idle machines and the equivalent of HZ = huge when something uses lots of timeouts. The latter gives some security problems. You don't want to reprogram timers every 500 nsec when some untrusted application asks for timeouts of 1000 nsec even if the system can support it. When APIs are fixed to catch up with 1988's timespecs, it will be possible to ask for timeouts of 1 nsec and never get them but waste a lot of cycles. Scheduling is not good enough to disfavour CPU hogs that do things on the nanoseconds scale.

I just remembered that precise timeouts are just what is needed for hiding from schedulers. stathz was supposed to be significantly aperiodic and larger than hz so that CPU hogs couldn't use timeouts (based on hz) to hide from schedulers (based on stathz). This was never fully implemented in FreeBSD, and was broken many years ago. In FreeBSD, stathz was normally 128 and aperiodic, and just a little larger than hz which was normally 100. But someone broke hz to default to 1000. CPU hogs can now not so easily hide from schedulers by getting timeouts every millisecond and running for about 6 or 7 milliseconds, then sleeping for 2 or 1 millisecond to miss scheduler ticks. With larger hz, the hogs get more control. E.g., HZ = 10000 lets them sleep for only 200 or 100 usec every 7.81 msec to miss scheduler ticks.

Reprogramming of timers in -current probably gives significant jitter to timeout boundaries. This can be handled by sleeping for a slightly wider interval. Also, fine-grained timeouts allow simpler implementations of this: just wake up every tick, and if you are close to a scheduler tick (which you can predict since they are periodic), then go back to sleep for 1 timeout tick.
Since timeout ticks are short relative to scheduler ticks, you get control again soon and then don't have to sleep again for many timeout ticks. No one cares about this because CPUs are now free :-).

-current has related fixes and complications in new timer code. Even without malicious CPU hogs, basing statclock and hardclock on the same lapic timer made them too synchronous with each other. The quick fix was to use the i8254 again. This gave a small amount of asynchronicity which was apparently enough to fix the non-malicious case. I didn't like this, and tried to generate some fake asynchronicity from a single lapic timer. I think it is possible to fake it well enough for the non-malicious case. No one followed up on this. I haven't followed later developments.

Bruce

From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 05:42:48 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2C4E0106564A for ; Thu, 1 Mar 2012 05:42:48 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au [211.29.132.184]) by mx1.freebsd.org (Postfix) with ESMTP id BACF68FC14 for ; Thu, 1 Mar 2012 05:42:47 +0000 (UTC) Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au (c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136]) by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q215gD7w009742 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 1 Mar 2012 16:42:44 +1100 Date: Thu, 1 Mar 2012 16:42:13 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans In-Reply-To: <20120301143042.F2406@besplex.bde.org> Message-ID: <20120301161011.A2654@besplex.bde.org> References: <20120229194042.GA10921@onelab2.iet.unipi.it> <20120301071145.O879@besplex.bde.org> <20120301012315.GB14508@onelab2.iet.unipi.it> <20120301132806.O2255@besplex.bde.org> 
<20120301143042.F2406@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 05:42:48 -0000 On Thu, 1 Mar 2012, Bruce Evans wrote: > On Thu, 1 Mar 2012, Bruce Evans wrote: > >> ... >> Bakul Shah confirmed that Linux now reprograms the timer. It has to, >> for a tickless kernel. FreeBSD reprograms timers too. I think you >> can set HZ large and only get timeout interrupts at that frequency if >> there are active timeouts that need them. Timeout granularity is still >> 1/HZ. > > I tried this in -current and in a 2008 -current with hz=10000. It worked > mediocrely: > - the 2008 version gave lapic cpuN: timer interrupts on all CPUs at > frequency of almost exactly 10 kHz. This is the behaviour before > FreeBSD reprogrammed timers (except the frequency is often off by > as much as 10% due to calibration bugs). There were many anomolies > in the results from the test program (like select() adding 199 usec > and usleep() adding 999 usec). > - [... no surprises in -current] I tried this in -current with hz=100000. This gives (some not very surprising) behaviour: - systat claims ~100% idle, but the ~100k interrupts on 1 CPU actually reduces performance by 33% (two CPUs take 30 seconds user time to do what can be done in 20 seconds user time with hz=100). This is a normal problem with fast interrupt handlers. They need a faster interrupt handler to account for them properly. - ./prog 1 select works reasonably. It reports timeouts of 29-30 us. I expected 19-20. - ./prog 1 poll is broken as we know. It asks for timeouts of 0 and takes 3 us. - ./prog 1 usleep shows brokenness. It reports timeouts of 999 us. 
I think this is due to getnanouptime()'s brokenness. $(sysctl kern.timecounter.tick) is 100. This reduces getnanouptime()'s accuracy back to 1 msec, which explains the 999 us. But why doesn't select() have the same problem? select() uses getmicrouptime(), but it has the same brokenness. The sysctl is r/o, so I couldn't use it easily. I have changed tc_tick using ddb before, but don't want to risk reducing it by a factor of 100. The timecounter update algorithm depends on the timehands not being recycled too fast, and probably couldn't cope with recycling 100 times faster. - ./prog 1000 select and ./prog 1000 poll take 20 us extra. I expected 9-10 extra. - ./prog 1000 usleep takes 619-693 us extra. Not the full extra 100 ticks from getnanouptime() fuzziness now. - ./prog 500000 usleep takes 500026-500885 us. Even higher variance which agrees with the fuzziness better. select and poll with this timeout still have accuracy and low variance (21-26 us extra). The fuzzy versions are actually useful for optimization after all: - for long timeouts, use the fuzzy versions and accept their inaccuracies. Sleep longer by the amount of fuzziness so that sleeps are never too short. - for short timeouts, it seems necessary for the initial timestamp to be accurate. When checking if the timeout has expired, first try a fuzzy check. This is sufficient if the current fuzzy time is far from the expiry time. 
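The two-stage check suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not kernel code: fuzzy_now_us(), precise_now_us(), timeout_expired(), long_sleep_expiry() and FUZZ_US are hypothetical names, and the clocks are simulated in userland so that the logic can be exercised. The fuzzy clock plays the role of getmicrouptime() (cheap, may lag by up to one timecounter tick) and the precise clock plays the role of microuptime().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated clock state (hypothetical; a kernel would read the timehands). */
static uint64_t sim_time_us;    /* "true" uptime in microseconds */
static unsigned precise_reads;  /* how often the expensive clock was read */

/* Assumed worst-case lag of the fuzzy clock behind the true time. */
#define FUZZ_US 1000u

/* Cheap clock: never ahead of the true time, behind by less than FUZZ_US. */
static uint64_t fuzzy_now_us(void)
{
    return sim_time_us - sim_time_us % FUZZ_US;
}

/* Expensive clock: returns the true time and counts each read. */
static uint64_t precise_now_us(void)
{
    precise_reads++;
    return sim_time_us;
}

/*
 * Long timeouts: take only a fuzzy start time and pad the expiry by the
 * fuzz, so that the sleep can be too long but never too short.
 */
static uint64_t long_sleep_expiry(uint64_t fuzzy_start_us, uint64_t timeout_us)
{
    return fuzzy_start_us + timeout_us + FUZZ_US;
}

/*
 * Short timeouts: try the fuzzy clock first and fall back to the precise
 * clock only when the fuzzy time is within FUZZ_US of the expiry time.
 */
static bool timeout_expired(uint64_t expiry_us)
{
    uint64_t f = fuzzy_now_us();

    if (f >= expiry_us)            /* fuzzy time lags, so truly expired */
        return true;
    if (f + FUZZ_US <= expiry_us)  /* true time < f + FUZZ_US <= expiry */
        return false;

    uint64_t p = precise_now_us(); /* too close to call, pay for accuracy */
    assert(p >= f);                /* the fuzzy clock must never run ahead */
    return p >= expiry_us;
}
```

With the simulated time at 5500 us, a check against an expiry of 9000 us is answered by the fuzzy clock alone, while a check against 5600 us falls inside the fuzz window and needs one precise read.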
Bruce From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 11:46:55 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 0AD59106566C; Thu, 1 Mar 2012 11:46:55 +0000 (UTC) (envelope-from gleb.kurtsou@gmail.com) Received: from mail-bk0-f54.google.com (mail-bk0-f54.google.com [209.85.214.54]) by mx1.freebsd.org (Postfix) with ESMTP id 493D58FC16; Thu, 1 Mar 2012 11:46:54 +0000 (UTC) Received: by bkcjc3 with SMTP id jc3so505027bkc.13 for ; Thu, 01 Mar 2012 03:46:53 -0800 (PST) Received-SPF: pass (google.com: domain of gleb.kurtsou@gmail.com designates 10.112.10.169 as permitted sender) client-ip=10.112.10.169; Authentication-Results: mr.google.com; spf=pass (google.com: domain of gleb.kurtsou@gmail.com designates 10.112.10.169 as permitted sender) smtp.mail=gleb.kurtsou@gmail.com; dkim=pass header.i=gleb.kurtsou@gmail.com Received: from mr.google.com ([10.112.10.169]) by 10.112.10.169 with SMTP id j9mr2285243lbb.70.1330602413304 (num_hops = 1); Thu, 01 Mar 2012 03:46:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=YwAEYHPwri+M8qawFfwO1wxQGn4Fyk+2Z3ik2ui6j6c=; b=wUahYKlnXH/SmwD/aCBXzcM6OZy2jpQoAcUITlSyAelIeLK3igpYopnu6B/2pxqfj8 rYxij/kWrEKmAHNToItBzDSf6eEKfuIKjQ77cl23pXkLp0PHtIRk5iUxUuSuzbV/+Z7S kaaAARi8zt/DvRoMtjx7qzDv5PmbXrVV9+HgQ= Received: by 10.112.10.169 with SMTP id j9mr1820289lbb.70.1330600584015; Thu, 01 Mar 2012 03:16:24 -0800 (PST) Received: from localhost ([78.157.92.5]) by mx.google.com with ESMTPS id b3sm2460510lby.7.2012.03.01.03.16.22 (version=SSLv3 cipher=OTHER); Thu, 01 Mar 2012 03:16:22 -0800 (PST) Date: Thu, 1 Mar 2012 13:16:24 +0200 From: Gleb Kurtsou To: Pawel Jakub Dawidek Message-ID: <20120301111624.GB30991@reks> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> 
<20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20120225194630.GI1344@garage.freebsd.pl> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Attilio Rao , Konstantin Belousov , arch@freebsd.org Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 11:46:55 -0000 On (25/02/2012 20:46), Pawel Jakub Dawidek wrote: > On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote: > > Il 25 febbraio 2012 16:13, Pawel Jakub Dawidek ha scritto: > > > My personal opinion about rangelocks and many other VFS features we > > > currently have is that they are a good idea in theory, but in practise they > > > tend to overcomplicate VFS. > > > > > > I'm of the opinion that we should move as much stuff as we can to individual > > > file systems. We try to implement everything in VFS itself in hope that > > > this will simplify file systems we have. It then turns out only one file > > > system is really using this stuff (most of the time it is UFS) and this > > > is PITA for all the other file systems as well as maintaining VFS. VFS > > > became so complicated over the years that there are maybe only a few people > > > that can understand it, and every single change to VFS is a huge risk of > > > potentially breaking some unrelated parts. > > > > I think this is questionable for the following reasons: > > - If the problem is filesystem writers having trouble in > > understanding the necessary locking we should really provide cleaner > > and more complete documentation. 
One would think the same with our VM > > subsystem, but at least in that case there are plenty of comments that > > help understanding how to deal with vm_object, vm_pages locking during > > their lifetimes. > > Documentation is not the answer here. If the code is so complex that it is > hard to learn, then no matter how good the documentation is, it makes fewer > people willing to learn it in the first place and it makes the code more > buggy, because there are more edge/special cases you can forget about. > > > - Our primitives may be more complicated than the > > 'all-in-the-filesystem' one, but at least they offer a complete and > > centralized view over the resources we have allocated in the whole > > system and they allow building better policies about how to manage > > them. One problem I see here, is that those policies are not fully > > implemented, tuned or just got outdated, removing one of the biggest > > benefits that we have by making vnodes so generic > > Again, this is only nice theory, that is far from being the reality. > You will never be able to have control on all the resources allocated by > file systems. > > > About the thing I mentioned myself: > > - As long as the same path now has both range-locking and vnode > > locking I don't see it as a good idea to keep both separated forever. > > Merging them seems to me an important evolution (not only helping > > shrinking the number of primitives themselves but also introducing > > less overhead and likely revamped scalability for vnodes, but I think > > this needs a deep investigation). > > - About ZFS rangelocks absorbing the VFS ones, I think this is a minor > > point, but still, if you think it can be done efficiently and without > > losing performance I don't see why not do that. You already wrote > > rangelocks for ZFS, so you have gained a lot of experience in this > > area and can comment on fallouts, etc., but I don't see a good reason > > to not do that, unless it is just too difficult. 
This is not about > > generalizing a new mechanism, it is using a general mechanism in a > > specific implementation, if possible. > > I did not implement rangelocking for ZFS. It came with ZFS when I ported > it. As long as we want to merge changes from upstream (which is now IllumOS) > we don't want to make huge changes just for the sake of proving that > this is a general purpose mechanism used by more than one file system. > > Attilio, don't get me wrong. In 99% of cases it is good to make code more > general and more universal and reusable, but we can't ignore reality. > > There are reasons why file systems like XFS, ReiserFS and others were > never fully ported. I'm not saying VFS complexity was the only reason, > but I'm sure it was one of them. > > Our VFS is very UFS-centric. We make so many assumptions that sound > fine only for UFS. I saw plenty of those while working on ZFS, like: > > - "Every file system needs cache. Let's make it general, so that all file > systems can use it!" Well, for VFS each file system is a separate > entity, which is not the case for ZFS. ZFS can cache just once a block > that is used by one file system, 10 clones and 100 snapshots, > which are all separate mount points from the VFS perspective. > The same block would be cached 111 times by the buffer cache. Hmm. But this one is optional. Use vop_cachedlookup (or call cache_entry() on your own), add a number of cache_prune calls. It's pretty much the library-like design you describe below. > > - "rmdir(2) on a mountpoint is a bad idea, let's deny it at VFS level." > It is a bad idea, indeed, but in ZFS it is a nice way to remove a snapshot > by rmdiring the .zfs/snapshot/ directory. > > - No one implemented rangelocking in VFS, so no file system can use it. > Even if the given file system has all the code to do it. > > etc. > > I'm also sure it would be way easier for Jeff to make VFS MP-safe if it > was less complex. Everybody agrees that VFS needs more care. 
But there haven't been many concrete suggestions, or at least there is no VFS TODO list. > When looking at the big picture, it would be nice to have all this > general stuff like rangelocking, quota, buffer cache, etc. as some kind > of libraries for file systems to use and not something that is > mandatory. If I develop a file system for FreeBSD only and I don't want > to reinvent the wheel, I can use those libraries. If I port a file system > to FreeBSD or develop a file system that doesn't really need those > libraries I'm not forced to use them. Are you aware of a real "libraries for file systems" VFS example? It sounds very interesting but I'm afraid it's going to look good only in theory. E.g. locking at file system level (Darwin, Dragonfly BSD) looks rather messy (IMHO) and more likely to be bug prone. On the other side Linux has an optional per file system rename lock making VOP_RENAME implementation much easier, while ours is tremendously difficult to do right. > All this might make a good working group subject at BSDCan devsummit. > We could cross swords there:) Unfortunately I'm afraid I won't make it there either. And most likely will miss EuroBSD/MeetBSD 2012 in Warsaw as well. I have a number of fresh ideas about namecache I'd love to discuss. What do you think about organising a preliminary group meeting on fs@ or arch@? :) > > -- > Pawel Jakub Dawidek http://www.wheelsystems.com > FreeBSD committer http://www.FreeBSD.org > Am I Evil? Yes, I Am! 
http://tupytaj.pl From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 14:10:14 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 38250106564A; Thu, 1 Mar 2012 14:10:14 +0000 (UTC) (envelope-from BATV+c43791b1943af153f85b+3112+infradead.org+hch@bombadil.srs.infradead.org) Received: from bombadil.infradead.org (bombadil.infradead.org [IPv6:2001:4830:2446:ff00:4687:fcff:fea6:5117]) by mx1.freebsd.org (Postfix) with ESMTP id D66448FC12; Thu, 1 Mar 2012 14:10:12 +0000 (UTC) Received: from hch by bombadil.infradead.org with local (Exim 4.76 #1 (Red Hat Linux)) id 1S36hu-0004G6-VV; Thu, 01 Mar 2012 14:10:11 +0000 Date: Thu, 1 Mar 2012 09:10:10 -0500 From: Christoph Hellwig To: Gleb Kurtsou Message-ID: <20120301141010.GA7079@infradead.org> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120301111624.GB30991@reks> User-Agent: Mutt/1.5.21 (2010-09-15) X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org See http://www.infradead.org/rpr.html Cc: Attilio Rao , Konstantin Belousov , Pawel Jakub Dawidek , arch@freebsd.org Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 14:10:14 -0000 On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote: > Are you aware of a real "libraries for file systems" VFS example? It > sounds very interesting but I'm afraid it's going to look good only in > theory. E.g. 
locking at file system level (Darwin, Dragonfly BSD) looks > rather messy (IMHO) and more likely to be bug prone. On the other side > Linux has optional per file system rename lock making VOP_RENAME > implementation much easier, while ours is tremendously difficult to do > right. All namespace locking in Linux is in the VFS, and it is mandatory. A filesystem wide lock is only used for cross-directory renames. A more detailed description is here: http://git.kernel.dk/?p=linux.git;a=blob;f=Documentation/filesystems/directory-locking From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 14:14:07 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E0D921065672; Thu, 1 Mar 2012 14:14:07 +0000 (UTC) (envelope-from pawel@dawidek.net) Received: from mail.dawidek.net (60.wheelsystems.com [83.12.187.60]) by mx1.freebsd.org (Postfix) with ESMTP id 4FE0D8FC0A; Thu, 1 Mar 2012 14:14:07 +0000 (UTC) Received: from localhost (58.wheelsystems.com [83.12.187.58]) by mail.dawidek.net (Postfix) with ESMTPSA id 19F6C12D; Thu, 1 Mar 2012 15:14:05 +0100 (CET) Date: Thu, 1 Mar 2012 15:12:47 +0100 From: Pawel Jakub Dawidek To: Gleb Kurtsou Message-ID: <20120301141247.GE1336@garage.freebsd.pl> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="qp4W5+cUSnZs0RIF" Content-Disposition: inline In-Reply-To: <20120301111624.GB30991@reks> X-OS: FreeBSD 10.0-CURRENT amd64 User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Attilio Rao , Konstantin Belousov , arch@freebsd.org Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: 
List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 14:14:08 -0000 --qp4W5+cUSnZs0RIF Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote: > On (25/02/2012 20:46), Pawel Jakub Dawidek wrote: > > - "Every file system needs cache. Let's make it general, so that all file > > systems can use it!" Well, for VFS each file system is a separate > > entity, which is not the case for ZFS. ZFS can cache one block only > > once that is used by one file system, 10 clones and 100 snapshots, > > which all are separate mount points from VFS perspective. > > The same block would be cached 111 times by the buffer cache. > > Hmm. But this one is optional. Use vop_cachedlookup (or call > cache_entry() on your own), add a number of cache_prune calls. It's > pretty much library-like design you describe below. Yes, namecache is already library-like, but I was talking about the buffer cache. I managed to bypass it eventually with suggestions from ups@, but for a long time I was sure it isn't at all possible. > Everybody agrees that VFS needs more care. But there haven't been much > of concrete suggestions or at least there is no VFS TODO list. Everybody agrees on that, true, but we disagree on the direction we should move our VFS, ie. make it more light-weight vs. more heavy-weight. > > When looking at the big picture, it would be nice to have all this > > general stuff like rangelocking, quota, buffer cache, etc. as some kind > > of libraries for file systems to use and not something that is > > mandatory. If I develop a file system for FreeBSD only and I don't want > > to reinvent the wheel, I can use those libraries. If I port file system > > to FreeBSD or develop a file system that doesn't really need those > > libraries I'm not forced to use them. > > Are you aware of a real "libraries for file systems" VFS example? 
It > sounds very interesting but I'm afraid it's going to look good only in > theory. E.g. locking at file system level (Darwin, Dragonfly BSD) looks > rather messy (IMHO) and more likely to be bug prone. On the other side > Linux has optional per file system rename lock making VOP_RENAME > implementation much easier, while ours is tremendously difficult to do > right. There are not many examples of such libraries, but the namecache is one of them. Things like rangelocking definitely look like good candidates for becoming libraries. > > All this might make a good working group subject at BSDCan devsummit. > > We could cross swords there:) > > Unfortunately I'm afraid I won't make it there either. And most likely will > miss EuroBSD/MeetBSD 2012 in Warsaw as well. I have a number of fresh > ideas about namecache I'd love to discuss. What do you think about > organising preliminary group meeting on fs@ or arch@? :) Sounds good. Both forums seem suitable, just pick one. -- Pawel Jakub Dawidek http://www.wheelsystems.com FreeBSD committer http://www.FreeBSD.org Am I Evil? Yes, I Am! 
http://tupytaj.pl --qp4W5+cUSnZs0RIF Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (FreeBSD) iEYEARECAAYFAk9Pg98ACgkQForvXbEpPzSpdACfdxehVqvpgF/3wXtT3OJCIw0Z GOMAoKlqRr5LjBU7koitFf+7VGbMC6z+ =4IE/ -----END PGP SIGNATURE----- --qp4W5+cUSnZs0RIF-- From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 14:16:02 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E136D106564A; Thu, 1 Mar 2012 14:16:02 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 33CD68FC15; Thu, 1 Mar 2012 14:16:01 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q21EFsRB090647; Thu, 1 Mar 2012 16:15:54 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q21EFrpx074450; Thu, 1 Mar 2012 16:15:53 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q21EFrg1074449; Thu, 1 Mar 2012 16:15:53 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 1 Mar 2012 16:15:53 +0200 From: Konstantin Belousov To: Pawel Jakub Dawidek Message-ID: <20120301141553.GT55074@deviant.kiev.zoral.com.ua> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> <20120301141247.GE1336@garage.freebsd.pl> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="nlOp58TzLcTjnOVc" Content-Disposition: inline 
In-Reply-To: <20120301141247.GE1336@garage.freebsd.pl> User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: Attilio Rao , arch@FreeBSD.org, Gleb Kurtsou Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 14:16:03 -0000 --nlOp58TzLcTjnOVc Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Mar 01, 2012 at 03:12:47PM +0100, Pawel Jakub Dawidek wrote: > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote: > > On (25/02/2012 20:46), Pawel Jakub Dawidek wrote: > > > - "Every file system needs cache. Let's make it general, so that all file > > > systems can use it!" Well, for VFS each file system is a separate > > > entity, which is not the case for ZFS. ZFS can cache one block only > > > once that is used by one file system, 10 clones and 100 snapshots, > > > which all are separate mount points from VFS perspective. > > > The same block would be cached 111 times by the buffer cache. > > > > Hmm. But this one is optional. Use vop_cachedlookup (or call > > cache_entry() on your own), add a number of cache_prune calls. It's > > pretty much library-like design you describe below. > > Yes, namecache is already library-like, but I was talking about the > buffer cache. I managed to bypass it eventually with suggestions from > ups@, but for a long time I was sure it isn't at all possible. I am quite curious: in which way is the buffer layer mandatory? 
--nlOp58TzLcTjnOVc Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk9PhJkACgkQC3+MBN1Mb4gPEACgz/9StyTUKFfToGQFVaUgJWpq SI8An0aCnA/fz8EySQ7u1IrO3JxLSIRr =4S1J -----END PGP SIGNATURE----- --nlOp58TzLcTjnOVc-- From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 14:28:45 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id E5CBD1065670; Thu, 1 Mar 2012 14:28:45 +0000 (UTC) (envelope-from pawel@dawidek.net) Received: from mail.dawidek.net (60.wheelsystems.com [83.12.187.60]) by mx1.freebsd.org (Postfix) with ESMTP id 8F5318FC08; Thu, 1 Mar 2012 14:28:45 +0000 (UTC) Received: from localhost (58.wheelsystems.com [83.12.187.58]) by mail.dawidek.net (Postfix) with ESMTPSA id 169C413C; Thu, 1 Mar 2012 15:28:44 +0100 (CET) Date: Thu, 1 Mar 2012 15:27:27 +0100 From: Pawel Jakub Dawidek To: Konstantin Belousov Message-ID: <20120301142726.GF1336@garage.freebsd.pl> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> <20120301141247.GE1336@garage.freebsd.pl> <20120301141553.GT55074@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="KR/qxknboQ7+Tpez" Content-Disposition: inline In-Reply-To: <20120301141553.GT55074@deviant.kiev.zoral.com.ua> X-OS: FreeBSD 10.0-CURRENT amd64 User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Attilio Rao , arch@FreeBSD.org, Gleb Kurtsou Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 14:28:46 -0000 --KR/qxknboQ7+Tpez 
Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Mar 01, 2012 at 04:15:53PM +0200, Konstantin Belousov wrote: > On Thu, Mar 01, 2012 at 03:12:47PM +0100, Pawel Jakub Dawidek wrote: > > Yes, namecache is already library-like, but I was talking about the > > buffer cache. I managed to bypass it eventually with suggestions from > > ups@, but for a long time I was sure it isn't at all possible. > > I am quite curious, in which way buffer layer is mandatory ? As I said, it is not, but it took me a while to figure it out. I remember having massive problems when I was working on getting mmapped reads/writes right and bypassing the buffer cache and talking to the page cache directly. I don't think there was a single example in the tree at that time showing that it can be done. Currently tmpfs is using the same approach as ZFS, AFAIK. -- Pawel Jakub Dawidek http://www.wheelsystems.com FreeBSD committer http://www.FreeBSD.org Am I Evil? Yes, I Am! 
http://tupytaj.pl --KR/qxknboQ7+Tpez Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.18 (FreeBSD) iEYEARECAAYFAk9Ph04ACgkQForvXbEpPzS4gwCgiqSLlzrJ2LRC4FHPSOVsjCQd ZbwAn1yCaWUq3kik4zzQ+ClcPCQsUpbk =LM1U -----END PGP SIGNATURE----- --KR/qxknboQ7+Tpez-- From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 14:32:36 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 48764106564A; Thu, 1 Mar 2012 14:32:35 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com [209.85.215.54]) by mx1.freebsd.org (Postfix) with ESMTP id 82FA68FC15; Thu, 1 Mar 2012 14:32:34 +0000 (UTC) Received: by lagv3 with SMTP id v3so1128172lag.13 for ; Thu, 01 Mar 2012 06:32:33 -0800 (PST) Received-SPF: pass (google.com: domain of asmrookie@gmail.com designates 10.152.130.234 as permitted sender) client-ip=10.152.130.234; Authentication-Results: mr.google.com; spf=pass (google.com: domain of asmrookie@gmail.com designates 10.152.130.234 as permitted sender) smtp.mail=asmrookie@gmail.com; dkim=pass header.i=asmrookie@gmail.com Received: from mr.google.com ([10.152.130.234]) by 10.152.130.234 with SMTP id oh10mr5287243lab.35.1330612353335 (num_hops = 1); Thu, 01 Mar 2012 06:32:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=CF49K3DG02gnM8s0VOwA8l3F/1p+FTtEU/X2ha2/Gl0=; b=XdzsmARVScG7NzwxZEDa4bbrIMotavnW0JUFPi1dkuEiEAnIhGsibGGkGbZi3WcR/k PXHqi8ew3amXf8w+Zn/AjyeyGZ0jkkIowWeio8yp8AWLl82N/cQ1NBgKNpQjk/sahuJ5 Rm2Qm1Eza0V9nmtmdhxKz/vx/cHR4S2vxtYBE= MIME-Version: 1.0 Received: by 10.152.130.234 with SMTP id oh10mr4299652lab.35.1330612353193; Thu, 01 Mar 2012 06:32:33 -0800 (PST) Sender: asmrookie@gmail.com Received: by 10.112.41.5 with 
HTTP; Thu, 1 Mar 2012 06:32:33 -0800 (PST) In-Reply-To: <20120301141247.GE1336@garage.freebsd.pl> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> <20120301141247.GE1336@garage.freebsd.pl> Date: Thu, 1 Mar 2012 14:32:33 +0000 X-Google-Sender-Auth: W8QAl2NJ7vStiNdX2HLNoulbfYk Message-ID: From: Attilio Rao To: Pawel Jakub Dawidek Content-Type: text/plain; charset=UTF-8 Cc: Konstantin Belousov , arch@freebsd.org, Gleb Kurtsou Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 14:32:36 -0000 2012/3/1, Pawel Jakub Dawidek : > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote: >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote: >> > - "Every file system needs cache. Let's make it general, so that all >> > file >> > systems can use it!" Well, for VFS each file system is a separate >> > entity, which is not the case for ZFS. ZFS can cache one block only >> > once that is used by one file system, 10 clones and 100 snapshots, >> > which all are separate mount points from VFS perspective. >> > The same block would be cached 111 times by the buffer cache. >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call >> cache_entry() on your own), add a number of cache_prune calls. It's >> pretty much library-like design you describe below. > > Yes, namecache is already library-like, but I was talking about the > buffer cache. I managed to bypass it eventually with suggestions from > ups@, but for a long time I was sure it isn't at all possible. Can you please clarify on this as I really don't understand what you mean? > >> Everybody agrees that VFS needs more care. 
But there haven't been much >> of concrete suggestions or at least there is no VFS TODO list. > > Everybody agrees on that, true, but we disagree on the direction we > should move our VFS, ie. make it more light-weight vs. more heavy-weight. All I'm saying (and Gleb too) is that I don't see any benefit in replicating the whole vnode lifecycle at the inode level and in the filesystem-specific implementation. I don't see a simplification in the work to do, I don't think this is going to be simpler for a single specific filesystem (not to mention the legacy support, which means re-implementing inode handling for every filesystem we have now), we just lose generality. If you want a good example of a VFS primitive that was really UFS-centric and was mistakenly made generic, it is vn_start_write() and siblings. I guess it was introduced just to cater to UFS snapshot creation and then it poisoned other consumers. Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 14:36:21 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4955F1065675; Thu, 1 Mar 2012 14:36:21 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-ey0-f182.google.com (mail-ey0-f182.google.com [209.85.215.182]) by mx1.freebsd.org (Postfix) with ESMTP id 7AD078FC14; Thu, 1 Mar 2012 14:36:20 +0000 (UTC) Received: by eaaf13 with SMTP id f13so222528eaa.13 for ; Thu, 01 Mar 2012 06:36:19 -0800 (PST) Received-SPF: pass (google.com: domain of asmrookie@gmail.com designates 10.112.27.199 as permitted sender) client-ip=10.112.27.199; Authentication-Results: mr.google.com; spf=pass (google.com: domain of asmrookie@gmail.com designates 10.112.27.199 as permitted sender) smtp.mail=asmrookie@gmail.com; dkim=pass header.i=asmrookie@gmail.com Received: from mr.google.com ([10.112.27.199]) by 10.112.27.199 with SMTP id 
v7mr2458137lbg.36.1330612579401 (num_hops = 1); Thu, 01 Mar 2012 06:36:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=eGvqhipt++SM96LhvBDNx+Ta1D3DuVoerEvDM1MMmlk=; b=QuQ3durY5OIXeavgi0K9HIuoa2VVgvIeqE7iOU5x/yjpUfTAIKOnq01m8bFlr3cZdn pixX2etJAtbihQGXybE6x4D68vWY9E3vRIai2hhG/RWLVf03gp0BJZ9i+ZAqFhQUwJ1Y 35bvu+1d1PRfnelCv75ltxfm8CQGDSauFWCSY= MIME-Version: 1.0 Received: by 10.112.27.199 with SMTP id v7mr2009638lbg.36.1330612579301; Thu, 01 Mar 2012 06:36:19 -0800 (PST) Sender: asmrookie@gmail.com Received: by 10.112.41.5 with HTTP; Thu, 1 Mar 2012 06:36:19 -0800 (PST) In-Reply-To: <20120301111624.GB30991@reks> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> Date: Thu, 1 Mar 2012 14:36:19 +0000 X-Google-Sender-Auth: a0TLwnCjBFXEHM_4CRt8Ad0AM1Q Message-ID: From: Attilio Rao To: Gleb Kurtsou Content-Type: text/plain; charset=UTF-8 Cc: Konstantin Belousov , arch@freebsd.org, Jeff Roberson , Pawel Jakub Dawidek Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 14:36:21 -0000 2012/3/1, Gleb Kurtsou : > On (25/02/2012 20:46), Pawel Jakub Dawidek wrote: [snip] >> When looking at the big picture, it would be nice to have all this >> general stuff like rangelocking, quota, buffer cache, etc. as some kind >> of libraries for file systems to use and not something that is >> mandatory. If I develop a file system for FreeBSD only and I don't want >> to reinvent the wheel, I can use those libraries. 
If I port file system >> to FreeBSD or develop a file system that doesn't really need those >> libraries I'm not forced to use them. > > Are you aware of a real "libraries for file systems" VFS example? It > sounds very interesting but I'm afraid it's going to look good only in > theory. E.g. locking at file system level (Darwin, Dragonfly BSD) looks > rather messy (IMHO) and more likely to be bug prone. On the other side > Linux has optional per file system rename lock making VOP_RENAME > implementation much easier, while ours is tremendously difficult to do > right. I think Jeff (CC'ed) had fixed this (maybe only for UFS, cannot recall now) and he had a very good reason for not using Linux approach, which I don't recall now. Thanks, Attilio -- Peace can only be achieved by understanding - A. Einstein From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 14:47:20 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 430CD106566C; Thu, 1 Mar 2012 14:47:20 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id C5DB68FC19; Thu, 1 Mar 2012 14:47:19 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q21El8Yv094699; Thu, 1 Mar 2012 16:47:08 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q21El82I074728; Thu, 1 Mar 2012 16:47:08 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q21El8bh074727; Thu, 1 Mar 2012 16:47:08 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 1 
Mar 2012 16:47:08 +0200 From: Konstantin Belousov To: Attilio Rao Message-ID: <20120301144708.GV55074@deviant.kiev.zoral.com.ua> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> <20120301141247.GE1336@garage.freebsd.pl> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="u2jkDaBVK38P9/ME" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: arch@freebsd.org, Gleb Kurtsou , Pawel Jakub Dawidek Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 14:47:20 -0000 --u2jkDaBVK38P9/ME Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote: > 2012/3/1, Pawel Jakub Dawidek : > > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote: > >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote: > >> > - "Every file system needs cache. Let's make it general, so that all > >> > file > >> > systems can use it!" Well, for VFS each file system is a separate > >> > entity, which is not the case for ZFS. ZFS can cache one block only > >> > once that is used by one file system, 10 clones and 100 snapshots, > >> > which all are separate mount points from VFS perspective. > >> > The same block would be cached 111 times by the buffer cache. > >> > >> Hmm. 
But this one is optional. Use vop_cachedlookup (or call > >> cache_entry() on your own), add a number of cache_prune calls. It's > >> pretty much library-like design you describe below. > > > > Yes, namecache is already library-like, but I was talking about the > > buffer cache. I managed to bypass it eventually with suggestions from > > ups@, but for a long time I was sure it isn't at all possible. > > Can you please clarify on this as I really don't understand what you mean? > > > > >> Everybody agrees that VFS needs more care. But there haven't been much > >> of concrete suggestions or at least there is no VFS TODO list. > > > > Everybody agrees on that, true, but we disagree on the direction we > > should move our VFS, ie. make it more light-weight vs. more heavy-weight. > > All I'm saying (and Gleb too) is that I don't see any benefit in > replicating all the vnodes lifecycle at the inode level and in the > filesystem specific implementation. > I don't see a simplification in the work to do, I don't think this is > going to be simpler for a single specific filesystem (without > mentioning the legacy support, which means re-implement inode handling > for every filesystem we have now), we just lose generality. > > if you want a good example of a VFS primitive that was really > UFS-centric and it was mistakenly made generic is vn_start_write() and > siblings. I guess it was introduced just to cater UFS snapshot > creation and then it poisoned other consumers. vn_start_write() has nothing to do with filesystem code at all. It is purely VFS layer operation, which shall not be called from fs code at all. vn_start_secondary_write() is sometimes useful for the filesystem itself. Suspension (not snapshotting) is very useful and allows to avoid some nasty issues with unmounts, remounts or guaranteed syncing of the filesystem. 
The fact that only UFS utilizes this functionality just shows that other filesystem implementors do not care about this correctness, or that other filesystems are not maintained. --u2jkDaBVK38P9/ME Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk9Pi+sACgkQC3+MBN1Mb4j+DQCgzNdcihDFivaI+KoVGwEIcmRX LwMAnRAVHLgnFi+aeFHTTtPjRfwSLuQg =dNA5 -----END PGP SIGNATURE----- --u2jkDaBVK38P9/ME-- From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 14:50:42 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 552651065670; Thu, 1 Mar 2012 14:50:42 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-ee0-f54.google.com (mail-ee0-f54.google.com [74.125.83.54]) by mx1.freebsd.org (Postfix) with ESMTP id 9E3CD8FC0A; Thu, 1 Mar 2012 14:50:41 +0000 (UTC) Received: by eekd17 with SMTP id d17so238954eek.13 for ; Thu, 01 Mar 2012 06:50:40 -0800 (PST) Received-SPF: pass (google.com: domain of asmrookie@gmail.com designates 10.112.9.34 as permitted sender) client-ip=10.112.9.34; Authentication-Results: mr.google.com; spf=pass (google.com: domain of asmrookie@gmail.com designates 10.112.9.34 as permitted sender) smtp.mail=asmrookie@gmail.com; dkim=pass header.i=asmrookie@gmail.com Received: from mr.google.com ([10.112.9.34]) by 10.112.9.34 with SMTP id w2mr2505416lba.50.1330613440553 (num_hops = 1); Thu, 01 Mar 2012 06:50:40 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=mA9gX02h1E1X8c4klSuRIB+v++7vt1gKfpjb4ubkk7g=; b=lBj1Ms8fEZWLMyo4yzdIWwNkR+kmuPIxVq2bRgl2LMnjbiTPk6+pCw1lRvUEvFFZ6W 4GDm0GMNB4VetI7eGcbQzbQtNpQ9MkILxnG+rEVeO49HnDox9tomIxM8Qe2+W0IWzo0b 6bE7IARgvFvno7yFOOm4nd6vd+diSDZrNimUI= MIME-Version: 1.0 Received: by 
10.112.9.34 with SMTP id w2mr2039040lba.50.1330613440418; Thu, 01 Mar 2012 06:50:40 -0800 (PST) Sender: asmrookie@gmail.com Received: by 10.112.41.5 with HTTP; Thu, 1 Mar 2012 06:50:40 -0800 (PST) In-Reply-To: <20120301144708.GV55074@deviant.kiev.zoral.com.ua> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> <20120301141247.GE1336@garage.freebsd.pl> <20120301144708.GV55074@deviant.kiev.zoral.com.ua> Date: Thu, 1 Mar 2012 14:50:40 +0000 X-Google-Sender-Auth: 4lXrWlAYMBLYVKSXPymZfwnl-Ps Message-ID: From: Attilio Rao To: Konstantin Belousov Content-Type: text/plain; charset=UTF-8 Cc: arch@freebsd.org, Gleb Kurtsou , Pawel Jakub Dawidek Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 14:50:42 -0000 2012/3/1, Konstantin Belousov : > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote: >> 2012/3/1, Pawel Jakub Dawidek : >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote: >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote: >> >> > - "Every file system needs cache. Let's make it general, so that all >> >> > file >> >> > systems can use it!" Well, for VFS each file system is a separate >> >> > entity, which is not the case for ZFS. ZFS can cache one block only >> >> > once that is used by one file system, 10 clones and 100 snapshots, >> >> > which all are separate mount points from VFS perspective. >> >> > The same block would be cached 111 times by the buffer cache. >> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call >> >> cache_entry() on your own), add a number of cache_prune calls. It's >> >> pretty much library-like design you describe below. 
>> > >> > Yes, namecache is already library-like, but I was talking about the >> > buffer cache. I managed to bypass it eventually with suggestions from >> > ups@, but for a long time I was sure it isn't at all possible. >> >> Can you please clarify on this as I really don't understand what you mean? >> >> > >> >> Everybody agrees that VFS needs more care. But there haven't been much >> >> of concrete suggestions or at least there is no VFS TODO list. >> > >> > Everybody agrees on that, true, but we disagree on the direction we >> > should move our VFS, ie. make it more light-weight vs. more >> > heavy-weight. >> >> All I'm saying (and Gleb too) is that I don't see any benefit in >> replicating all the vnodes lifecycle at the inode level and in the >> filesystem specific implementation. >> I don't see a simplification in the work to do, I don't think this is >> going to be simpler for a single specific filesystem (without >> mentioning the legacy support, which means re-implement inode handling >> for every filesystem we have now), we just lose generality. >> >> if you want a good example of a VFS primitive that was really >> UFS-centric and it was mistakenly made generic is vn_start_write() and >> siblings. I guess it was introduced just to cater UFS snapshot >> creation and then it poisoned other consumers. > > vn_start_write() has nothing to do with filesystem code at all. > It is purely VFS layer operation, which shall not be called from fs > code at all. vn_start_secondary_write() is sometimes useful for the > filesystem itself. > > Suspension (not snapshotting) is very useful and allows to avoid some > nasty issues with unmounts, remounts or guaranteed syncing of the > filesystem. The fact that only UFS utilizes this functionality just > shows that other filesystem implementors do not care about this > correctness, or that other filesystems are not maintained. 
I'm sure that when I looked into it only UFS suspension was being touched by it and it was introduced back in the days when snapshotting was sanitized. So what are the races it is supposed to fix and other filesystems don't care about? Attilio -- Peace can only be achieved by understanding - A. Einstein From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 15:01:36 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 459E2106564A; Thu, 1 Mar 2012 15:01:36 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 79B098FC1B; Thu, 1 Mar 2012 15:01:34 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q21F1QfH096322; Thu, 1 Mar 2012 17:01:26 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q21F1PNW074838; Thu, 1 Mar 2012 17:01:25 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q21F1PYs074837; Thu, 1 Mar 2012 17:01:25 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 1 Mar 2012 17:01:25 +0200 From: Konstantin Belousov To: Attilio Rao Message-ID: <20120301150125.GX55074@deviant.kiev.zoral.com.ua> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> <20120301141247.GE1336@garage.freebsd.pl> <20120301144708.GV55074@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; 
boundary="o+ErJpKw5D0ndpyV" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: arch@freebsd.org, Gleb Kurtsou , Pawel Jakub Dawidek Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 15:01:36 -0000 --o+ErJpKw5D0ndpyV Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote: > 2012/3/1, Konstantin Belousov : > > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote: > >> 2012/3/1, Pawel Jakub Dawidek : > >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote: > >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote: > >> >> > - "Every file system needs cache. Let's make it general, so that all > >> >> > file > >> >> > systems can use it!" Well, for VFS each file system is a separate > >> >> > entity, which is not the case for ZFS. ZFS can cache one block only > >> >> > once that is used by one file system, 10 clones and 100 snapshots, > >> >> > which all are separate mount points from VFS perspective. > >> >> > The same block would be cached 111 times by the buffer cache. > >> >> > >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call > >> >> cache_entry() on your own), add a number of cache_prune calls. It's > >> >> pretty much library-like design you describe below. > >> > > >> > Yes, namecache is already library-like, but I was talking about the > >> > buffer cache. 
I managed to bypass it eventually with suggestions from > >> > ups@, but for a long time I was sure it isn't at all possible. > >> > >> Can you please clarify on this as I really don't understand what you mean? > >> > >> > > >> >> Everybody agrees that VFS needs more care. But there haven't been much > >> >> of concrete suggestions or at least there is no VFS TODO list. > >> > > >> > Everybody agrees on that, true, but we disagree on the direction we > >> > should move our VFS, ie. make it more light-weight vs. more > >> > heavy-weight. > >> > >> All I'm saying (and Gleb too) is that I don't see any benefit in > >> replicating all the vnodes lifecycle at the inode level and in the > >> filesystem specific implementation. > >> I don't see a simplification in the work to do, I don't think this is > >> going to be simpler for a single specific filesystem (without > >> mentioning the legacy support, which means re-implement inode handling > >> for every filesystem we have now), we just lose generality. > >> > >> if you want a good example of a VFS primitive that was really > >> UFS-centric and it was mistakenly made generic is vn_start_write() and > >> siblings. I guess it was introduced just to cater UFS snapshot > >> creation and then it poisoned other consumers. > > > > vn_start_write() has nothing to do with filesystem code at all. > > It is purely VFS layer operation, which shall not be called from fs > > code at all. vn_start_secondary_write() is sometimes useful for the > > filesystem itself. > > > > Suspension (not snapshotting) is very useful and allows to avoid some > > nasty issues with unmounts, remounts or guaranteed syncing of the > > filesystem. The fact that only UFS utilizes this functionality just > > shows that other filesystem implementors do not care about this > > correctness, or that other filesystems are not maintained. 
> > I'm sure that when I looked into it only UFS suspension was being > touched by it and it was introduced back in the days when snapshotting > was sanitized. > > So what are the races it is supposed to fix and other filesystems > don't care about? You cannot reliably sync the filesystem when other writers are active. So, for instance, loop over vnodes fsyncing them in unmount code can never terminate. The same is true for remounts rw->ro. One of the possible solutions there is to suspend writers. If unmount is successful, writer will get a failure from vn_start_write() call, while it will proceed normally if unmount is terminated or not started at all. Another (proper) example of suspension use is gjournal. --o+ErJpKw5D0ndpyV Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk9Pj0UACgkQC3+MBN1Mb4gzZACfYeiuRg03EuxoUfK6NjsPNMbx Gn4AoIjglsR1+n6ZBjpK4y2BFXmDd1ly =/m0G -----END PGP SIGNATURE----- --o+ErJpKw5D0ndpyV-- From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 15:11:18 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 2AB101065672; Thu, 1 Mar 2012 15:11:18 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-ee0-f54.google.com (mail-ee0-f54.google.com [74.125.83.54]) by mx1.freebsd.org (Postfix) with ESMTP id 7A0C48FC12; Thu, 1 Mar 2012 15:11:17 +0000 (UTC) Received: by eekd17 with SMTP id d17so253248eek.13 for ; Thu, 01 Mar 2012 07:11:16 -0800 (PST) Received-SPF: pass (google.com: domain of asmrookie@gmail.com designates 10.112.9.34 as permitted sender) client-ip=10.112.9.34; Authentication-Results: mr.google.com; spf=pass (google.com: domain of asmrookie@gmail.com designates 10.112.9.34 as permitted sender) smtp.mail=asmrookie@gmail.com; dkim=pass header.i=asmrookie@gmail.com Received: from mr.google.com ([10.112.9.34]) by 10.112.9.34 with SMTP 
id w2mr2545454lba.50.1330614676462 (num_hops = 1); Thu, 01 Mar 2012 07:11:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=7uAs/tmWbzC6/wr3qSuwtfYxQxRiocUfZf7/D4d583M=; b=BQ49iAOcE/voiKi4gCH8gIidDjjfXohLxk18U0k4SxxjijmjSDB1qTzgjFFIq+1MBC b77zOxS78xIGJPwQzxnHjYfzMbU44c9YV184oT0gHVgWreEIEyGf+4ovm/cLrBXQAmQ/ 9T/sPv6rmOqufpBYJebSAsoh/J6skE2t08SVQ= MIME-Version: 1.0 Received: by 10.112.9.34 with SMTP id w2mr2071242lba.50.1330614676363; Thu, 01 Mar 2012 07:11:16 -0800 (PST) Sender: asmrookie@gmail.com Received: by 10.112.41.5 with HTTP; Thu, 1 Mar 2012 07:11:16 -0800 (PST) In-Reply-To: <20120301150125.GX55074@deviant.kiev.zoral.com.ua> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> <20120301141247.GE1336@garage.freebsd.pl> <20120301144708.GV55074@deviant.kiev.zoral.com.ua> <20120301150125.GX55074@deviant.kiev.zoral.com.ua> Date: Thu, 1 Mar 2012 15:11:16 +0000 X-Google-Sender-Auth: gJ-0HKxicl_VFuRPeeNGI95A3gU Message-ID: From: Attilio Rao To: Konstantin Belousov Content-Type: text/plain; charset=UTF-8 Cc: arch@freebsd.org, Gleb Kurtsou , Pawel Jakub Dawidek Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 15:11:18 -0000 2012/3/1, Konstantin Belousov : > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote: >> 2012/3/1, Konstantin Belousov : >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote: >> >> 2012/3/1, Pawel Jakub Dawidek : >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote: >> >> >> On (25/02/2012 20:46), Pawel 
Jakub Dawidek wrote: >> >> >> > - "Every file system needs cache. Let's make it general, so that >> >> >> > all >> >> >> > file >> >> >> > systems can use it!" Well, for VFS each file system is a >> >> >> > separate >> >> >> > entity, which is not the case for ZFS. ZFS can cache one block >> >> >> > only >> >> >> > once that is used by one file system, 10 clones and 100 >> >> >> > snapshots, >> >> >> > which all are separate mount points from VFS perspective. >> >> >> > The same block would be cached 111 times by the buffer cache. >> >> >> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call >> >> >> cache_entry() on your own), add a number of cache_prune calls. It's >> >> >> pretty much library-like design you describe below. >> >> > >> >> > Yes, namecache is already library-like, but I was talking about the >> >> > buffer cache. I managed to bypass it eventually with suggestions from >> >> > ups@, but for a long time I was sure it isn't at all possible. >> >> >> >> Can you please clarify on this as I really don't understand what you >> >> mean? >> >> >> >> > >> >> >> Everybody agrees that VFS needs more care. But there haven't been >> >> >> much >> >> >> of concrete suggestions or at least there is no VFS TODO list. >> >> > >> >> > Everybody agrees on that, true, but we disagree on the direction we >> >> > should move our VFS, ie. make it more light-weight vs. more >> >> > heavy-weight. >> >> >> >> All I'm saying (and Gleb too) is that I don't see any benefit in >> >> replicating all the vnodes lifecycle at the inode level and in the >> >> filesystem specific implementation. >> >> I don't see a simplification in the work to do, I don't think this is >> >> going to be simpler for a single specific filesystem (without >> >> mentioning the legacy support, which means re-implement inode handling >> >> for every filesystem we have now), we just lose generality. 
>> >> >> >> if you want a good example of a VFS primitive that was really >> >> UFS-centric and it was mistakenly made generic is vn_start_write() and >> >> siblings. I guess it was introduced just to cater UFS snapshot >> >> creation and then it poisoned other consumers. >> > >> > vn_start_write() has nothing to do with filesystem code at all. >> > It is purely VFS layer operation, which shall not be called from fs >> > code at all. vn_start_secondary_write() is sometimes useful for the >> > filesystem itself. >> > >> > Suspension (not snapshotting) is very useful and allows to avoid some >> > nasty issues with unmounts, remounts or guaranteed syncing of the >> > filesystem. The fact that only UFS utilizes this functionality just >> > shows that other filesystem implementors do not care about this >> > correctness, or that other filesystems are not maintained. >> >> I'm sure that when I looked into it only UFS suspension was being >> touched by it and it was introduced back in the days when snapshotting >> was sanitized. >> >> So what are the races it is supposed to fix and other filesystems >> don't care about? > > You cannot reliably sync the filesystem when other writers are active. > So, for instance, loop over vnodes fsyncing them in unmount code can never > terminate. The same is true for remounts rw->ro. > > One of the possible solutions there is to suspend writers. If unmount is > successful, writer will get a failure from vn_start_write() call, while > it will proceed normally if unmount is terminated or not started at all. I don't think we implement that right now, IIRC, but it is an interesting idea. Attilio -- Peace can only be achieved by understanding - A. 
Einstein From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 15:16:51 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 88E19106566C; Thu, 1 Mar 2012 15:16:51 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id C04938FC1E; Thu, 1 Mar 2012 15:16:50 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q21FGhhU097853; Thu, 1 Mar 2012 17:16:43 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q21FGgar075099; Thu, 1 Mar 2012 17:16:42 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q21FGglh075098; Thu, 1 Mar 2012 17:16:42 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 1 Mar 2012 17:16:42 +0200 From: Konstantin Belousov To: Attilio Rao Message-ID: <20120301151642.GY55074@deviant.kiev.zoral.com.ua> References: <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> <20120301141247.GE1336@garage.freebsd.pl> <20120301144708.GV55074@deviant.kiev.zoral.com.ua> <20120301150125.GX55074@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="k/PDUuKPvLVdBXpq" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: arch@freebsd.org, Gleb Kurtsou , Pawel Jakub Dawidek Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 15:16:51 -0000 --k/PDUuKPvLVdBXpq Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote: > 2012/3/1, Konstantin Belousov : > > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote: > >> 2012/3/1, Konstantin Belousov : > >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote: > >> >> 2012/3/1, Pawel Jakub Dawidek : > >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote: > >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote: > >> >> >> > - "Every file system needs cache. Let's make it general, so that > >> >> >> > all > >> >> >> > file > >> >> >> > systems can use it!" Well, for VFS each file system is a > >> >> >> > separate > >> >> >> > entity, which is not the case for ZFS. ZFS can cache one block > >> >> >> > only > >> >> >> > once that is used by one file system, 10 clones and 100 > >> >> >> > snapshots, > >> >> >> > which all are separate mount points from VFS perspective. > >> >> >> > The same block would be cached 111 times by the buffer cache. > >> >> >> > >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call > >> >> >> cache_entry() on your own), add a number of cache_prune calls. It's > >> >> >> pretty much library-like design you describe below. > >> >> > > >> >> > Yes, namecache is already library-like, but I was talking about the > >> >> > buffer cache. 
I managed to bypass it eventually with suggestions from > >> >> > ups@, but for a long time I was sure it isn't at all possible. > >> >> > >> >> Can you please clarify on this as I really don't understand what you > >> >> mean? > >> >> > >> >> > > >> >> >> Everybody agrees that VFS needs more care. But there haven't been > >> >> >> much > >> >> >> of concrete suggestions or at least there is no VFS TODO list. > >> >> > > >> >> > Everybody agrees on that, true, but we disagree on the direction we > >> >> > should move our VFS, ie. make it more light-weight vs. more > >> >> > heavy-weight. > >> >> > >> >> All I'm saying (and Gleb too) is that I don't see any benefit in > >> >> replicating all the vnodes lifecycle at the inode level and in the > >> >> filesystem specific implementation. > >> >> I don't see a simplification in the work to do, I don't think this is > >> >> going to be simpler for a single specific filesystem (without > >> >> mentioning the legacy support, which means re-implement inode handling > >> >> for every filesystem we have now), we just lose generality. > >> >> > >> >> if you want a good example of a VFS primitive that was really > >> >> UFS-centric and it was mistakenly made generic is vn_start_write() and > >> >> siblings. I guess it was introduced just to cater UFS snapshot > >> >> creation and then it poisoned other consumers. > >> > > >> > vn_start_write() has nothing to do with filesystem code at all. > >> > It is purely VFS layer operation, which shall not be called from fs > >> > code at all. vn_start_secondary_write() is sometimes useful for the > >> > filesystem itself. > >> > > >> > Suspension (not snapshotting) is very useful and allows to avoid some > >> > nasty issues with unmounts, remounts or guaranteed syncing of the > >> > filesystem. 
The fact that only UFS utilizes this functionality just > >> > shows that other filesystem implementors do not care about this > >> > correctness, or that other filesystems are not maintained. > >> > >> I'm sure that when I looked into it only UFS suspension was being > >> touched by it and it was introduced back in the days when snapshotting > >> was sanitized. > >> > >> So what are the races it is supposed to fix and other filesystems > >> don't care about? > > > > You cannot reliably sync the filesystem when other writers are active. > > So, for instance, loop over vnodes fsyncing them in unmount code can never > > terminate. The same is true for remounts rw->ro. > > > > One of the possible solutions there is to suspend writers. If unmount is > > successful, writer will get a failure from vn_start_write() call, while > > it will proceed normally if unmount is terminated or not started at all. > > I don't think we implement that right now, IIRC, but it is an interesting idea. What don't we implement right now? Take a look at r183074 (Sep 2008). 
--k/PDUuKPvLVdBXpq Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk9PktoACgkQC3+MBN1Mb4gFwQCfaxSZ9pfQ+PsYYQmWry7vDHCp tykAnjplVq3pEMugDE19Yffjtw2mu4j3 =9++M -----END PGP SIGNATURE----- --k/PDUuKPvLVdBXpq-- From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 15:23:23 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 1963C106567B; Thu, 1 Mar 2012 15:23:23 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-lpp01m010-f54.google.com (mail-lpp01m010-f54.google.com [209.85.215.54]) by mx1.freebsd.org (Postfix) with ESMTP id 4574B8FC17; Thu, 1 Mar 2012 15:23:22 +0000 (UTC) Received: by lagv3 with SMTP id v3so1197770lag.13 for ; Thu, 01 Mar 2012 07:23:21 -0800 (PST) Received-SPF: pass (google.com: domain of asmrookie@gmail.com designates 10.152.147.202 as permitted sender) client-ip=10.152.147.202; Authentication-Results: mr.google.com; spf=pass (google.com: domain of asmrookie@gmail.com designates 10.152.147.202 as permitted sender) smtp.mail=asmrookie@gmail.com; dkim=pass header.i=asmrookie@gmail.com Received: from mr.google.com ([10.152.147.202]) by 10.152.147.202 with SMTP id tm10mr5390208lab.49.1330615401231 (num_hops = 1); Thu, 01 Mar 2012 07:23:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=AMtp+5UuzzagZLgISIGIj0nfQjIHPYkfsGjrSuQg7vU=; b=w0cUtut36Il4G6fAWRPXVuePP9zEZ7ARRw87na7PoDwR6jgQj9EUtMYZ9QeivffEDI gUR3XTFYgz96JQBBJxjEml4xR5P82j+xXF5zAYFP3lbm7wFrZ9Lcj/mUqX2iQIItlj1n DfoAErn2ntprGaoPbLL1A4h24nXIRX2J3c1EQ= MIME-Version: 1.0 Received: by 10.152.147.202 with SMTP id tm10mr4385514lab.49.1330615401154; Thu, 01 Mar 2012 07:23:21 -0800 (PST) Sender: asmrookie@gmail.com Received: by 
10.112.41.5 with HTTP; Thu, 1 Mar 2012 07:23:21 -0800 (PST) In-Reply-To: <20120301151642.GY55074@deviant.kiev.zoral.com.ua> References: <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> <20120301141247.GE1336@garage.freebsd.pl> <20120301144708.GV55074@deviant.kiev.zoral.com.ua> <20120301150125.GX55074@deviant.kiev.zoral.com.ua> <20120301151642.GY55074@deviant.kiev.zoral.com.ua> Date: Thu, 1 Mar 2012 15:23:21 +0000 X-Google-Sender-Auth: uqNAXcAOSIaEColkq71YVStiA18 Message-ID: From: Attilio Rao To: Konstantin Belousov Content-Type: text/plain; charset=UTF-8 Cc: arch@freebsd.org, Gleb Kurtsou , Pawel Jakub Dawidek Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 15:23:23 -0000 2012/3/1, Konstantin Belousov : > On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote: >> 2012/3/1, Konstantin Belousov : >> > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote: >> >> 2012/3/1, Konstantin Belousov : >> >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote: >> >> >> 2012/3/1, Pawel Jakub Dawidek : >> >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote: >> >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote: >> >> >> >> > - "Every file system needs cache. Let's make it general, so >> >> >> >> > that >> >> >> >> > all >> >> >> >> > file >> >> >> >> > systems can use it!" Well, for VFS each file system is a >> >> >> >> > separate >> >> >> >> > entity, which is not the case for ZFS. ZFS can cache one >> >> >> >> > block >> >> >> >> > only >> >> >> >> > once that is used by one file system, 10 clones and 100 >> >> >> >> > snapshots, >> >> >> >> > which all are separate mount points from VFS perspective. 
>> >> >> >> > The same block would be cached 111 times by the buffer cache. >> >> >> >> >> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call >> >> >> >> cache_entry() on your own), add a number of cache_prune calls. >> >> >> >> It's >> >> >> >> pretty much library-like design you describe below. >> >> >> > >> >> >> > Yes, namecache is already library-like, but I was talking about >> >> >> > the >> >> >> > buffer cache. I managed to bypass it eventually with suggestions >> >> >> > from >> >> >> > ups@, but for a long time I was sure it isn't at all possible. >> >> >> >> >> >> Can you please clarify on this as I really don't understand what you >> >> >> mean? >> >> >> >> >> >> > >> >> >> >> Everybody agrees that VFS needs more care. But there haven't been >> >> >> >> much >> >> >> >> of concrete suggestions or at least there is no VFS TODO list. >> >> >> > >> >> >> > Everybody agrees on that, true, but we disagree on the direction >> >> >> > we >> >> >> > should move our VFS, ie. make it more light-weight vs. more >> >> >> > heavy-weight. >> >> >> >> >> >> All I'm saying (and Gleb too) is that I don't see any benefit in >> >> >> replicating all the vnodes lifecycle at the inode level and in the >> >> >> filesystem specific implementation. >> >> >> I don't see a semplification in the work to do, I don't think this >> >> >> is >> >> >> going to be simpler for a single specific filesystem (without >> >> >> mentioning the legacy support, which means re-implement inode >> >> >> handling >> >> >> for every filesystem we have now), we just loose generality. >> >> >> >> >> >> if you want a good example of a VFS primitive that was really >> >> >> UFS-centric and it was mistakenly made generic is vn_start_write() >> >> >> and >> >> >> sibillings. I guess it was introduced just to cater UFS snapshot >> >> >> creation and then it poisoned other consumers. >> >> > >> >> > vn_start_write() has nothing to do with filesystem code at all. 
>> >> > It is purely VFS layer operation, which shall not be called from fs >> >> > code at all. vn_start_secondary_write() is sometimes useful for the >> >> > filesystem itself. >> >> > >> >> > Suspension (not snapshotting) is very useful and allows to avoid some >> >> > nasty issues with unmounts, remounts or guaranteed syncing of the >> >> > filesystem. The fact that only UFS utilizes this functionality just >> >> > shows that other filesystem implementors do not care about this >> >> > correctness, or that other filesystems are not maintained. >> >> >> >> I'm sure that when I looked into it only UFS suspension was being >> >> touched by it and it was introduced back in the days when snapshotting >> >> was sanitized. >> >> >> >> So what are the races it is supposed to fix and other filesystems >> >> don't care about? >> > >> > You cannot reliably sync the filesystem when other writers are active. >> > So, for instance, loop over vnodes fsyncing them in unmount code can >> > never >> > terminate. The same is true for remounts rw->ro. >> > >> > One of the possible solution there is to suspend writers. If unmount is >> > successfull, writer will get a failure from vn_start_write() call, while >> > it will proceed normal if unmount is terminated or not started at all. >> >> I don't think we implement that right now, IIRC, but it is an interesting >> idea. > > What don't we implement right now ? Take a look at r183074 (Sep 2008). Ah sorry, I looked into it before 2008 effectively (and that also reminds me why I stopped working on removing that primitive from VFS and make it UFS specific one) :) However why we cannot make a fix like that in domount()/dounmount() directly for every R/W filesystem? Attilio -- Peace can only be achieved by understanding - A. 
Einstein From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 15:35:58 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id B2A4E1065676; Thu, 1 Mar 2012 15:35:58 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mail.zoral.com.ua (mx0.zoral.com.ua [91.193.166.200]) by mx1.freebsd.org (Postfix) with ESMTP id 0C5398FC1E; Thu, 1 Mar 2012 15:35:57 +0000 (UTC) Received: from skuns.kiev.zoral.com.ua (localhost [127.0.0.1]) by mail.zoral.com.ua (8.14.2/8.14.2) with ESMTP id q21FZfl6099329; Thu, 1 Mar 2012 17:35:41 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: from deviant.kiev.zoral.com.ua (kostik@localhost [127.0.0.1]) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5) with ESMTP id q21FZfFQ075215; Thu, 1 Mar 2012 17:35:41 +0200 (EET) (envelope-from kostikbel@gmail.com) Received: (from kostik@localhost) by deviant.kiev.zoral.com.ua (8.14.5/8.14.5/Submit) id q21FZfFu075214; Thu, 1 Mar 2012 17:35:41 +0200 (EET) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: deviant.kiev.zoral.com.ua: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 1 Mar 2012 17:35:41 +0200 From: Konstantin Belousov To: Attilio Rao Message-ID: <20120301153541.GZ55074@deviant.kiev.zoral.com.ua> References: <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> <20120301141247.GE1336@garage.freebsd.pl> <20120301144708.GV55074@deviant.kiev.zoral.com.ua> <20120301150125.GX55074@deviant.kiev.zoral.com.ua> <20120301151642.GY55074@deviant.kiev.zoral.com.ua> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="JjEBCMAGNkRv8xbT" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.4.2.3i X-Virus-Scanned: clamav-milter 0.95.2 at skuns.kiev.zoral.com.ua X-Virus-Status: Clean X-Spam-Status: No, score=-4.0 required=5.0 tests=ALL_TRUSTED,AWL,BAYES_00 autolearn=ham version=3.2.5 
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on skuns.kiev.zoral.com.ua Cc: arch@freebsd.org, Gleb Kurtsou , Pawel Jakub Dawidek Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 15:35:58 -0000

--JjEBCMAGNkRv8xbT
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Mar 01, 2012 at 03:23:21PM +0000, Attilio Rao wrote:
> 2012/3/1, Konstantin Belousov :
> > On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote:
> >> 2012/3/1, Konstantin Belousov :
> >> > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote:
> >> >> 2012/3/1, Konstantin Belousov :
> >> >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
> >> >> >> 2012/3/1, Pawel Jakub Dawidek :
> >> >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
> >> >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
> >> >> >> >> > - "Every file system needs cache. Let's make it general, so
> >> >> >> >> > that
> >> >> >> >> > all
> >> >> >> >> > file
> >> >> >> >> > systems can use it!" Well, for VFS each file system is a
> >> >> >> >> > separate
> >> >> >> >> > entity, which is not the case for ZFS. ZFS can cache one
> >> >> >> >> > block
> >> >> >> >> > only
> >> >> >> >> > once that is used by one file system, 10 clones and 100
> >> >> >> >> > snapshots,
> >> >> >> >> > which all are separate mount points from VFS perspective.
> >> >> >> >> > The same block would be cached 111 times by the buffer cache.
> >> >> >> >>
> >> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
> >> >> >> >> cache_entry() on your own), add a number of cache_prune calls.
> >> >> >> >> It's
> >> >> >> >> pretty much library-like design you describe below.
> >> >> >> >
> >> >> >> > Yes, namecache is already library-like, but I was talking about
> >> >> >> > the
> >> >> >> > buffer cache. I managed to bypass it eventually with suggestions
> >> >> >> > from
> >> >> >> > ups@, but for a long time I was sure it isn't at all possible.
> >> >> >>
> >> >> >> Can you please clarify on this as I really don't understand what you
> >> >> >> mean?
> >> >> >>
> >> >> >> >
> >> >> >> >> Everybody agrees that VFS needs more care. But there haven't been
> >> >> >> >> much
> >> >> >> >> of concrete suggestions or at least there is no VFS TODO list.
> >> >> >> >
> >> >> >> > Everybody agrees on that, true, but we disagree on the direction
> >> >> >> > we
> >> >> >> > should move our VFS, ie. make it more light-weight vs. more
> >> >> >> > heavy-weight.
> >> >> >>
> >> >> >> All I'm saying (and Gleb too) is that I don't see any benefit in
> >> >> >> replicating all the vnodes lifecycle at the inode level and in the
> >> >> >> filesystem specific implementation.
> >> >> >> I don't see a semplification in the work to do, I don't think this
> >> >> >> is
> >> >> >> going to be simpler for a single specific filesystem (without
> >> >> >> mentioning the legacy support, which means re-implement inode
> >> >> >> handling
> >> >> >> for every filesystem we have now), we just loose generality.
> >> >> >>
> >> >> >> if you want a good example of a VFS primitive that was really
> >> >> >> UFS-centric and it was mistakenly made generic is vn_start_write()
> >> >> >> and
> >> >> >> sibillings. I guess it was introduced just to cater UFS snapshot
> >> >> >> creation and then it poisoned other consumers.
> >> >> >
> >> >> > vn_start_write() has nothing to do with filesystem code at all.
> >> >> > It is purely VFS layer operation, which shall not be called from fs
> >> >> > code at all. vn_start_secondary_write() is sometimes useful for the
> >> >> > filesystem itself.
> >> >> >
> >> >> > Suspension (not snapshotting) is very useful and allows to avoid some
> >> >> > nasty issues with unmounts, remounts or guaranteed syncing of the
> >> >> > filesystem. The fact that only UFS utilizes this functionality just
> >> >> > shows that other filesystem implementors do not care about this
> >> >> > correctness, or that other filesystems are not maintained.
> >> >>
> >> >> I'm sure that when I looked into it only UFS suspension was being
> >> >> touched by it and it was introduced back in the days when snapshotting
> >> >> was sanitized.
> >> >>
> >> >> So what are the races it is supposed to fix and other filesystems
> >> >> don't care about?
> >> >
> >> > You cannot reliably sync the filesystem when other writers are active.
> >> > So, for instance, loop over vnodes fsyncing them in unmount code can
> >> > never
> >> > terminate. The same is true for remounts rw->ro.
> >> >
> >> > One of the possible solution there is to suspend writers. If unmount is
> >> > successfull, writer will get a failure from vn_start_write() call, while
> >> > it will proceed normal if unmount is terminated or not started at all.
> >>
> >> I don't think we implement that right now, IIRC, but it is an interesting
> >> idea.
> >
> > What don't we implement right now ? Take a look at r183074 (Sep 2008).
>
> Ah sorry, I looked into it before 2008 effectively (and that also
> reminds me why I stopped working on removing that primitive from VFS
> and make it UFS specific one) :)
>
> However why we cannot make a fix like that in domount()/dounmount()
> directly for every R/W filesystem?

At least, the filesystem needs to implement the VFS_SUSP_CLEAN VFS op.
The purpose of the operation is to clean up after suspension; e.g. in
the UFS case, VFS_SUSP_CLEAN removes unlinked files whose reference
count went to 0 during suspension, and processes delayed atime updates.
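To make the suspension semantics from the quoted text concrete, here is a toy user-space sketch in Python. It is not the kernel code from r183074 and not a real VFS interface: the class ToyMount and its methods start_write(), finish_write(), suspend() and resume() are hypothetical names that only model the behaviour described above, where a writer arriving during suspension sleeps and then either gets a failure (unmount succeeded) or proceeds normally (unmount was abandoned).

```python
import threading


class ToyMount:
    """Toy model of write suspension (hypothetical names, not the kernel API)."""

    def __init__(self):
        self._cv = threading.Condition()
        self._suspended = False
        self._unmounted = False
        self._writers = 0

    def start_write(self):
        """Return True if the write may proceed, False if the mount is gone."""
        with self._cv:
            # New writers sleep for as long as the filesystem is suspended.
            while self._suspended:
                self._cv.wait()
            if self._unmounted:
                return False
            self._writers += 1
            return True

    def finish_write(self):
        """Mark one in-flight write as done and wake a waiting suspender."""
        with self._cv:
            self._writers -= 1
            self._cv.notify_all()

    def suspend(self):
        """Block new writers, then wait for in-flight writers to drain."""
        with self._cv:
            self._suspended = True
            while self._writers > 0:
                self._cv.wait()

    def resume(self, unmounted):
        """End suspension; unmounted=True models a successful unmount.

        A real filesystem would do its post-suspension cleanup around here.
        """
        with self._cv:
            self._suspended = False
            self._unmounted = unmounted
            self._cv.notify_all()


def demo(unmount_succeeds):
    """Run one writer against a suspended mount and return what it saw."""
    mp = ToyMount()
    results = []
    mp.suspend()
    writer = threading.Thread(target=lambda: results.append(mp.start_write()))
    writer.start()
    # The sync loop would run here: no new writer can dirty anything,
    # so a loop over all vnodes is guaranteed to terminate.
    mp.resume(unmounted=unmount_succeeds)
    writer.join()
    return results[0]


print("unmount succeeded ->", demo(True))
print("unmount abandoned ->", demo(False))
```

The sketch leaves out the parts that make the real thing hard, namely the special rights of the unmount thread and any helper threads the filesystem uses for its own i/o.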

Another issue that I see is the handling of filesystems that offload i/o to
several threads. The unmount thread is given special rights to perform
i/o while the filesystem is suspended, but VFS cannot know about other
threads that shall be permitted to perform writes.

At least those are the two issues that I remember appearing while applying
suspension to UFS unmount.

With all these complications, suspension is provided in the form of a
library for use by filesystem implementors, and not as a mandatory feature
of VFS.

--JjEBCMAGNkRv8xbT Content-Type: application/pgp-signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iEYEARECAAYFAk9Pl00ACgkQC3+MBN1Mb4i61gCfbNMsO6TQXa6gYB73u/0gKYjf leIAnRYbWi3DKaiOQD1fRnXzYM/gxM3b =h3Yh -----END PGP SIGNATURE----- --JjEBCMAGNkRv8xbT--
From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 16:45:12 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D9866106564A; Thu, 1 Mar 2012 16:45:12 +0000 (UTC) (envelope-from gleb.kurtsou@gmail.com) Received: from mail-ee0-f54.google.com (mail-ee0-f54.google.com [74.125.83.54]) by mx1.freebsd.org (Postfix) with ESMTP id 974E58FC1A; Thu, 1 Mar 2012 16:45:11 +0000 (UTC) Received: by eekd17 with SMTP id d17so314788eek.13 for ; Thu, 01 Mar 2012 08:45:10 -0800 (PST) Received-SPF: pass (google.com: domain of gleb.kurtsou@gmail.com designates 10.112.84.1 as permitted sender) client-ip=10.112.84.1; Authentication-Results: mr.google.com; spf=pass (google.com: domain of gleb.kurtsou@gmail.com designates 10.112.84.1 as permitted sender) smtp.mail=gleb.kurtsou@gmail.com; dkim=pass header.i=gleb.kurtsou@gmail.com Received: from mr.google.com ([10.112.84.1]) by 10.112.84.1 with SMTP id u1mr2739670lby.35.1330620310745 (num_hops = 1); Thu, 01 Mar 2012 08:45:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=7S898NWJKgzv2C1Ylz/B72zxBZnpVCk2CQ3ZJMPYCJY=; b=cR6WU+teca3cDe8EeyS4nc9Xkk7XAH2XX09zhYQate7WojlLzWZmU6aM1kLV2EPkm+ JUbhxmBHvJ3Il9sYRRs1Uq3DuzbsoqEcq54noXuvoUPevQGY5BQQzkF1KIOrPUQAYNmH AqWfdvrQYLokuUAwrLN+CDZYPaKAqN5ksZpsQ= Received: by 10.112.84.1 with SMTP id u1mr2248604lby.35.1330620310553; Thu, 01 Mar 2012 08:45:10 -0800 (PST) Received: from localhost ([78.157.92.5]) by mx.google.com with ESMTPS id f2sm3661105lbw.5.2012.03.01.08.45.09 (version=SSLv3 cipher=OTHER); Thu, 01 Mar 2012 08:45:09 -0800 (PST) Date: Thu, 1 Mar 2012 18:45:11 +0200 From: Gleb Kurtsou To: Christoph Hellwig Message-ID: <20120301164511.GA1501@reks> References: <20120203193719.GB3283@deviant.kiev.zoral.com.ua> <20120225151334.GH1344@garage.freebsd.pl> <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> <20120301141010.GA7079@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20120301141010.GA7079@infradead.org> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: Attilio Rao , Konstantin Belousov , Pawel Jakub Dawidek , arch@freebsd.org Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 16:45:12 -0000 On (01/03/2012 09:10), Christoph Hellwig wrote: > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote: > > Are you aware of a real "libraries for file systems" VFS example? It > > sounds very interesting but I'm afraid it's going to look good only in > > theory. E.g. locking at file system level (Darwin, Dragonfly BSD) looks > > rather messy (IMHO) and more likely to be bug prone. 
On the other side > > Linux has optional per file system rename lock making VOP_RENAME > > implementation much easier, while ours is tremendously difficult to do > > right. > > All namespace locking in Linux is in the VFS, and it mandatory. A > filesystem wide lock is only used for cross-directory renames. > > A more detailed description is here: > > http://git.kernel.dk/?p=linux.git;a=blob;f=Documentation/filesystems/directory-locking > My bad. I thought s_vfs_rename_mutex can be optional. Quite unfortunate linux doesn't support concurrent cross-directory renames :) Thanks, Gleb. From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 17:05:02 2012 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4E141106564A; Thu, 1 Mar 2012 17:05:02 +0000 (UTC) (envelope-from asmrookie@gmail.com) Received: from mail-vx0-f182.google.com (mail-vx0-f182.google.com [209.85.220.182]) by mx1.freebsd.org (Postfix) with ESMTP id C9B5A8FC13; Thu, 1 Mar 2012 17:05:01 +0000 (UTC) Received: by vcbfl15 with SMTP id fl15so821083vcb.13 for ; Thu, 01 Mar 2012 09:05:01 -0800 (PST) Received-SPF: pass (google.com: domain of asmrookie@gmail.com designates 10.52.99.169 as permitted sender) client-ip=10.52.99.169; Authentication-Results: mr.google.com; spf=pass (google.com: domain of asmrookie@gmail.com designates 10.52.99.169 as permitted sender) smtp.mail=asmrookie@gmail.com; dkim=pass header.i=asmrookie@gmail.com Received: from mr.google.com ([10.52.99.169]) by 10.52.99.169 with SMTP id er9mr9140528vdb.126.1330621501308 (num_hops = 1); Thu, 01 Mar 2012 09:05:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type; bh=FFAhnUi55uRUY3tqk6a/YHI48a5bXe4jvv+rNuyj3uE=; b=Gxm0UCPxGjPU38mVxYBbylHdr+PPePT9zXJtuT9+AIFf/R0ywwFXg5NwaTk92Lri+h 
B2eqnYlZkxJDoW7E7xjUE0uXSDtwNbCJqJlogeTR6jLV2Ij2vGnv4OWBynHr0w6z73TO QoRoGLyR6Yp0uwAHrQ3o1VDG4UHpZZLGydWgY= MIME-Version: 1.0 Received: by 10.52.99.169 with SMTP id er9mr7754144vdb.126.1330621501096; Thu, 01 Mar 2012 09:05:01 -0800 (PST) Sender: asmrookie@gmail.com Received: by 10.220.38.72 with HTTP; Thu, 1 Mar 2012 09:05:00 -0800 (PST) In-Reply-To: <20120301153541.GZ55074@deviant.kiev.zoral.com.ua> References: <20120225194630.GI1344@garage.freebsd.pl> <20120301111624.GB30991@reks> <20120301141247.GE1336@garage.freebsd.pl> <20120301144708.GV55074@deviant.kiev.zoral.com.ua> <20120301150125.GX55074@deviant.kiev.zoral.com.ua> <20120301151642.GY55074@deviant.kiev.zoral.com.ua> <20120301153541.GZ55074@deviant.kiev.zoral.com.ua> Date: Thu, 1 Mar 2012 17:05:00 +0000 X-Google-Sender-Auth: o5p0MzBSUTUxIxtcQxgyi0p3mjw Message-ID: From: Attilio Rao To: Konstantin Belousov Content-Type: text/plain; charset=UTF-8 Cc: arch@freebsd.org, Gleb Kurtsou , Pawel Jakub Dawidek Subject: Re: Prefaulting for i/o buffers X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 17:05:02 -0000 2012/3/1, Konstantin Belousov : > On Thu, Mar 01, 2012 at 03:23:21PM +0000, Attilio Rao wrote: >> 2012/3/1, Konstantin Belousov : >> > On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote: >> >> 2012/3/1, Konstantin Belousov : >> >> > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote: >> >> >> 2012/3/1, Konstantin Belousov : >> >> >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote: >> >> >> >> 2012/3/1, Pawel Jakub Dawidek : >> >> >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote: >> >> >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote: >> >> >> >> >> > - "Every file system needs cache. 
Let's make it general, so >> >> >> >> >> > that >> >> >> >> >> > all >> >> >> >> >> > file >> >> >> >> >> > systems can use it!" Well, for VFS each file system is a >> >> >> >> >> > separate >> >> >> >> >> > entity, which is not the case for ZFS. ZFS can cache one >> >> >> >> >> > block >> >> >> >> >> > only >> >> >> >> >> > once that is used by one file system, 10 clones and 100 >> >> >> >> >> > snapshots, >> >> >> >> >> > which all are separate mount points from VFS perspective. >> >> >> >> >> > The same block would be cached 111 times by the buffer >> >> >> >> >> > cache. >> >> >> >> >> >> >> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call >> >> >> >> >> cache_entry() on your own), add a number of cache_prune calls. >> >> >> >> >> It's >> >> >> >> >> pretty much library-like design you describe below. >> >> >> >> > >> >> >> >> > Yes, namecache is already library-like, but I was talking about >> >> >> >> > the >> >> >> >> > buffer cache. I managed to bypass it eventually with >> >> >> >> > suggestions >> >> >> >> > from >> >> >> >> > ups@, but for a long time I was sure it isn't at all possible. >> >> >> >> >> >> >> >> Can you please clarify on this as I really don't understand what >> >> >> >> you >> >> >> >> mean? >> >> >> >> >> >> >> >> > >> >> >> >> >> Everybody agrees that VFS needs more care. But there haven't >> >> >> >> >> been >> >> >> >> >> much >> >> >> >> >> of concrete suggestions or at least there is no VFS TODO list. >> >> >> >> > >> >> >> >> > Everybody agrees on that, true, but we disagree on the >> >> >> >> > direction >> >> >> >> > we >> >> >> >> > should move our VFS, ie. make it more light-weight vs. more >> >> >> >> > heavy-weight. >> >> >> >> >> >> >> >> All I'm saying (and Gleb too) is that I don't see any benefit in >> >> >> >> replicating all the vnodes lifecycle at the inode level and in >> >> >> >> the >> >> >> >> filesystem specific implementation. 
>> >> >> >> I don't see a semplification in the work to do, I don't think >> >> >> >> this >> >> >> >> is >> >> >> >> going to be simpler for a single specific filesystem (without >> >> >> >> mentioning the legacy support, which means re-implement inode >> >> >> >> handling >> >> >> >> for every filesystem we have now), we just loose generality. >> >> >> >> >> >> >> >> if you want a good example of a VFS primitive that was really >> >> >> >> UFS-centric and it was mistakenly made generic is >> >> >> >> vn_start_write() >> >> >> >> and >> >> >> >> sibillings. I guess it was introduced just to cater UFS snapshot >> >> >> >> creation and then it poisoned other consumers. >> >> >> > >> >> >> > vn_start_write() has nothing to do with filesystem code at all. >> >> >> > It is purely VFS layer operation, which shall not be called from >> >> >> > fs >> >> >> > code at all. vn_start_secondary_write() is sometimes useful for >> >> >> > the >> >> >> > filesystem itself. >> >> >> > >> >> >> > Suspension (not snapshotting) is very useful and allows to avoid >> >> >> > some >> >> >> > nasty issues with unmounts, remounts or guaranteed syncing of the >> >> >> > filesystem. The fact that only UFS utilizes this functionality >> >> >> > just >> >> >> > shows that other filesystem implementors do not care about this >> >> >> > correctness, or that other filesystems are not maintained. >> >> >> >> >> >> I'm sure that when I looked into it only UFS suspension was being >> >> >> touched by it and it was introduced back in the days when >> >> >> snapshotting >> >> >> was sanitized. >> >> >> >> >> >> So what are the races it is supposed to fix and other filesystems >> >> >> don't care about? >> >> > >> >> > You cannot reliably sync the filesystem when other writers are >> >> > active. >> >> > So, for instance, loop over vnodes fsyncing them in unmount code can >> >> > never >> >> > terminate. The same is true for remounts rw->ro. 
>> >> > >> >> > One of the possible solution there is to suspend writers. If unmount >> >> > is >> >> > successfull, writer will get a failure from vn_start_write() call, >> >> > while >> >> > it will proceed normal if unmount is terminated or not started at >> >> > all. >> >> >> >> I don't think we implement that right now, IIRC, but it is an >> >> interesting >> >> idea. >> > >> > What don't we implement right now ? Take a look at r183074 (Sep 2008). >> >> Ah sorry, I looked into it before 2008 effectively (and that also >> reminds me why I stopped working on removing that primitive from VFS >> and make it UFS specific one) :) >> >> However why we cannot make a fix like that in domount()/dounmount() >> directly for every R/W filesystem? > At least, the filesystem needs to implement the VFS_SUSP_CLEAN VFS op. > The purpose of the operation is to clean up after suspension, e.g. > in the UFS case, VFS_SUSP_CLEAN removes unlinked files which reference > count went to 0 during suspension, as well as process delayed atime > updating. > > Another issue that I see is handling of filesystems that offload i/o to > several threads. The unmount thread is given special rights to perform > i/o while filesystem is suspended, but VFS cannot know about other threads > that shall be permitted to perform writes. > > At least those are two issues that appeared during applying the suspension > to UFS unmount and which I remember. > > With all this complications, suspension is provided in a form of library > for use by filesystem implementors, and not as a mandatory feature of VFS. It makes sense, thanks for explaining the issues you found while implementing this trick on UFS. Attilio -- Peace can only be achieved by understanding - A. 
Einstein From owner-freebsd-arch@FreeBSD.ORG Fri Mar 2 17:28:58 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 460BD1065670 for ; Fri, 2 Mar 2012 17:28:58 +0000 (UTC) (envelope-from johnandsara2@cox.net) Received: from eastrmfepi108.cox.net (eastrmfepi108.cox.net [68.230.241.204]) by mx1.freebsd.org (Postfix) with ESMTP id D4C068FC12 for ; Fri, 2 Mar 2012 17:28:57 +0000 (UTC) Received: from eastrmimpo210.cox.net ([68.230.241.225]) by eastrmfepo101.cox.net (InterMail vM.8.01.04.00 201-2260-137-20101110) with ESMTP id <20120302170314.VUUA18243.eastrmfepo101.cox.net@eastrmimpo210.cox.net> for ; Fri, 2 Mar 2012 12:03:14 -0500 Received: from [192.168.3.22] ([70.177.172.35]) by eastrmimpo210.cox.net with bizsmtp id gh3D1i0030mAvba02h3DbX; Fri, 02 Mar 2012 12:03:13 -0500 X-CT-Class: Clean X-CT-Score: 0.00 X-CT-RefID: str=0001.0A02020A.4F50FD51.01D4,ss=1,re=0.000,fgs=0 X-CT-Spam: 0 X-Authority-Analysis: v=1.1 cv=SwD/Y8GpRdONdm5z1I4vXlgMxpglwSfl+jzqXqLOMWM= c=1 sm=1 a=f5xKl4ys9bwA:10 a=j9h4hM69ZBMA:10 a=G8Uczd0VNMoA:10 a=Wajolswj7cQA:10 a=8nJEP1OIZ-IA:10 a=alU6Bxxa4qBWIf+k8j/ISQ==:17 a=FP58Ms26AAAA:8 a=efqDOYLTNkgstY12pRcA:9 a=oRuVXIjJWgrHfGDOSYoA:7 a=wPNLvfGTeEIA:10 a=alU6Bxxa4qBWIf+k8j/ISQ==:117 X-CM-Score: 0.00 Authentication-Results: cox.net; none Message-ID: <4F50FD4D.9000106@cox.net> Date: Fri, 02 Mar 2012 12:03:09 -0500 From: "John D. Hendrickson and Sara Darnell" User-Agent: Thunderbird 2.0.0.24 (X11/20100228) MIME-Version: 1.0 To: freebsd-arch@freebsd.org References: <4E18ABB1.4010304@cox.net> <20110709194639.GA4914@elie> <4E18EE60.7010402@cox.net> <20110710151354.GA25475@r500-debian> In-Reply-To: <20110710151354.GA25475@r500-debian> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: dep-trace v. 
tsort (mac ports depends support) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: johnandsara2@cox.net List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 02 Mar 2012 17:28:58 -0000

Hi,

I believe BSD and Apple still need tsort(1) for portage.

Topological sorting isn't quite right for packaging.

Please see:

http://sourceforge.net/projects/dep-trace

It is a "drop-in" replacement (it operates like /bin/tsort) but is right
for pkg depends (i.e., for portage: you need to download the source, a
particular compile order may be required, and you sometimes get a "missing"
message or a "loop in depends" message when attempting to compile and
install a pkg).

I'm a Debian user, but I wish I had a BSD machine :) So I do not know a lot
of BSD maintainer / mailing list specifics. Please give me a handicap
there!

Thanks and thanks again,

John

p.s. dep-trace itself has no depends (it is a /bin program), has
improvements, and is "more hackable" than tsort when it comes to coding new
ordering rules against lists, which I feel is not as easy with tsort's
"loop detected attempting to recover" behaviour.

From owner-freebsd-arch@FreeBSD.ORG Fri Mar 2 17:38:51 2012 Return-Path: Delivered-To: arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B46EE106566B for ; Fri, 2 Mar 2012 17:38:51 +0000 (UTC) (envelope-from das@FreeBSD.ORG) Received: from zim.MIT.EDU (ZIM.MIT.EDU [18.95.3.101]) by mx1.freebsd.org (Postfix) with ESMTP id 734898FC17 for ; Fri, 2 Mar 2012 17:38:51 +0000 (UTC) Received: from zim.MIT.EDU (localhost [127.0.0.1]) by zim.MIT.EDU (8.14.5/8.14.2) with ESMTP id q22HGl6G030007; Fri, 2 Mar 2012 12:16:47 -0500 (EST) (envelope-from das@FreeBSD.ORG) Received: (from das@localhost) by zim.MIT.EDU (8.14.5/8.14.2/Submit) id q22HGlXU030006; Fri, 2 Mar 2012 12:16:47 -0500 (EST) (envelope-from das@FreeBSD.ORG) Date: Fri, 2 Mar 2012 12:16:47 -0500 From: David Schultz To: Dag-Erling =?iso-8859-1?Q?Sm=F8rgrav?= Message-ID: <20120302171647.GA29850@zim.MIT.EDU> Mail-Followup-To: Dag-Erling =?iso-8859-1?Q?Sm=F8rgrav?= , Garrett Wollman , arch@freebsd.org References: <4F3C2D2D.5000402@FreeBSD.org> <4F3E78BA.4060203@FreeBSD.org> <864nupcuvl.fsf@ds4.des.no> <4F3E7B5A.20103@FreeBSD.org> <86zkchbff6.fsf@ds4.des.no> <4F3EADB5.7060008@FreeBSD.org> <20120223170918.GA79013@zim.MIT.EDU> <201202231822.q1NIMQOd020804@hergotha.csail.mit.edu> <201202231926.q1NJQPFa021654@hergotha.csail.mit.edu> <86d3958cqi.fsf@ds4.des.no> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <86d3958cqi.fsf@ds4.des.no> Cc: arch@FreeBSD.ORG, Garrett Wollman Subject: Re: bsd/citrus iconv X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 02 Mar 2012 17:38:51 -0000

On Thu, Feb 23, 2012, Dag-Erling Smørgrav wrote:
> Garrett Wollman writes:
> > You missed the bit on the next page:
> >
> > It is unspecified
whether the libraries libc.a, libm.a, > > librt.a, libpthread.a, libl.a, liby.a, or libxnet exist as > > regular files. The implementation may accept as -l operands > > names of objects that do not exist as regular files. > > That's entirely academic unless you want to modify gcc and clang to > automatically pull in libiconv. The point is that if the iconv > extension is implemented, it must be available without requiring > additional -l options. If the linker included libiconv automatically, would it be possible to switch iconv implementations without recompiling, by using libmap.conf? Or is the ABI (e.g., type of iconv_t) incompatible? If the ABI is different, then we might as well stick iconv in libc using weak symbols. > It all boils down to this: do we aspire to SUS conformance? I think it actually boils down to what the practical benefit is. Does it create a compatibility nightmare for apps to have to use the -liconv flag? Do other platforms require it? IIRC, we've been patching ports to include the flag for years. From owner-freebsd-arch@FreeBSD.ORG Fri Mar 2 18:31:41 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 2BF2C106566B for ; Fri, 2 Mar 2012 18:31:41 +0000 (UTC) (envelope-from ed@hoeg.nl) Received: from mx0.hoeg.nl (mx0.hoeg.nl [IPv6:2a01:4f8:101:5343::aa]) by mx1.freebsd.org (Postfix) with ESMTP id B86FB8FC14 for ; Fri, 2 Mar 2012 18:31:40 +0000 (UTC) Received: by mx0.hoeg.nl (Postfix, from userid 1000) id EE4AB2A28CEE; Fri, 2 Mar 2012 19:31:38 +0100 (CET) Date: Fri, 2 Mar 2012 19:31:38 +0100 From: Ed Schouten To: "John D. 
Hendrickson and Sara Darnell" Message-ID: <20120302183138.GC32748@hoeg.nl> References: <4E18ABB1.4010304@cox.net> <20110709194639.GA4914@elie> <4E18EE60.7010402@cox.net> <20110710151354.GA25475@r500-debian> <4F50FD4D.9000106@cox.net> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="cN+O50sc7gZAK+8F" Content-Disposition: inline In-Reply-To: <4F50FD4D.9000106@cox.net> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-arch@freebsd.org Subject: Re: dep-trace v. tsort (mac ports depends support) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 02 Mar 2012 18:31:41 -0000 --cN+O50sc7gZAK+8F Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Hi John, * John D. Hendrickson and Sara Darnell , 20120302 18:= 03: > BSD and Apple needs tsort(1) for portage still I believe. > > Topological sorting isn't quite right packaging. > > [...] > > (ie, for portage: you need to dl source, order of compile may be > required, sometimes gets missing message or "loop in depends" message > when attempting to compile and install pkg) But wait. Isn't this because of mis-use of tsort(1) by portage? tsort(1) can give you any ordering you like, as long as you make sure your input graph is correct. 
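
To illustrate the point about the input graph: tsort(1) reads pairs of
names and prints an order in which the first name of every pair comes
before the second. A minimal sketch of the same idea in Python follows,
assuming Python 3.9 or later for the standard graphlib module. The
function name tsort_pairs and the package names are hypothetical, and
this is not a description of how portage or dep-trace work.

```python
from graphlib import TopologicalSorter, CycleError


def tsort_pairs(text):
    """Order nodes so that, for every input pair "a b", a comes before b."""
    tokens = text.split()
    if len(tokens) % 2:
        raise ValueError("input must contain an even number of tokens")
    sorter = TopologicalSorter()
    for before, after in zip(tokens[0::2], tokens[1::2]):
        sorter.add(before)             # make sure the node exists
        if before != after:            # "x x" only declares x, as in tsort(1)
            sorter.add(after, before)  # 'after' depends on 'before'
    return list(sorter.static_order())  # raises CycleError on a loop


# Hypothetical package names: each pair reads "dependency dependent".
edges = """
libc zlib
libc openssl
zlib curl
openssl curl
"""
order = tsort_pairs(edges)
```

If the pairs describe the real dependencies, any order returned is a
valid build order. A "loop" report (CycleError here) means the input
itself contains a cycle, which is a property of the graph rather than
of the sorting.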
--=20 Ed Schouten WWW: http://80386.nl/ --cN+O50sc7gZAK+8F Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iQIcBAEBAgAGBQJPURIKAAoJEG5e2P40kaK7pTIQAKKfXXfZPI6k8LKV/TBW09f9 Kzv4wWbbSkNn/1kQ/1VYIBKGIKkuIP6kTjpZ1DrlpfrTTt99iVj329rNrrwrZZ6W AUKnLkA7ddy4/sqRcMCeV0m8Z1QkCprgeVFuQ+Fr9RYIaVEhJkuE0FXTEJuctZ0k Ol30KCrPVeCoNY89iarXv3/DKOKmWbF7MAVXtTXt2ucL8oWWAu/nLsXd5MFQi4HF vjOi/D8nVaP4p/2fssPEBBT2U37LVwKL9uVcHm6FhJByJeA4GvcOgJTGUEpLvlSU STn7BRV9NFBsd07Kaid8csPO7HTfCLYMpZuy6wsfLHjX/ghjNLs6DF8n/A9x6WhA MK8DXyGGcUY8V2YTZ0EmWvurryi/RfcFsWOAj3/BtKUw158lnIn293FFGiitdegt ZEI1Y7P7Ap2H3tihEQ5JLjk6xaAHPWvSaWb4oISd48V9kkLC0TJH+sDuyww6/CCD ZR0ZnrWhY/4ptkcNuIDb+xxtJinqw2lFAut6I4HP0SYb0ehQmhYzLDsOz9vxKGUL +Fgh4fdZHiiEF2KNY0iY62plaGMJkqncpb4ecRUXShibuMBeEqu4EifofSwFuOK7 4pwo1whiS/SFBgzF5vacERVn7WUOXir0ipt7Qu7zD4ZrlaaaQmV6VnL4NkdY9KRu dtwgAuIADMuljFJk8PtC =x52f -----END PGP SIGNATURE----- --cN+O50sc7gZAK+8F-- From owner-freebsd-arch@FreeBSD.ORG Fri Mar 2 22:25:30 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2B1641065678 for ; Fri, 2 Mar 2012 22:25:30 +0000 (UTC) (envelope-from jilles@stack.nl) Received: from mx1.stack.nl (relay04.stack.nl [IPv6:2001:610:1108:5010::107]) by mx1.freebsd.org (Postfix) with ESMTP id 78A5C8FC26 for ; Fri, 2 Mar 2012 22:25:29 +0000 (UTC) Received: from snail.stack.nl (snail.stack.nl [IPv6:2001:610:1108:5010::131]) by mx1.stack.nl (Postfix) with ESMTP id 350B41DD633; Fri, 2 Mar 2012 23:25:27 +0100 (CET) Received: by snail.stack.nl (Postfix, from userid 1677) id 1B3C728470; Fri, 2 Mar 2012 23:25:27 +0100 (CET) Date: Fri, 2 Mar 2012 23:25:27 +0100 From: Jilles Tjoelker To: Sergey Kandaurov Message-ID: <20120302222526.GB6416@stack.nl> References: <4F4DC876.3010809@delphij.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: 
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: d@delphij.net, freebsd-arch@freebsd.org
Subject: Re: RFC: futimens(2) and utimensat(2)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
X-List-Received-Date: Fri, 02 Mar 2012 22:25:30 -0000

On Wed, Feb 29, 2012 at 02:21:23PM +0300, Sergey Kandaurov wrote:
> On 29 February 2012 10:40, Xin Li wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA256
>
> > These are required by IEEE Std 1003.1-2008.  Patchset at:
>
> > http://people.freebsd.org/~delphij/for_review/utimens.diff
>
> First, thank you very much for doing this.
>
> ERRORS section for utimes(2) is still not updated (not exists).
> Funny but that was the most difficult part to implement these
> syscalls a year ago with the great help from jilles@.
> He could further comment on your patchset.
>
> Otherwise looks good and pretty similar to my work, though
> I didn't use a "const" modifier in my version for both functions
> and syscall definitions in syscall.master for some reasons.
>
> Further I wrote a test to see how properly implementation detects
> EACCES/EPERM with different UTIME_OMIT/UTIME_NOW passed. It shall pass
> all tests as shown in the table (stolen somewhere from austingroup):
>
>     [a]      [b]       [c]
>     times    file      file
>     arg.     UID       is
>     NULL     owner     writable     Result
>     !NULL    !owner    !writable
>
>     N        o         w            success
>     N        o         !w           success
>     N        !o        w            success
>     N        !o        !w           EACCES [1]
>     !N       o         w            success
>     !N       o         !w           success
>     !N       !o        w            EPERM [2]
>     !N       !o        !w           EPERM [3]
>
> Here NULL also covers cases when:
> - both fields are UTIME_NULL
> - both fields are UTIME_OMIT.

If both fields are UTIME_NOW, this shall be the same as a NULL pointer.
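
The table quoted above, together with the rule that two UTIME_NOW fields
count as a NULL pointer, can be written down as a small decision
function. This is only a sketch in Python for checking test
expectations, not code from either patch. The name expected_result and
its boolean arguments are made up here, and the superuser is not
modeled.

```python
import errno


def expected_result(times_is_null, is_owner, is_writable):
    """Return None for success, or the errno value the table expects.

    times_is_null is True for a NULL times pointer and, per the reply,
    also when both fields are UTIME_NOW.
    """
    if is_owner:
        return None          # the owner may always set the timestamps
    if not times_is_null:
        return errno.EPERM   # explicit times require ownership
    if is_writable:
        return None          # "set to now" only needs write access
    return errno.EACCES


# Evaluate all eight rows of the table: (times NULL, owner, writable).
rows = [(n, o, w)
        for n in (True, False)
        for o in (True, False)
        for w in (True, False)]
results = {row: expected_result(*row) for row in rows}
```

A test program along the lines Sergey describes could loop over these
rows and compare the errno it observes with the value stored in results.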
If both fields are UTIME_OMIT, the timestamps remain unchanged; no
permission check shall be performed for the file itself but may be
performed for the path prefix (an earlier patch from pluknet returned
success immediately).

Otherwise, the above is correct. Note that if one field is UTIME_NOW and
the other is UTIME_OMIT, there is no special case: the caller must be
owner or root.

-- 
Jilles Tjoelker

From owner-freebsd-arch@FreeBSD.ORG Sat Mar 3 06:24:44 2012
Return-Path:
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 5845E1065674; Sat, 3 Mar 2012 06:24:44 +0000 (UTC) (envelope-from tim@kientzle.com)
Received: from monday.kientzle.com (99-115-135-74.uvs.sntcca.sbcglobal.net [99.115.135.74]) by mx1.freebsd.org (Postfix) with ESMTP id 3064C8FC0C; Sat, 3 Mar 2012 06:24:44 +0000 (UTC)
Received: (from root@localhost) by monday.kientzle.com (8.14.4/8.14.4) id q235PWPE069176; Sat, 3 Mar 2012 05:25:32 GMT (envelope-from tim@kientzle.com)
Received: from [192.168.2.119] (CiscoE3000 [192.168.1.65]) by kientzle.com with SMTP id najztczxenqiu28mc823p34z56; Sat, 03 Mar 2012 05:25:32 +0000 (UTC) (envelope-from tim@kientzle.com)
Mime-Version: 1.0 (Apple Message framework v1257)
Content-Type: text/plain; charset=windows-1252
From: Tim Kientzle
In-Reply-To: <20120302171647.GA29850@zim.MIT.EDU>
Date: Fri, 2 Mar 2012 21:25:32 -0800
Content-Transfer-Encoding: quoted-printable
Message-Id:
References: <4F3C2D2D.5000402@FreeBSD.org> <4F3E78BA.4060203@FreeBSD.org> <864nupcuvl.fsf@ds4.des.no> <4F3E7B5A.20103@FreeBSD.org> <86zkchbff6.fsf@ds4.des.no> <4F3EADB5.7060008@FreeBSD.org> <20120223170918.GA79013@zim.MIT.EDU> <201202231822.q1NIMQOd020804@hergotha.csail.mit.edu> <201202231926.q1NJQPFa021654@hergotha.csail.mit.edu> <86d3958cqi.fsf@ds4.des.no> <20120302171647.GA29850@zim.MIT.EDU>
To: David Schultz
X-Mailer: Apple Mail (2.1257)
Cc: Dag-Erling Smørgrav, Garrett Wollman, arch@freebsd.org
Subject: Re: bsd/citrus iconv
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
X-List-Received-Date: Sat, 03 Mar 2012 06:24:44 -0000

On Mar 2, 2012, at 9:16 AM, David Schultz wrote:

> On Thu, Feb 23, 2012, Dag-Erling Smørgrav wrote:
>> Garrett Wollman writes:
>>> You missed the bit on the next page:
>>>
>>>     It is unspecified whether the libraries libc.a, libm.a,
>>>     librt.a, libpthread.a, libl.a, liby.a, or libxnet exist as
>>>     regular files. The implementation may accept as -l operands
>>>     names of objects that do not exist as regular files.
>>
>> That's entirely academic unless you want to modify gcc and clang to
>> automatically pull in libiconv. The point is that if the iconv
>> extension is implemented, it must be available without requiring
>> additional -l options.
>
> If the linker included libiconv automatically, would it be
> possible to switch iconv implementations without recompiling, by
> using libmap.conf? Or is the ABI (e.g., type of iconv_t)
> incompatible? ….

Very incompatible. The functions actually have different names
in the library. So switching implementations via library
mapping is not going to work.
Tim From owner-freebsd-arch@FreeBSD.ORG Sat Mar 3 12:48:12 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id B2F8A106564A; Sat, 3 Mar 2012 12:48:12 +0000 (UTC) (envelope-from rmh.aybabtu@gmail.com) Received: from mail-wi0-f182.google.com (mail-wi0-f182.google.com [209.85.212.182]) by mx1.freebsd.org (Postfix) with ESMTP id D17548FC17; Sat, 3 Mar 2012 12:48:11 +0000 (UTC) Received: by wibhn6 with SMTP id hn6so1448630wib.13 for ; Sat, 03 Mar 2012 04:48:11 -0800 (PST) Received-SPF: pass (google.com: domain of rmh.aybabtu@gmail.com designates 10.180.99.100 as permitted sender) client-ip=10.180.99.100; Authentication-Results: mr.google.com; spf=pass (google.com: domain of rmh.aybabtu@gmail.com designates 10.180.99.100 as permitted sender) smtp.mail=rmh.aybabtu@gmail.com; dkim=pass header.i=rmh.aybabtu@gmail.com Received: from mr.google.com ([10.180.99.100]) by 10.180.99.100 with SMTP id ep4mr3839297wib.7.1330778891004 (num_hops = 1); Sat, 03 Mar 2012 04:48:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=hah0jQlc7/2D5VHQmuISp5eIqbGpXACkxnud/Wcy1u4=; b=F4Z6R2hIvCENnlUaKswTirjNOyUXjJmITpooBMt3xUNAJUdtaExLkpOYfTsee0a08E ZxoWD7E/hS75fmiESEe7p4Z5mNQFo2a8w+AiYXMRLjx2L/IDlOLIYHA+5UZV487scE4y MgekF9x4ED15Z8Xm54la/KXC9zFIidfOS6JvEy6k+/8bf73Ll+c/LbJ8iKhhLfdS2w51 p+qTXtqhOo08UNBZ5OndBL6hATIeloc4R87rGUqlFwMVHc++PSBXRhsAO0Qx1aQJWe8p u1Z7E3s53JsxZUz2D6xH723xRT431V5Y5fIQGl2kaRrod38tXRQicWvd/QZ/RhDaNsdN zO+A== Received: by 10.180.99.100 with SMTP id ep4mr3026961wib.7.1330778890878; Sat, 03 Mar 2012 04:48:10 -0800 (PST) Received: from thorin (7.Red-81-38-33.dynamicIP.rima-tde.net. 
[81.38.33.7]) by mx.google.com with ESMTPS id gp8sm9306580wib.5.2012.03.03.04.48.07 (version=TLSv1/SSLv3 cipher=OTHER); Sat, 03 Mar 2012 04:48:09 -0800 (PST)
Sender: Robert Millan
Received: from rmh by thorin with local (Exim 4.72) (envelope-from ) id 1S3oNZ-0001IX-WC; Sat, 03 Mar 2012 13:48:06 +0100
Date: Sat, 3 Mar 2012 13:48:05 +0100
From: Robert Millan
To: Hans Petter Selasky
Message-ID: <20120303124805.GA4725@thorin>
References: <201202181720.27135.hselasky@c2i.net>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="ZfOjI3PrQbgiZnxM"
Content-Disposition: inline
In-Reply-To: <201202181720.27135.hselasky@c2i.net>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: Kostik Belousov, Adrian Chadd, freebsd-usb@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: Exclude USB drivers from main kernel image?
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture
X-List-Received-Date: Sat, 03 Mar 2012 12:48:12 -0000

--ZfOjI3PrQbgiZnxM
Content-Type: multipart/mixed; boundary="EeQfGwPcQSOJBaQU"
Content-Disposition: inline

--EeQfGwPcQSOJBaQU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Sat, Feb 18, 2012 at 05:20:27PM +0100, Hans Petter Selasky wrote:
> The /etc/devd/usb.conf is regularly updated, though not automatically. It
> should auto-load most kind of devices. Only additional case that comes to mind
> is that USB serial console will not be active until devd has executed, if that
> is enabled.

If early USB serial output is desired, it can be enabled by enabling the
module in bootloader. Is that an acceptable trade-off?

> Your patch looks OK. Adding ARCH @
>
> Instead of commenting out, I would just remove those lines.

Here's a new patch that removes the lines instead of commenting them out.
Consistently with that, it also removes a few lines which were already
commented out, using the same criteria.

Also, it disables a few more USB drivers. Due to an oversight my previous
patch didn't disable all drivers that devd can handle.

Patch is tested with "make universe" on HEAD.

-- 
Robert Millan

--EeQfGwPcQSOJBaQU
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="usb.diff"
Content-Transfer-Encoding: quoted-printable

Index: sys/amd64/conf/GENERIC
===================================================================
--- sys/amd64/conf/GENERIC (revision 232404)
+++ sys/amd64/conf/GENERIC (working copy)
@@ -302,39 +302,8 @@
 device ehci # EHCI PCI->USB interface (USB 2.0)
 device xhci # XHCI PCI->USB interface (USB 3.0)
 device usb # USB Bus (required)
-#device udbp # USB Double Bulk Pipe devices (needs netgraph)
-device uhid # "Human Interface Devices"
 device ukbd # Keyboard
-device ulpt # Printer
 device umass # Disks/Mass storage - Requires scbus and da
-device ums # Mouse
-device urio # Diamond Rio 500 MP3 player
-# USB Serial devices
-device u3g # USB-based 3G modems (Option, Huawei, Sierra)
-device uark # Technologies ARK3116 based serial adapters
-device ubsa # Belkin F5U103 and compatible serial adapters
-device uftdi # For FTDI usb serial adapters
-device uipaq # Some WinCE based devices
-device uplcom # Prolific PL-2303 serial adapters
-device uslcom # SI Labs CP2101/CP2102 serial adapters
-device uvisor # Visor and Palm devices
-device uvscom # USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-device aue # ADMtek USB Ethernet
-device axe # ASIX Electronics USB Ethernet
-device cdce # Generic USB over Ethernet
-device cue # CATC USB Ethernet
-device kue # Kawasaki LSI USB Ethernet
-device rue # RealTek RTL8150 USB Ethernet
-device udav # Davicom DM9601E USB
-# USB Wireless
-device rum # Ralink Technology RT2501USB wireless NICs
-device run # Ralink Technology RT2700/RT2800/RT3000 NICs.
-device uath # Atheros AR5523 wireless NICs
-device upgt # Conexant/Intersil PrismGT wireless NICs.
-device ural # Ralink Technology RT2500USB wireless NICs
-device urtw # Realtek RTL8187B/L wireless NICs
-device zyd # ZyDAS zd1211/zd1211b wireless NICs
 
 # FireWire support
 device firewire # FireWire bus code
@@ -350,7 +319,6 @@
 device snd_es137x # Ensoniq AudioPCI ES137x
 device snd_hda # Intel High Definition Audio
 device snd_ich # Intel, NVidia and other ICH AC'97 Audio
-device snd_uaudio # USB Audio
 device snd_via8233 # VIA VT8233x Audio
 
 # MMC/SD
Index: sys/arm/conf/KB920X
===================================================================
--- sys/arm/conf/KB920X (revision 232404)
+++ sys/arm/conf/KB920X (working copy)
@@ -99,34 +99,8 @@
 options USB_DEBUG # enable debug msgs
 device ohci # OHCI localbus->USB interface
 device usb # USB Bus (required)
-#device udbp # USB Double Bulk Pipe devices
-device uhid # "Human Interface Devices"
-device ulpt # Printer
 device umass # Disks/Mass storage - Requires scbus and da
-device urio # Diamond Rio 500 MP3 player
-# USB Serial devices
-device uark # Technologies ARK3116 based serial adapters
-device ubsa # Belkin F5U103 and compatible serial adapters
-device uftdi # For FTDI usb serial adapters
-device uipaq # Some WinCE based devices
-device uplcom # Prolific PL-2303 serial adapters
-device uslcom # SI Labs CP2101/CP2102 serial adapters
-device uvisor # Visor and Palm devices
-device uvscom # USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-device miibus
-device aue # ADMtek USB Ethernet
-device axe # ASIX Electronics USB Ethernet
-device cdce # Generic USB over Ethernet
-device cue # CATC USB Ethernet
-device kue # Kawasaki LSI USB Ethernet
-device rue # RealTek RTL8150 USB Ethernet
-device udav # Davicom DM9601E USB
-# USB Wireless
-device rum # Ralink Technology RT2501USB wireless NICs
-device uath # Atheros AR5523 wireless NICs
-device ural # Ralink Technology RT2500USB wireless NICs
-device zyd # ZyDAS zd1211/zd1211b wireless NICs
+device miibus # Required for USB Ethernet
 # SCSI peripherals
 device scbus # SCSI bus (required for SCSI)
 device da # Direct Access (disks)
Index: sys/arm/conf/QILA9G20
===================================================================
--- sys/arm/conf/QILA9G20 (revision 232404)
+++ sys/arm/conf/QILA9G20 (working copy)
@@ -124,26 +124,8 @@
 device ohci # OHCI localbus->USB interface
 device usb # USB Bus (required)
 device umass # Disks/Mass storage - Requires scbus and da
-device uhid # "Human Interface Devices"
-#device ulpt # Printer
-#device udbp # USB Double Bulk Pipe devices
+device miibus # Required for USB Ethernet
 
-# USB Ethernet, requires miibus
-device miibus
-#device aue # ADMtek USB Ethernet
-#device axe # ASIX Electronics USB Ethernet
-#device cdce # Generic USB over Ethernet
-#device cue # CATC USB Ethernet
-#device kue # Kawasaki LSI USB Ethernet
-#device rue # RealTek RTL8150 USB Ethernet
-device udav # Davicom DM9601E USB
-
-# USB Wireless
-#device rum # Ralink Technology RT2501USB wireless NICs
-#device uath # Atheros AR5523 wireless NICs
-#device ural # Ralink Technology RT2500USB wireless NICs
-#device zyd # ZyDAS zd1211/zd1211b wireless NICs
-
 # Wireless NIC cards
 #device wlan # 802.11 support
 #device wlan_wep # 802.11 WEP support
Index: sys/arm/conf/HL200
===================================================================
--- sys/arm/conf/HL200 (revision 232404)
+++ sys/arm/conf/HL200 (working copy)
@@ -98,35 +98,9 @@
 options USB_DEBUG # enable debug msgs
 device ohci # OHCI localbus->USB interface
 device usb # USB Bus (required)
-#device udbp # USB Double Bulk Pipe devices
-device uhid # "Human Interface Devices"
-device ulpt # Printer
 device umass # Disks/Mass storage - Requires scbus and da
-device urio # Diamond Rio 500 MP3 player
-# USB Serial devices
-device uark # Technologies ARK3116 based serial adapters
-device ubsa # Belkin F5U103 and compatible serial adapters
 #device ubser # not yet converted.
-device uftdi # For FTDI usb serial adapters
-device uipaq # Some WinCE based devices
-device uplcom # Prolific PL-2303 serial adapters
-device uslcom # SI Labs CP2101/CP2102 serial adapters
-device uvisor # Visor and Palm devices
-device uvscom # USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-device miibus
-device aue # ADMtek USB Ethernet
-device axe # ASIX Electronics USB Ethernet
-device cdce # Generic USB over Ethernet
-device cue # CATC USB Ethernet
-device kue # Kawasaki LSI USB Ethernet
-device rue # RealTek RTL8150 USB Ethernet
-device udav # Davicom DM9601E USB
-# USB Wireless
-device rum # Ralink Technology RT2501USB wireless NICs
-device uath # Atheros AR5523 wireless NICs
-device ural # Ralink Technology RT2500USB wireless NICs
-device zyd # ZyDAS zd1211/zd1211b wireless NICs
+device miibus # Required for USB Ethernet
 # SCSI peripherals
 device scbus # SCSI bus (required for SCSI)
 device da # Direct Access (disks)
Index: sys/arm/conf/HL201
===================================================================
--- sys/arm/conf/HL201 (revision 232404)
+++ sys/arm/conf/HL201 (working copy)
@@ -99,25 +99,8 @@
 # USB support
 #device ohci # OHCI localbus->USB interface
 device usb # USB Bus (required)
-#device udbp # USB Double Bulk Pipe devices
-device uhid # "Human Interface Devices"
-#device ulpt # Printer
 device umass # Disks/Mass storage - Requires scbus and da
-
-# USB Ethernet, requires miibus
-device miibus
-#device aue # ADMtek USB Ethernet
-#device axe # ASIX Electronics USB Ethernet
-#device cdce # Generic USB over Ethernet
-#device cue # CATC USB Ethernet
-#device kue # Kawasaki LSI USB Ethernet
-#device rue # RealTek RTL8150 USB Ethernet
-device udav # Davicom DM9601E USB
-# USB Wireless
-#device rum # Ralink Technology RT2501USB wireless NICs
-#device uath # Atheros AR5523 wireless NICs
-#device ural # Ralink Technology RT2500USB wireless NICs
-#device zyd # ZyDAS zd1211/zd1211b wireless NICs
+device miibus # Required for USB Ethernet
 # SCSI peripherals
 device scbus # SCSI bus (required for SCSI)
 device da # Direct Access (disks)
Index: sys/arm/conf/SAM9G20EK
===================================================================
--- sys/arm/conf/SAM9G20EK (revision 232404)
+++ sys/arm/conf/SAM9G20EK (working copy)
@@ -124,26 +124,8 @@
 device ohci # OHCI localbus->USB interface
 device usb # USB Bus (required)
 device umass # Disks/Mass storage - Requires scbus and da
-device uhid # "Human Interface Devices"
-#device ulpt # Printer
-#device udbp # USB Double Bulk Pipe devices
+device miibus # Required for USB Ethernet
 
-# USB Ethernet, requires miibus
-device miibus
-#device aue # ADMtek USB Ethernet
-#device axe # ASIX Electronics USB Ethernet
-#device cdce # Generic USB over Ethernet
-#device cue # CATC USB Ethernet
-#device kue # Kawasaki LSI USB Ethernet
-#device rue # RealTek RTL8150 USB Ethernet
-device udav # Davicom DM9601E USB
-
-# USB Wireless
-#device rum # Ralink Technology RT2501USB wireless NICs
-#device uath # Atheros AR5523 wireless NICs
-#device ural # Ralink Technology RT2500USB wireless NICs
-#device zyd # ZyDAS zd1211/zd1211b wireless NICs
-
 # Wireless NIC cards
 #device wlan # 802.11 support
 #device wlan_wep # 802.11 WEP support
Index: sys/i386/conf/XBOX
===================================================================
--- sys/i386/conf/XBOX (revision 232404)
+++ sys/i386/conf/XBOX (working copy)
@@ -80,20 +80,10 @@
 #device uhci # UHCI PCI->USB interface
 device ohci # OHCI PCI->USB interface
 device usb # USB Bus (required)
-device uhid # "Human Interface Devices"
 device ukbd # Keyboard
-device ulpt # Printer
 device umass # Disks/Mass storage - Requires scbus and da
-device ums # Mouse
-device urio # Diamond Rio 500 MP3 player
 
 device miibus
-device aue # ADMtek USB Ethernet
-device axe # ASIX Electronics USB Ethernet
-device cdce # Generic USB over Ethernet
-device cue # CATC USB Ethernet
-device kue # Kawasaki LSI USB Ethernet
-device rue # RealTek RTL8150 USB Ethernet
 
 device sound
 device snd_ich # nForce audio
Index: sys/i386/conf/GENERIC
===================================================================
--- sys/i386/conf/GENERIC (revision 232404)
+++ sys/i386/conf/GENERIC (working copy)
@@ -315,39 +315,8 @@
 device ehci # EHCI PCI->USB interface (USB 2.0)
 device xhci # XHCI PCI->USB interface (USB 3.0)
 device usb # USB Bus (required)
-#device udbp # USB Double Bulk Pipe devices (needs netgraph)
-device uhid # "Human Interface Devices"
 device ukbd # Keyboard
-device ulpt # Printer
 device umass # Disks/Mass storage - Requires scbus and da
-device ums # Mouse
-device urio # Diamond Rio 500 MP3 player
-# USB Serial devices
-device u3g # USB-based 3G modems (Option, Huawei, Sierra)
-device uark # Technologies ARK3116 based serial adapters
-device ubsa # Belkin F5U103 and compatible serial adapters
-device uftdi # For FTDI usb serial adapters
-device uipaq # Some WinCE based devices
-device uplcom # Prolific PL-2303 serial adapters
-device uslcom # SI Labs CP2101/CP2102 serial adapters
-device uvisor # Visor and Palm devices
-device uvscom # USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-device aue # ADMtek USB Ethernet
-device axe # ASIX Electronics USB Ethernet
-device cdce # Generic USB over Ethernet
-device cue # CATC USB Ethernet
-device kue # Kawasaki LSI USB Ethernet
-device rue # RealTek RTL8150 USB Ethernet
-device udav # Davicom DM9601E USB
-# USB Wireless
-device rum # Ralink Technology RT2501USB wireless NICs
-device run # Ralink Technology RT2700/RT2800/RT3000 NICs.
-device uath # Atheros AR5523 wireless NICs
-device upgt # Conexant/Intersil PrismGT wireless NICs.
-device ural # Ralink Technology RT2500USB wireless NICs
-device urtw # Realtek RTL8187B/L wireless NICs
-device zyd # ZyDAS zd1211/zd1211b wireless NICs
 
 # FireWire support
 device firewire # FireWire bus code
@@ -363,7 +332,6 @@
 device snd_es137x # Ensoniq AudioPCI ES137x
 device snd_hda # Intel High Definition Audio
 device snd_ich # Intel, NVidia and other ICH AC'97 Audio
-device snd_uaudio # USB Audio
 device snd_via8233 # VIA VT8233x Audio
 
 # MMC/SD
Index: sys/ia64/conf/GENERIC
===================================================================
--- sys/ia64/conf/GENERIC (revision 232404)
+++ sys/ia64/conf/GENERIC (working copy)
@@ -127,11 +127,8 @@
 device ehci # EHCI host controller
 device ohci # OHCI PCI->USB interface
 device uhci # UHCI PCI->USB interface
-device uhid # Human Interface Devices
 device ukbd # Keyboard
-device ulpt # Printer
 device umass # Disks/Mass storage (need scbus & da)
-device ums # Mouse
 
 # PCI Ethernet NICs.
 device de # DEC/Intel DC21x4x (``Tulip'')
@@ -162,25 +159,6 @@
 device vge # VIA VT612x gigabit Ethernet
 device xl # 3Com 3c90x ("Boomerang", "Cyclone")
 
-# USB Ethernet
-device aue # ADMtek USB Ethernet
-device axe # ASIX Electronics USB Ethernet
-device cdce # Generic USB over Ethernet
-device cue # CATC USB Ethernet
-device kue # Kawasaki LSI USB Ethernet
-device rue # RealTek RTL8150 USB Ethernet
-device udav # Davicom DM9601E USB
-
-# USB Serial
-device uark # Technologies ARK3116 based serial adapters
-device ubsa # Belkin F5U103 and compatible serial adapters
-device uftdi # For FTDI usb serial adapters
-device uipaq # Some WinCE based devices
-device uplcom # Prolific PL-2303 serial adapters
-device uslcom # SI Labs CP2101/CP2102 serial adapters
-device uvisor # Visor and Palm devices
-device uvscom # USB serial support for DDI pocket's PHS
-
 # Wireless NIC cards.
 # The wlan(4) module assumes this, so just define it so it
 # at least correctly loads.
Index: sys/mips/conf/XLRN32
===================================================================
--- sys/mips/conf/XLRN32 (revision 232404)
+++ sys/mips/conf/XLRN32 (working copy)
@@ -104,9 +104,7 @@
 device ehci # EHCI PCI->USB interface (USB 2.0)
 device usb # USB Bus (required)
 options USB_DEBUG # enable debug msgs
-#device udbp # USB Double Bulk Pipe devices
 #device ugen # Generic
-#device uhid # "Human Interface Devices"
 device umass # Disks/Mass storage - Requires scbus and da
 
 #device cfi
Index: sys/mips/conf/XLR64
===================================================================
--- sys/mips/conf/XLR64 (revision 232404)
+++ sys/mips/conf/XLR64 (working copy)
@@ -103,7 +103,6 @@
 device ehci # EHCI PCI->USB interface (USB 2.0)
 device usb # USB Bus (required)
 options USB_DEBUG # enable debug msgs
-#device uhid # "Human Interface Devices"
 device umass # Disks/Mass storage - Requires scbus and da
 
 #device cfi
Index: sys/mips/conf/std.XLP
===================================================================
--- sys/mips/conf/std.XLP (revision 232404)
+++ sys/mips/conf/std.XLP (working copy)
@@ -81,7 +81,6 @@
 device ehci # EHCI PCI->USB interface (USB 2.0)
 #options USB_DEBUG # enable debug msgs
 #device ugen # Generic
-#device uhid # "Human Interface Devices"
 device umass # Requires scbus and da
 
 options FDT
Index: sys/mips/conf/XLR
===================================================================
--- sys/mips/conf/XLR (revision 232404)
+++ sys/mips/conf/XLR (working copy)
@@ -128,7 +128,6 @@
 device ehci # EHCI PCI->USB interface (USB 2.0)
 device usb # USB Bus (required)
 #options USB_DEBUG # enable debug msgs
-#device uhid # "Human Interface Devices"
 device umass # Disks/Mass storage - Requires scbus and da
 
 #device cfi
Index: sys/mips/conf/OCTEON1
===================================================================
--- sys/mips/conf/OCTEON1 (revision 232404)
+++ sys/mips/conf/OCTEON1 (working copy)
@@ -267,32 +267,4 @@
 device ohci # OHCI PCI->USB interface
 device ehci # EHCI PCI->USB interface (USB 2.0)
 device usb # USB Bus (required)
-#device udbp # USB Double Bulk Pipe devices
-device uhid # "Human Interface Devices"
-device ulpt # Printer
 device umass # Disks/Mass storage - Requires scbus and da
-device ums # Mouse
-device urio # Diamond Rio 500 MP3 player
-# USB Serial devices
-device u3g # USB-based 3G modems (Option, Huawei, Sierra)
-device uark # Technologies ARK3116 based serial adapters
-device ubsa # Belkin F5U103 and compatible serial adapters
-device uftdi # For FTDI usb serial adapters
-device uipaq # Some WinCE based devices
-device uplcom # Prolific PL-2303 serial adapters
-device uslcom # SI Labs CP2101/CP2102 serial adapters
-device uvisor # Visor and Palm devices
-device uvscom # USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-device aue # ADMtek USB Ethernet
-device axe # ASIX Electronics USB Ethernet
-device cdce # Generic USB over Ethernet
-device cue # CATC USB Ethernet
-device kue # Kawasaki LSI USB Ethernet
-device rue # RealTek RTL8150 USB Ethernet
-device udav # Davicom DM9601E USB
-# USB Wireless
-device rum # Ralink Technology RT2501USB wireless NICs
-device uath # Atheros AR5523 wireless NICs
-device ural # Ralink Technology RT2500USB wireless NICs
-device zyd # ZyDAS zd1211/zd1211b wireless NICs
Index: sys/pc98/conf/GENERIC
===================================================================
--- sys/pc98/conf/GENERIC (revision 232404)
+++ sys/pc98/conf/GENERIC (working copy)
@@ -239,36 +239,9 @@
 #device ohci # OHCI PCI->USB interface
 #device ehci # EHCI PCI->USB interface (USB 2.0)
 #device usb # USB Bus (required)
-#device udbp # USB Double Bulk Pipe devices (needs netgraph)
-#device uhid # "Human Interface Devices"
 #device ukbd # Keyboard
-#device ulpt # Printer
 #device umass # Disks/Mass storage - Requires scbus and da
-#device ums # Mouse
-#device urio # Diamond Rio 500 MP3 player
-# USB Serial devices
-#device uark # Technologies ARK3116 based serial adapters
-#device ubsa # Belkin F5U103 and compatible serial adapters
 #device ubser # BWCT console serial adapters
-#device uftdi # For FTDI usb serial adapters
-#device uipaq # Some WinCE based devices
-#device uplcom # Prolific PL-2303 serial adapters
-#device uslcom # SI Labs CP2101/CP2102 serial adapters
-#device uvisor # Visor and Palm devices
-#device uvscom # USB serial support for DDI pocket's PHS
-# USB Ethernet, requires miibus
-#device aue # ADMtek USB Ethernet
-#device axe # ASIX Electronics USB Ethernet
-#device cdce # Generic USB over Ethernet
-#device cue # CATC USB Ethernet
-#device kue # Kawasaki LSI USB Ethernet
-#device rue # RealTek RTL8150 USB Ethernet
-#device udav # Davicom DM9601E USB
-# USB Wireless
-#device rum # Ralink Technology RT2501USB wireless NICs
-#device uath # Atheros AR5523 wireless NICs
-#device ural # Ralink Technology RT2500USB wireless NICs
-#device zyd # ZyDAS zd1211/zd1211b wireless NICs
 
 # FireWire support
 #device firewire # FireWire bus code
@@ -280,4 +253,3 @@
 #device snd_mss # Microsoft Sound System
 #device "snd_sb16" # Sound Blaster 16
 #device snd_sbc # Sound Blaster
-#device snd_uaudio # USB Audio
Index: sys/powerpc/conf/GENERIC64
===================================================================
--- sys/powerpc/conf/GENERIC64 (revision 232404)
+++ sys/powerpc/conf/GENERIC64 (working copy)
@@ -156,19 +156,9 @@
 device ohci # OHCI PCI->USB interface
 device ehci # EHCI PCI->USB interface
 device usb # USB Bus (required)
-device uhid # "Human Interface Devices"
 device ukbd # Keyboard
 options KBD_INSTALL_CDEV # install a CDEV entry in /dev
-device ulpt # Printer
 device umass # Disks/Mass storage - Requires scbus and da0
-device ums # Mouse
-device urio # Diamond Rio 500 MP3 player
-# USB Ethernet
-device aue # ADMtek USB Ethernet
-device axe # ASIX Electronics USB Ethernet
-device cdce # Generic USB over Ethernet
-device cue # CATC USB Ethernet
-device kue # Kawasaki LSI USB Ethernet
 
 # Wireless NIC cards
 options IEEE80211_SUPPORT_MESH
@@ -197,5 +187,4 @@
 # Sound
support device sound # Generic sound driver (required) device snd_ai2s # Apple I2S audio -device snd_uaudio # USB Audio =20 Index: sys/powerpc/conf/GENERIC =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/powerpc/conf/GENERIC (revision 232404) +++ sys/powerpc/conf/GENERIC (working copy) @@ -159,20 +159,9 @@ device ohci # OHCI PCI->USB interface device ehci # EHCI PCI->USB interface device usb # USB Bus (required) -device uhid # "Human Interface Devices" device ukbd # Keyboard options KBD_INSTALL_CDEV # install a CDEV entry in /dev -device ulpt # Printer device umass # Disks/Mass storage - Requires scbus and da0 -device ums # Mouse -device atp # Apple USB touchpad -device urio # Diamond Rio 500 MP3 player -# USB Ethernet -device aue # ADMtek USB Ethernet -device axe # ASIX Electronics USB Ethernet -device cdce # Generic USB over Ethernet -device cue # CATC USB Ethernet -device kue # Kawasaki LSI USB Ethernet =20 # Wireless NIC cards options IEEE80211_SUPPORT_MESH @@ -205,5 +194,4 @@ device sound # Generic sound driver (required) device snd_ai2s # Apple I2S audio device snd_davbus # Apple DAVBUS audio -device snd_uaudio # USB Audio =20 Index: sys/sparc64/conf/GENERIC =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- sys/sparc64/conf/GENERIC (revision 232404) +++ sys/sparc64/conf/GENERIC (working copy) @@ -235,35 +235,8 @@ device ohci # OHCI PCI->USB interface device ehci # EHCI PCI->USB interface (USB 2.0) device usb # USB Bus (required) -#device udbp # USB Double Bulk Pipe devices (needs netgraph) -device uhid # "Human Interface Devices" device ukbd # Keyboard -device ulpt # Printer device umass # Disks/Mass storage - Requires scbus and da -device 
ums # Mouse -device urio # Diamond Rio 500 MP3 player -# USB Serial devices -device uark # Technologies ARK3116 based serial adapters -device ubsa # Belkin F5U103 and compatible serial adapters -device uftdi # For FTDI usb serial adapters -device uipaq # Some WinCE based devices -device uplcom # Prolific PL-2303 serial adapters -device uslcom # SI Labs CP2101/CP2102 serial adapters -device uvisor # Visor and Palm devices -device uvscom # USB serial support for DDI pocket's PHS -# USB Ethernet, requires miibus -device aue # ADMtek USB Ethernet -device axe # ASIX Electronics USB Ethernet -device cdce # Generic USB over Ethernet -device cue # CATC USB Ethernet -device kue # Kawasaki LSI USB Ethernet -device rue # RealTek RTL8150 USB Ethernet -device udav # Davicom DM9601E USB -# USB Wireless -device rum # Ralink Technology RT2501USB wireless NICs -device uath # Atheros AR5523 wireless NICs -device ural # Ralink Technology RT2500USB wireless NICs -device zyd # ZyDAS zd1211/zd1211b wireless NICs =20 # FireWire support device firewire # FireWire bus code @@ -279,4 +252,3 @@ device snd_audiocs # Crystal Semiconductor CS4231 device snd_es137x # Ensoniq AudioPCI ES137x device snd_t4dwave # Acer Labs M5451 -device snd_uaudio # USB Audio --EeQfGwPcQSOJBaQU-- --ZfOjI3PrQbgiZnxM Content-Type: application/pgp-signature; name="signature.asc" Content-Description: Digital signature Content-Disposition: inline -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/kFreeBSD) iQIcBAEBCAAGBQJPUhMEAAoJELd1onhloKnORW0P/RLDg4ie4j4H0zdSTVSp2VdZ 0rKiqr4Umme0zP4weE7who3d+TN96VSpb0PVLHapr/7/24PspdBZL5fpKRe6ewe1 036TXs5E6LzxBJEUDDoY6Jh1hQaDwvftLi1LTSXF4FIluzs01ySXeoFx7eDHKtyg zpevczl5D5Bi97lpBdLQWHgKk0S+0afcw4CA1CfFGSuTkslioMw+HSwC1pp8fGKY vzINI8PWGjEN5z8oGjT+6RktTot8TpRVb2Yhe8V0T5N4AJHMTg0kKEya+wWNiLd/ f8Ur4r8mQPCXma4Etb0NNpMXzCWXaHmI6V9HT60TCuF+PN8pyYakaesJI1k5hYW0 tJf9h32QAtfl2CTtMRJ4/ZfSFBOtJCVpMd3okwm0b4nKLmNsmZ8KpvowtN+7lfQe 
+DxPQalBBwSEAbbAF1aSdvLQ7GfnUTxlWCZZgDVlVnOBUXmb2ar04BiW+8ZiXHui 7TcbaQK9wC293U6hePhCUlkW+OzgtKVz39J+DCH3DBQSqGyG9I6NIAiwG9xIunwV 941M521a/8SLZvK1+d4vKxwsb9j14z8Vpd3XyYPDg/8fCsGISiPqRGeMqVUu47VC N4eMO2Qanv1rYE5l0ChC/kcnB/rRbitM/+CG2d9+XA0M3gIf8zRCGdIwPpuvBvuk FHJ8GL02AfLYso/rsX9F =Ek1B -----END PGP SIGNATURE----- --ZfOjI3PrQbgiZnxM-- From owner-freebsd-arch@FreeBSD.ORG Sat Mar 3 14:17:50 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id F0670106564A for ; Sat, 3 Mar 2012 14:17:50 +0000 (UTC) (envelope-from johnandsara2@cox.net) Received: from eastrmfepo101.cox.net (eastrmfepo101.cox.net [68.230.241.213]) by mx1.freebsd.org (Postfix) with ESMTP id 8A6D98FC08 for ; Sat, 3 Mar 2012 14:17:49 +0000 (UTC) Received: from eastrmimpo110.cox.net ([68.230.241.223]) by eastrmfepo101.cox.net (InterMail vM.8.01.04.00 201-2260-137-20101110) with ESMTP id <20120303141738.KEKX18243.eastrmfepo101.cox.net@eastrmimpo110.cox.net>; Sat, 3 Mar 2012 09:17:38 -0500 Received: from [192.168.3.22] ([70.177.172.35]) by eastrmimpo110.cox.net with bizsmtp id h2Hd1i00W0mAvba022HeRe; Sat, 03 Mar 2012 09:17:38 -0500 X-CT-Class: Clean X-CT-Score: 0.00 X-CT-RefID: str=0001.0A020203.4F522802.0081,ss=1,re=0.000,fgs=0 X-CT-Spam: 0 X-Authority-Analysis: v=1.1 cv=4+d3365FwXO39Q6CIaohezzFfUymJ8jBUV6iqnnMg0E= c=1 sm=1 a=f5xKl4ys9bwA:10 a=AeehsHawFTcA:10 a=G8Uczd0VNMoA:10 a=Wajolswj7cQA:10 a=8nJEP1OIZ-IA:10 a=alU6Bxxa4qBWIf+k8j/ISQ==:17 a=kviXuzpPAAAA:8 a=mMHHQ8NI51iLL4NWVPUA:9 a=wPNLvfGTeEIA:10 a=4vB-4DCPJfMA:10 a=alU6Bxxa4qBWIf+k8j/ISQ==:117 X-CM-Score: 0.00 Authentication-Results: cox.net; none Message-ID: <4F5227FE.3080708@cox.net> Date: Sat, 03 Mar 2012 09:17:34 -0500 From: "John D. 
Hendrickson and Sara Darnell" User-Agent: Thunderbird 2.0.0.24 (X11/20100228) MIME-Version: 1.0 To: Ed Schouten References: <4E18ABB1.4010304@cox.net> <20110709194639.GA4914@elie> <4E18EE60.7010402@cox.net> <20110710151354.GA25475@r500-debian> <4F50FD4D.9000106@cox.net> <20120302183138.GC32748@hoeg.nl> In-Reply-To: <20120302183138.GC32748@hoeg.nl> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-arch@freebsd.org Subject: Re: dep-trace v. tsort (mac ports depends support) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list Reply-To: johnandsara2@cox.net List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 Mar 2012 14:17:51 -0000 Hi and thanks for looking ! Yes and no (no), I thought of that. Who knows to order depends so tsort can order them in non-topological order as output? Who has time to SVG plot program compile order (or pkg) depends like airports and airplanes and draw arrows between them? Another issue of pre-positioning each sublists in port files : you are SOL if there is any loss of order before tsort gets them. Another issue (one port not knowing the full sublist of the other) (there are probably more I'll stop there). Have Fun! -- John Ed Schouten wrote: > Hi John, > > * John D. Hendrickson and Sara Darnell , 20120302 18:03: >> BSD and Apple needs tsort(1) for portage still I believe. >> >> Topological sorting isn't quite right packaging. >> >> [...] >> >> (ie, for portage: you need to dl source, order of compile may be >> required, sometimes gets missing message or "loop in depends" message >> when attempting to compile and install pkg) > > But wait. Isn't this because of mis-use of tsort(1) by portage? > > tsort(1) can give you any ordering you like, as long as you make sure > your input graph is correct. 
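The quoted point above, that tsort(1) produces a usable order whenever the input graph is correct, can be illustrated outside the shell. What follows is a minimal sketch, assuming Python 3.9 or later and its standard `graphlib` module; the port names and the `build_order` helper are hypothetical and do not come from any ports tree. It also shows a dependency loop being reported rather than silently ordered.

```python
from graphlib import TopologicalSorter, CycleError


def build_order(depends):
    """Return a build order in which every port follows its dependencies.

    `depends` maps a port name to the set of ports it depends on.
    """
    return list(TopologicalSorter(depends).static_order())


# Hypothetical ports, for illustration only.
ports = {
    "app": {"libfoo", "libbar"},
    "libfoo": {"libbase"},
    "libbar": {"libbase"},
    "libbase": set(),
}

order = build_order(ports)
print(order)

# A loop in the dependency graph raises CycleError; the second element
# of exc.args holds the nodes that form the cycle.
try:
    build_order({"a": {"b"}, "b": {"a"}})
except CycleError as exc:
    print("loop in depends:", exc.args[1])
```

Several orders can be valid for the same graph (here `libfoo` and `libbar` may appear in either order), so a caller should rely only on dependencies preceding their dependents.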
> From owner-freebsd-arch@FreeBSD.ORG Sat Mar 3 15:12:40 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 8CCCB106564A for ; Sat, 3 Mar 2012 15:12:40 +0000 (UTC) (envelope-from ed@hoeg.nl) Received: from mx0.hoeg.nl (mx0.hoeg.nl [IPv6:2a01:4f8:101:5343::aa]) by mx1.freebsd.org (Postfix) with ESMTP id 24BAA8FC08 for ; Sat, 3 Mar 2012 15:12:40 +0000 (UTC) Received: by mx0.hoeg.nl (Postfix, from userid 1000) id 5C4272A28CCF; Sat, 3 Mar 2012 16:12:39 +0100 (CET) Date: Sat, 3 Mar 2012 16:12:39 +0100 From: Ed Schouten To: "John D. Hendrickson and Sara Darnell" Message-ID: <20120303151239.GF32748@hoeg.nl> References: <4E18ABB1.4010304@cox.net> <20110709194639.GA4914@elie> <4E18EE60.7010402@cox.net> <20110710151354.GA25475@r500-debian> <4F50FD4D.9000106@cox.net> <20120302183138.GC32748@hoeg.nl> <4F5227FE.3080708@cox.net> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="Rln2GmQ7CFmDhc9B" Content-Disposition: inline In-Reply-To: <4F5227FE.3080708@cox.net> User-Agent: Mutt/1.5.21 (2010-09-15) Cc: freebsd-arch@freebsd.org Subject: Re: dep-trace v. tsort (mac ports depends support) X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 Mar 2012 15:12:40 -0000 --Rln2GmQ7CFmDhc9B Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Hi John, * John D. Hendrickson and Sara Darnell , 20120303 15:= 17: > Who knows to order depends so tsort can order them in non-topological > order as output? Who has time to SVG plot program compile order (or > pkg) depends like airports and airplanes and draw arrows between > them? 
>=20 > Another issue of pre-positioning each sublists in port files : you > are SOL if there is any loss of order before tsort gets them. >=20 > Another issue (one port not knowing the full sublist of the other) > (there are probably more I'll stop there). But the point is that if applications are looking for such things, they are typically implemented in a higher-level programming language that allows them to implement such features themselves, instead of relying on a 1980s command line tool. --=20 Ed Schouten WWW: http://80386.nl/ --Rln2GmQ7CFmDhc9B Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (FreeBSD) iQIcBAEBAgAGBQJPUjTnAAoJEG5e2P40kaK7OvcP/13zepoBcf0y2NuGVZZsG/dB nOcuBe+iN4NbOLc5NpQXpa1Jo25gXTlrHBx5QuYy12gRIlDHTOAGZNMhUZq0zUgX BQM88POAHP5xUDV0d9icq4PKgeOfykoEOflXNezuefhc5ImCEASGd2fY2R1nw3T4 iwuj8jNV97wj4bZn0RIvByUBcSMVLOZnyRq+6SMxUUJ+IDlFOVtnsiwGuJH5qoI3 8ZezSMIw5QAUX+2xuDNXInQlRrYxLk/B6xcKTZWq0nqj2cy6uwRBTqtXdeiZ7KVT u0jpZuvSSFf3oZRSa51Puqx0VVHGY9OVDHQigncCVc7wY62XwoBJ8nO3FG9wTndP 92yAvq7Fkf0UgQo1gpaROVtVioPRhS6h4g6o4cW0qtrs4XsygEcne3OOuywZiL+R TLbah2ByZNS05RT8XkKQVs1Jn+6I+fDn0LNfobSvfZdtL6BCL2iJUgqgMPzRvDMR uzAvhtMQhFXhG2BN0GFoQjLD18M8mC5xChs/hluAsmBHM5u5+Qqtu9c7YlqDlpzl YArTAXqiGKW/5p1jrBFAVNpe3H9tZ2de10RI817dVfmXFEdn3oCSPP2Qn7Ih5FIh 90HfmC1JQ4oJSl4ITNXJDiV0kBlf1hwKq7Dr3kh0zt7eqWGupmdlueQeXat2JbzF JQoJyCk5r76mqZ+P5Nb/ =+bRr -----END PGP SIGNATURE----- --Rln2GmQ7CFmDhc9B-- From owner-freebsd-arch@FreeBSD.ORG Sat Mar 3 18:04:03 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 54C79106582B; Sat, 3 Mar 2012 18:04:00 +0000 (UTC) (envelope-from hselasky@c2i.net) Received: from swip.net (mailfe05.c2i.net [212.247.154.130]) by mx1.freebsd.org (Postfix) with ESMTP id 4B0808FC13; Sat, 3 Mar 2012 18:03:58 +0000 (UTC) X-T2-Spam-Status: No, hits=-0.2 required=5.0 tests=ALL_TRUSTED, BAYES_50 
Received: from [176.74.212.201] (account mc467741@c2i.net HELO laptop002.hselasky.homeunix.org) by mailfe05.swip.net (CommuniGate Pro SMTP 5.4.2) with ESMTPA id 244285303; Sat, 03 Mar 2012 19:03:51 +0100 From: Hans Petter Selasky To: freebsd-arch@freebsd.org Date: Sat, 3 Mar 2012 19:02:10 +0100 User-Agent: KMail/1.13.5 (FreeBSD/8.3-PRERELEASE; KDE/4.4.5; amd64; ; ) References: <201202181720.27135.hselasky@c2i.net> <20120303124805.GA4725@thorin> In-Reply-To: <20120303124805.GA4725@thorin> X-Face: 'mmZ:T{)),Oru^0c+/}w'`gU1$ubmG?lp!=R4Wy\ELYo2)@'UZ24N@d2+AyewRX}mAm; Yp |U[@, _z/([?1bCfM{_"B<.J>mICJCHAzzGHI{y7{%JVz%R~yJHIji`y>Y}k1C4TfysrsUI -%GU9V5]iUZF&nRn9mJ'?&>O MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Message-Id: <201203031902.11035.hselasky@c2i.net> Cc: Kostik Belousov , Adrian Chadd , freebsd-usb@freebsd.org, Robert Millan Subject: Re: Exclude USB drivers from main kernel image? X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 Mar 2012 18:04:03 -0000 On Saturday 03 March 2012 13:48:05 Robert Millan wrote: > On Sat, Feb 18, 2012 at 05:20:27PM +0100, Hans Petter Selasky wrote: > > The /etc/devd/usb.conf is regularly updated, though not automatically. It > > should auto-load most kind of devices. Only additional case that comes to > > mind is that USB serial console will not be active until devd has > > executed, if that is enabled. > > If early USB serial output is desired, it can be enabled by enabling the > module in bootloader. Is that an acceptable trade-off? > > > Your patch looks OK. Adding ARCH @ > > > > Instead of commenting out, I would just remove those lines. > > Here's a new patch that removes the lines instead of commenting them out. 
> > Consistently with that, it also removes a few lines which were already > commented out, using the same criteria. > > Also, it disables a few more USB drivers. Due to an oversight my previous > patch didn't disable all drivers that devd can handle. > > Patch is tested with "make universe" on HEAD. Hi, Your patch looks good. Are there any objections committing the patch attached to the previous e-mail? --HPS From owner-freebsd-arch@FreeBSD.ORG Sat Mar 3 18:42:12 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.ORG Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D7900106564A; Sat, 3 Mar 2012 18:42:12 +0000 (UTC) (envelope-from imp@bsdimp.com) Received: from harmony.bsdimp.com (bsdimp.com [199.45.160.85]) by mx1.freebsd.org (Postfix) with ESMTP id 628A08FC08; Sat, 3 Mar 2012 18:42:09 +0000 (UTC) Received: from 63.imp.bsdimp.com (63.imp.bsdimp.com [10.0.0.63]) (authenticated bits=0) by harmony.bsdimp.com (8.14.4/8.14.3) with ESMTP id q23Ibnj7081759 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES128-SHA bits=128 verify=NO); Sat, 3 Mar 2012 11:37:50 -0700 (MST) (envelope-from imp@bsdimp.com) Mime-Version: 1.0 (Apple Message framework v1084) Content-Type: text/plain; charset=us-ascii From: Warner Losh In-Reply-To: <201203031902.11035.hselasky@c2i.net> Date: Sat, 3 Mar 2012 11:37:49 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: References: <201202181720.27135.hselasky@c2i.net> <20120303124805.GA4725@thorin> <201203031902.11035.hselasky@c2i.net> To: Hans Petter Selasky X-Mailer: Apple Mail (2.1084) X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1 (harmony.bsdimp.com [10.0.0.6]); Sat, 03 Mar 2012 11:37:50 -0700 (MST) Cc: Kostik Belousov , Adrian Chadd , Robert Millan , freebsd-usb@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG Subject: Re: Exclude USB drivers from main kernel image? 
X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 Mar 2012 18:42:12 -0000

On Mar 3, 2012, at 11:02 AM, Hans Petter Selasky wrote:

> On Saturday 03 March 2012 13:48:05 Robert Millan wrote:
>> On Sat, Feb 18, 2012 at 05:20:27PM +0100, Hans Petter Selasky wrote:
>>> The /etc/devd/usb.conf is regularly updated, though not automatically. It
>>> should auto-load most kind of devices. Only additional case that comes to
>>> mind is that USB serial console will not be active until devd has
>>> executed, if that is enabled.
>>
>> If early USB serial output is desired, it can be enabled by enabling the
>> module in bootloader. Is that an acceptable trade-off?
>>
>>> Your patch looks OK. Adding ARCH @
>>>
>>> Instead of commenting out, I would just remove those lines.
>>
>> Here's a new patch that removes the lines instead of commenting them out.
>>
>> Consistently with that, it also removes a few lines which were already
>> commented out, using the same criteria.
>>
>> Also, it disables a few more USB drivers. Due to an oversight my previous
>> patch didn't disable all drivers that devd can handle.
>>
>> Patch is tested with "make universe" on HEAD.
>
> Hi,
>
> Your patch looks good.
>
> Are there any objections committing the patch attached to the previous e-mail?

Do all the platforms that had the devices removed work? Have they all been tested?

Warner
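
The trade-off raised in the quoted thread, loading a USB serial driver from the bootloader instead of waiting for devd to load it, might look like the following /boot/loader.conf fragment. This is a sketch only: uftdi is just one of the drivers named in the patch, and whether a given machine needs it this early is an assumption.

```
# /boot/loader.conf (illustrative)
# Load the FTDI USB serial driver from the bootloader rather than
# waiting for devd to load it once the device attaches.
uftdi_load="YES"
```

Other drivers removed from the kernel configs could be loaded the same way, using the module name followed by `_load="YES"`.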