From owner-freebsd-arch Sun Oct 29 6:14:54 2000 Delivered-To: freebsd-arch@freebsd.org Received: from njord.bart.nl (njord.bart.nl [194.158.170.15]) by hub.freebsd.org (Postfix) with ESMTP id AFC9B37B4C5 for ; Sun, 29 Oct 2000 06:14:51 -0800 (PST) Received: from daemon.chronias.ninth-circle.org (root@daemon.ninth-circle.org [195.38.210.81]) by njord.bart.nl (8.10.1/8.10.1) with ESMTP id e9TEDeO39116 for ; Sun, 29 Oct 2000 15:13:56 +0100 (CET) Received: (from asmodai@localhost) by daemon.chronias.ninth-circle.org (8.11.0/8.11.0) id e9TEC6k98021 for arch@freebsd.org; Sun, 29 Oct 2000 15:12:06 +0100 (CET) (envelope-from asmodai) Date: Sun, 29 Oct 2000 15:12:06 +0100 From: Jeroen Ruigrok/Asmodai To: arch@freebsd.org Subject: endian.h Message-ID: <20001029151205.C88101@daemon.ninth-circle.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i Organisation: Ninth-Circle Enterprises Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG This is a bit of a hairy problem. On FreeBSD we have which defines things like BYTE_ORDER and BIG_ENDIAN. On OpenBSD we have which is included from <[archname]/endian.h> and defines BIG_ENDIAN and BYTE_ORDER. On NetBSD we also have , but we also have <[archname]/endian.h> and <[archname]/endian_machdep.h> and there we define _BYTE_ORDER and _BIG_ENDIAN. However they also wrap BYTE_ORDER and BIG_ENDIAN with _POSIX_SOURCE to provide backwards compatibility. Linux has which defines __BYTE_ORDER and __BIG_ENDIAN. I cannot find, this quickly, any references towards an endian.h or BYTE_ORDER/BIG_ENDIAN standardness in C99, POSIX [latest drafts] or in SUSv2. Is someone able to shed some light on this? I could envision that we might want a symlinked to and possibly our old and traditional BSD BYTE_ORDER definitions wrapped as NetBSD did and then define some underscored names for usage in POSIX environments. 
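The wrapping idea in the last paragraph can be sketched roughly as follows. This is only an illustration under stated assumptions: every EX_ name is hypothetical (chosen so the sketch cannot clash with whatever endian header the host system ships), and the compile-time value is taken from the __BYTE_ORDER__ macro that GCC and Clang predefine, where a real per-architecture header would simply hard-code it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical names throughout. EX_PRIV_* stand in for the underscored,
 * always-visible names; EX_* stand in for the traditional BSD names.
 */
#define EX_PRIV_LITTLE_ENDIAN 1234
#define EX_PRIV_BIG_ENDIAN    4321

/* Assumption: the compiler predefines __BYTE_ORDER__ (GCC and Clang do). */
#if !defined(__BYTE_ORDER__)
#error "this sketch assumes a compiler that predefines __BYTE_ORDER__"
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define EX_PRIV_BYTE_ORDER EX_PRIV_BIG_ENDIAN
#elif __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define EX_PRIV_BYTE_ORDER EX_PRIV_LITTLE_ENDIAN
#else
#error "unsupported byte order for this sketch"
#endif

/* Traditional names: only visible when _POSIX_SOURCE is not defined. */
#ifndef _POSIX_SOURCE
#define EX_LITTLE_ENDIAN EX_PRIV_LITTLE_ENDIAN
#define EX_BIG_ENDIAN    EX_PRIV_BIG_ENDIAN
#define EX_BYTE_ORDER    EX_PRIV_BYTE_ORDER
#endif

/* Runtime cross-check: look at the first byte of a known 32-bit value. */
static int ex_runtime_byte_order(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 0x01 ? EX_PRIV_BIG_ENDIAN : EX_PRIV_LITTLE_ENDIAN;
}
```

Code that has to build in a strict POSIX environment would test EX_PRIV_BYTE_ORDER; older BSD-style code could keep testing the unprefixed name wherever _POSIX_SOURCE is not in effect.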
-- Jeroen Ruigrok vd Werven/Asmodai asmodai@[wxs.nl|bart.nl|freebsd.org] Documentation nutter/C-rated Coder BSD: Technical excellence at its best The BSD Programmer's Documentation Project Only in sleep can one find salvation that resembles Death... To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Oct 30 3:10:24 2000 Delivered-To: freebsd-arch@freebsd.org Received: from lists01.iafrica.com (lists01.iafrica.com [196.7.0.141]) by hub.freebsd.org (Postfix) with ESMTP id 169D737B4C5; Mon, 30 Oct 2000 03:10:20 -0800 (PST) Received: from nwl.fw.uunet.co.za ([196.31.2.162]) by lists01.iafrica.com with esmtp (Exim 3.12 #2) id 13qCpO-00077m-00; Mon, 30 Oct 2000 13:10:10 +0200 Received: (from nobody@localhost) by nwl.fw.uunet.co.za (8.8.8/8.6.9) id NAA01116; Mon, 30 Oct 2000 13:10:16 +0200 (SAST) Received: by nwl.fw.uunet.co.za via recvmail id 861; Mon Oct 30 13:09:06 2000 Received: from sheldonh (helo=axl.fw.uunet.co.za) by axl.fw.uunet.co.za with local-esmtp (Exim 3.16 #1) id 13qCoM-0001pP-00; Mon, 30 Oct 2000 13:09:06 +0200 From: Sheldon Hearn To: obrien@freebsd.org Cc: John Baldwin , arch@freebsd.org Subject: Re: moving manpages out of /sys/modules In-reply-to: Your message of "Thu, 05 Oct 2000 12:36:45 MST." <20001005123644.A56993@dragon.nuxi.com> Date: Mon, 30 Oct 2000 13:09:06 +0200 Message-ID: <7030.972904146@axl.fw.uunet.co.za> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Thu, 05 Oct 2000 12:36:45 MST, "David O'Brien" wrote: > I've asked for a repo copy of the scripts. Sheldon is dealing with the > manpages. I've asked for the shell scripts to go into usr.sbin as these > really aren't general user commands. In my opinion, these scripts should just go away. Ciao, Sheldon. 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Oct 30 10:16:54 2000 Delivered-To: freebsd-arch@freebsd.org Received: from falla.videotron.net (falla.videotron.net [205.151.222.106]) by hub.freebsd.org (Postfix) with ESMTP id D95F737B479; Mon, 30 Oct 2000 10:16:47 -0800 (PST) Received: from modemcable213.3-201-24.mtl.mc.videotron.ca ([24.201.3.213]) by falla.videotron.net (Sun Internet Mail Server sims.3.5.1999.12.14.10.29.p8) with ESMTP id <0G3900M7G9F772@falla.videotron.net>; Mon, 30 Oct 2000 13:16:19 -0500 (EST) Date: Mon, 30 Oct 2000 13:20:52 -0500 (EST) From: Bosko Milekic Subject: MP: per-CPU mbuf allocation lists X-Sender: bmilekic@jehovah.technokratis.com To: freebsd-net@freebsd.org Cc: freebsd-arch@freebsd.org Message-id: MIME-version: 1.0 Content-type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG [cross-posted to freebsd-arch and freebsd-net, please continue discussion on freebsd-net] Hello, I recently wrote an initial "scratch pad" design for per-CPU mbuf lists (in the MP case). The design consists simply of introducing these "fast" lists for each CPU and populating them with mbufs on bootup. Allocations from these lists would not need to be protected with a mutex as each CPU has its own. The general mmbfree list remains, and remains protected with a mutex, in case the per-CPU list is empty. My initial idea was to leave freeing to the general list, and have a kproc "daemon" periodically populate the "fast" lists. This would have of course involved the addition of a mutex for each "fast" list as well, in order to insure synch with the kproc. However, in the great majority of cases when the kproc would be sleeping, the acquiring of the mutex for the fast list would be very cheap, as waiting for it would never be an issue. Yesterday, Alfred pointed me to the HOARD web page and made several suggestions... 
all worthy of my attention. The changes I have decided to make to the
design will make the system work as follows:

  - "Fast" list; a per-CPU mbuf list. Each typically contains "w" (for
    "watermark") mbufs; more on this below.

  - The general (already existing) mmbfree list; a mutex-protected global
    list, used in case the fast list is empty for the given CPU.

  - Allocations; all done from the "fast" lists. All are very fast, in the
    general case. If no mbufs are available on the fast list, the general
    mmbfree list's lock is acquired, and an mbuf is taken from there. If no
    mbuf is available, even from the general list, we let go of the lock,
    allocate a page from mb_map, and drop the mbufs onto our fast list,
    from which we grab the one we need. If mb_map is starved, then:
      (a) if M_NOWAIT, return ENOBUFS
      (b) otherwise go to sleep; on timeout, return ENOBUFS
      (c) no timeout, so we got a wakeup; the wakeup was accompanied by
          the acquiring of the mmbfree general list mutex. Since we were
          sleeping, we are assured that there is an mbuf waiting for us on
          the general mmbfree list, so we grab it and drop the lock (see
          the "freeing" section on why we know there's one on mmbfree).

  - Freeing; first, if someone is sleeping, we grab the mmbfree global
    list mutex, drop the mbuf there, and then issue a wakeup. If nobody
    is sleeping, then we proceed as follows:
      (a) if our fast list has fewer than "w" mbufs, put the mbuf on our
          fast list and then we're done
      (b) otherwise, since our fast list already has "w" mbufs, acquire
          the mmbfree mutex and drop the mbuf there.

Things to note:

  - If we're out of mbufs on our fast list, and the general mmbfree list
    has none available either, and mb_map is starved, we will return
    ENOBUFS even though there may be free mbufs on other CPUs' fast
    lists. This behavior will usually be an indication of a wrongly
    chosen watermark ("w"), and we will have to consider how to inform
    our users on how to properly select a watermark.
    I already have some ideas for alternate situations/ways of handling
    this, but will leave this investigation for later.

  - "w" is a tunable watermark. No fast list will ever contain more than
    "w" mbufs. This presents a small problem. Consider a situation where
    we initially set w = 500; consider we have two CPUs; consider CPU1's
    fast list eventually gets 450 mbufs, and CPU2's fast list gets 345.
    Consider then that we decide to set w = 200. Even though all
    subsequent freeing will be done to the mmbfree list, unless we
    eventually go under the 200 mark for our free list, we will likely
    end up sitting with more than 200 mbufs on each CPU's fast list. The
    idea I presently have is to have a kproc "garbage collect" the mbufs
    in excess of "w" on the CPUs' fast lists and put them back onto the
    mmbfree general list, if it detects that "w" has been lowered.

I'm looking for input. Please feel free to comment with the _specifics_
of the system in mind.

Thanks in advance to Alfred who has already generated input. :-)

Cheers,
Bosko Milekic
bmilekic@technokratis.com

P.S.: Most of the beneficial effects of this system will get to be seen
when the stack is fully threaded.

From owner-freebsd-arch Mon Oct 30 11:30:13 2000
Delivered-To: freebsd-arch@freebsd.org
Received: from implode.root.com (root.com [209.102.106.178]) by hub.freebsd.org (Postfix) with ESMTP id A229837B4C5; Mon, 30 Oct 2000 11:30:10 -0800 (PST)
Received: from implode.root.com (localhost [127.0.0.1]) by implode.root.com (8.8.8/8.8.5) with ESMTP id LAA01623; Mon, 30 Oct 2000 11:27:50 -0800 (PST)
Message-Id: <200010301927.LAA01623@implode.root.com>
To: Bosko Milekic
Cc: freebsd-net@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG
Subject: Re: MP: per-CPU mbuf allocation lists
In-reply-to: Your message of "Mon, 30 Oct 2000 13:20:52 EST."
From: David Greenman Reply-To: dg@root.com Date: Mon, 30 Oct 2000 11:27:50 -0800 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > I recently wrote an initial "scratch pad" design for per-CPU mbuf > lists (in the MP case). The design consists simply of introducing > these "fast" lists for each CPU and populating them with mbufs on bootup. > Allocations from these lists would not need to be protected with a mutex > as each CPU has its own. The general mmbfree list remains, and remains > protected with a mutex, in case the per-CPU list is empty. I have only one question - is the lock overhead really so high that this is needed? -DG David Greenman Co-founder, The FreeBSD Project - http://www.freebsd.org President, TeraSolutions, Inc. - http://www.terasolutions.com Pave the road of life with opportunities. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Oct 30 14:14:32 2000 Delivered-To: freebsd-arch@freebsd.org Received: from feral.com (feral.com [192.67.166.1]) by hub.freebsd.org (Postfix) with ESMTP id 5317137B479; Mon, 30 Oct 2000 14:14:28 -0800 (PST) Received: from beppo (beppo [192.67.166.79]) by feral.com (8.9.3/8.9.3) with ESMTP id OAA03349; Mon, 30 Oct 2000 14:14:03 -0800 Date: Mon, 30 Oct 2000 14:14:03 -0800 (PST) From: Matthew Jacob Reply-To: mjacob@feral.com To: David Greenman Cc: Bosko Milekic , freebsd-net@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG Subject: Re: MP: per-CPU mbuf allocation lists In-Reply-To: <200010301927.LAA01623@implode.root.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Mon, 30 Oct 2000, David Greenman wrote: > > I recently wrote an initial "scratch pad" design for per-CPU mbuf > > lists (in the MP case). 
The design consists simply of introducing > > these "fast" lists for each CPU and populating them with mbufs on bootup. > > Allocations from these lists would not need to be protected with a mutex > > as each CPU has its own. The general mmbfree list remains, and remains > > protected with a mutex, in case the per-CPU list is empty. > > I have only one question - is the lock overhead really so high that this > is needed? If you know that you can also pre-busdma wrap these lists (which is required for full alpha support, and may(?) be for ia64), yes, this makes sense to me (at least). I had a friend at Sun not speak to me for years because I didn't do this for the Solaris DKI/DDI. -matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Mon Oct 30 16:39:52 2000 Delivered-To: freebsd-arch@freebsd.org Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135]) by hub.freebsd.org (Postfix) with ESMTP id 7444B37B4C5; Mon, 30 Oct 2000 16:39:44 -0800 (PST) Received: (from daemon@localhost) by smtp05.primenet.com (8.9.3/8.9.3) id RAA00073; Mon, 30 Oct 2000 17:40:11 -0700 (MST) Received: from usr05.primenet.com(206.165.6.205) via SMTP by smtp05.primenet.com, id smtpdAAAE4aWfa; Mon Oct 30 17:40:04 2000 Received: (from tlambert@localhost) by usr05.primenet.com (8.8.5/8.8.5) id RAA29844; Mon, 30 Oct 2000 17:39:26 -0700 (MST) From: Terry Lambert Message-Id: <200010310039.RAA29844@usr05.primenet.com> Subject: Re: MP: per-CPU mbuf allocation lists To: mjacob@feral.com Date: Tue, 31 Oct 2000 00:39:26 +0000 (GMT) Cc: dg@root.com (David Greenman), bmilekic@dsuper.net (Bosko Milekic), freebsd-net@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG In-Reply-To: from "Matthew Jacob" at Oct 30, 2000 02:14:03 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: 
FreeBSD.ORG

> > > I recently wrote an initial "scratch pad" design for per-CPU mbuf
> > > lists (in the MP case). The design consists simply of introducing
> > > these "fast" lists for each CPU and populating them with mbufs on bootup.
> > > Allocations from these lists would not need to be protected with a mutex
> > > as each CPU has its own. The general mmbfree list remains, and remains
> > > protected with a mutex, in case the per-CPU list is empty.
> >
> > I have only one question - is the lock overhead really so high that this
> > is needed?
>
> If you know that you can also pre-busdma wrap these lists (which is required
> for full alpha support, and may(?) be for ia64), yes, this makes sense to me
> (at least). I had a friend at Sun not speak to me for years because I didn't
> do this for the Solaris DKI/DDI.

I really, really like per CPU resource pools. It's one of my SMP
"hot buttons".

I suggest reading the technical white papers at:

    http://www.sequent.com/software/operatingsys/dynix.html

I also recommend "UNIX for Modern Architectures", and chapters 12 and
15 of the Vahalia book:

    UNIX Internals: The New Frontiers
    Uresh Vahalia
    Prentice Hall
    ISBN: 0-13-101908-2

Chapter 15 has an interesting point about LWPs in SVR4/MP, where kernel
mappings use lazy TLB shootdown, but user mappings use an immediate
shootdown policy, to ensure that a mapping going away in one LWP goes
away in other LWPs.

Bonwick, who wrote the Solaris SLAB allocator, acknowledges in his paper
about the allocator that it would benefit from per-CPU caches, like
those in Dynix. See also:

    http://www.usenix.org/publications/library/proceedings/bos94/bonwick.html

Unfortunately, the Winter 1993 proceedings, which have the Dynix paper,
are not online. The authors of the paper have an insane number of SMP
related patents, including a neat one on "A substantially zero overhead
mutual exclusion apparatus..." that looks like soft updates applied to
concurrency.
Per CPU resource pools with high and low watermarking were part of the
Dynix design. Here is a passage from Vahalia, chapter 12, about it (my
comments and contextual clues are bracketed):

    12.9 A Hierarchical Allocator for Multiprocessors

    Memory allocation for a shared-memory multiprocessor raises some
    additional concerns. Data structures such as free lists and
    allocation bitmaps used by traditional systems are not
    multiprocessor-safe and must be protected by locks. In large,
    parallel systems, this results in heavy contention for these locks,
    and CPUs frequently stall while waiting for these locks to be
    released.

    One solution to this problem is implemented in Dynix, a
    multiprocessor UNIX variant for the Sequent S2000 machines. It uses
    a hierarchical allocation scheme that supports the System V
    programming interface. The Sequent multiprocessors are used in large
    on-line transaction-processing environments, and the allocator
    performs well under that load.

    Figure 12-9 [omitted] describes the design of the allocator. The
    lowest (per-CPU) layer allows the fastest operations, while the
    highest (coalesce-to-page) layer is for the time-consuming
    coalescing process. There is also (not shown) a coalesce-to-vmblock
    layer, which manages page allocation within large (4MB-sized) chunks
    of memory. [the diagram has a middle global layer, from which the
    per-CPU caches are refilled]

    The per-CPU layer manages one set of power-of-two pools for each
    processor. These pools are insulated from the other processors, and
    hence can be accessed without acquiring global locks. Allocation and
    release are fast in most cases, as only the local freelist is
    involved.

    Whenever the per-CPU freelist becomes empty, it can be replenished
    from the global layer, which maintains its own power-of-two pools.
    Likewise, excess buffers in the per-CPU cache can be returned to the
    global free list.
    As an optimization, buffers are moved between these two layers in
    target-sized groups (three buffers per move in the case shown in
    Figure 12-9), preventing unnecessary linked-list operations.

    To accomplish this, the per-CPU layer maintains two free lists, main
    and aux. Allocation and release primarily use the main free list.
    When this becomes empty, the buffers on aux are moved to main, and
    the aux list is replenished from the global layer. Likewise, when
    the main list overflows (size exceeds target), it is moved to aux,
    and the buffers on aux are returned to the global layer. This way
    the global layer is accessed at most once per target-number of
    accesses. The value of target is a tunable parameter. Increasing
    target reduces the number of global accesses, but ties up more
    buffers in per-CPU caches.

    ...

    12.9.1 Analysis

    The Dynix algorithm provides efficient memory allocation for shared
    memory multiprocessors. It supports the standard System V interface,
    and allows memory to be exchanged between the allocator and the
    paging system. The per-CPU caches reduce the contention on the
    global lock, and the dual free lists provide a fast exchange of
    buffers between the per-CPU and global layers.

    It is interesting to contrast the Dynix coalescing approach with
    that of the Mach zone-based allocator. The Mach algorithm employs a
    mark-and-sweep method, linearly scanning the entire pool each time.
    This is computationally expensive, and hence is relegated to a
    separate background task. In Dynix, each time blocks are released
    [from the global layer] to the coalesce-to-page layer, the per-page
    data structures are updated to account for them. When all the
    buffers in a page are freed, the page can be returned to the paging
    system. This happens in the foreground, as part of the processing of
    release operations. The incremental cost for each release operation
    is small; hence it does not lead to unbounded worst case performance
    [unlike the Mach allocator].
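The main/aux exchange described in the passage can be modelled with plain counters, which makes the "at most once per target accesses" property easy to check. This is a minimal sketch rather than Dynix code: the struct and function names are made up for illustration, buffers are only counted (no linked lists, no real lock), and the cold-start handling when both lists are empty is an assumption of this sketch.

```c
#include <assert.h>

/* Buffer counts only; a real allocator would keep linked lists. */
struct percpu_layer {
    int main_n;    /* buffers on the main free list */
    int aux_n;     /* buffers on the aux free list (0 or one full group) */
};

struct global_layer {
    int free_n;    /* buffers held by the global layer */
    int accesses;  /* times the global layer (and its lock) was touched */
};

/* Take one buffer; returns 0 on success, -1 when nothing is left. */
static int pc_alloc(struct percpu_layer *pc, struct global_layer *g, int target)
{
    if (pc->main_n == 0) {
        /* main is empty: promote aux, then refill aux with one group */
        pc->main_n = pc->aux_n;
        pc->aux_n = 0;
        g->accesses++;
        if (g->free_n >= target) {
            g->free_n -= target;
            pc->aux_n = target;
        }
        if (pc->main_n == 0) {
            /* aux was empty too (cold start): use the fresh group now */
            pc->main_n = pc->aux_n;
            pc->aux_n = 0;
        }
        if (pc->main_n == 0)
            return -1;
    }
    pc->main_n--;
    return 0;
}

/* Give one buffer back. */
static void pc_free(struct percpu_layer *pc, struct global_layer *g, int target)
{
    if (pc->main_n >= target) {
        /* main is full: return aux's group to global, demote main to aux */
        if (pc->aux_n > 0) {
            g->accesses++;
            g->free_n += pc->aux_n;
        }
        pc->aux_n = pc->main_n;
        pc->main_n = 0;
    }
    pc->main_n++;
}
```

With target = 3 and 30 buffers in the global layer, nine allocations followed by nine releases touch the global layer four times in eighteen operations, and no buffer is lost along the way. Raising target lowers the access count further at the cost of more buffers parked per CPU, which is the tradeoff the passage names.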
                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

From owner-freebsd-arch Mon Oct 30 23:32:15 2000
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id 08C8837B479; Mon, 30 Oct 2000 23:32:12 -0800 (PST)
Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id e9V7VUJ16831; Mon, 30 Oct 2000 23:31:30 -0800 (PST) (envelope-from dillon)
Date: Mon, 30 Oct 2000 23:31:30 -0800 (PST)
From: Matt Dillon
Message-Id: <200010310731.e9V7VUJ16831@earth.backplane.com>
To: David Greenman
Cc: Bosko Milekic, freebsd-net@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG
Subject: Re: MP: per-CPU mbuf allocation lists
References: <200010301927.LAA01623@implode.root.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:> I recently wrote an initial "scratch pad" design for per-CPU mbuf
:> lists (in the MP case). The design consists simply of introducing
:> these "fast" lists for each CPU and populating them with mbufs on bootup.
:> Allocations from these lists would not need to be protected with a mutex
:> as each CPU has its own. The general mmbfree list remains, and remains
:> protected with a mutex, in case the per-CPU list is empty.
:
: I have only one question - is the lock overhead really so high that this
:is needed?
:
:-DG
:
:David Greenman

The biggest benefit to per-cpu allocation pools is L1-cache locality,
though being able to avoid locking can also be significant if you are
doing lots of small allocations.

However, the schemes that I see implemented today by Linux, for example,
and Dynix too (as Terry pointed out) are actually considerably more
complex than we really need.
If those schemes can reap a 95% cache hit I would argue that
implementing a simpler scheme that only reaps 85% might be better for us.

For example, to reap most of the benefit we could simply implement a
5-10 slot 'quick cache' (also known as a working-set cache) in
MALLOC()/FREE() and zalloc[i]()/zfree(). It is not necessary to keep
big per-cpu pools. With small per-cpu pools and hysteresis we reap most
of the benefits but don't have to deal with any of the garbage
collection or balancing issues. After seeing the hell the Linux folks
are going through, I'd much prefer to avoid having to deal with
balancing.

-Matt

From owner-freebsd-arch Tue Oct 31 2:52:38 2000
Delivered-To: freebsd-arch@freebsd.org
Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135]) by hub.freebsd.org (Postfix) with ESMTP id A1D6A37B4D7; Tue, 31 Oct 2000 02:52:31 -0800 (PST)
Received: (from daemon@localhost) by smtp05.primenet.com (8.9.3/8.9.3) id DAA25934; Tue, 31 Oct 2000 03:52:57 -0700 (MST)
Received: from usr02.primenet.com(206.165.6.202) via SMTP by smtp05.primenet.com, id smtpdAAArgayLY; Tue Oct 31 03:52:47 2000
Received: (from tlambert@localhost) by usr02.primenet.com (8.8.5/8.8.5) id DAA29909; Tue, 31 Oct 2000 03:52:09 -0700 (MST)
From: Terry Lambert
Message-Id: <200010311052.DAA29909@usr02.primenet.com>
Subject: Re: MP: per-CPU mbuf allocation lists
To: dillon@earth.backplane.com (Matt Dillon)
Date: Tue, 31 Oct 2000 10:52:08 +0000 (GMT)
Cc: dg@root.com (David Greenman), bmilekic@dsuper.net (Bosko Milekic), freebsd-net@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG
In-Reply-To: <200010310731.e9V7VUJ16831@earth.backplane.com> from "Matt Dillon" at Oct 30, 2000 11:31:30 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

>
The biggest benefit to per-cpu allocation pools is L1-cache locality,
> though being able to avoid locking can also be significant if you are
> doing lots of small allocations.

I think that the locking avoidance win is being grossly underrated.

Realize that a lock accessed by multiple processors will require
distributed coherency, and will either have to live in uncached memory,
or will have to take an update cycle worth of overhead. I would be
real interested in cycle counting in and out of the Big Giant Lock;
I think that it would be quite telling, even in the current
implementation.

The reason for the "four processors, and you hit the point of
diminishing returns" myth is architecture. The limit you are hitting
when you hit "the point of diminishing returns" is really where your
software architecture falls over dead. For Sequent, this was 32, and
now it's up to 64, with the new NUMA-Q system (which, since it runs on
Monterey, can run Linux programs, BTW). The limit for SVR4, which had a
slab allocator and a single contended lock (being largely derived from
the Solaris code of the day) was 4.

> However, the schemes that I see implemented today by Linux, for example,
> and Dynix too (as Terry pointed out) are actually considerably more
> complex than we really need. If those schemes can reap a 95% cache
> hit I would argue that implementing a simpler scheme that only reaps
> 85% might be better for us.

The Linux stuff is pretty baroque, from what I've seen of it; they
are pretty much doomed because of some early decisions in their VM
system, which they are now stuck with, if they want to maintain their
level of platform support without a serious rethink.

> For example, to reap most of the benefit we could simply implement a
> 5-10 slot 'quick cache' (also known as a working-set cache) in
> MALLOC()/FREE() and zalloc[i]()/zfree(). It is not necessary to keep
> big per-cpu pools.
With small per-cpu pools and hysteresis we reap most
> of the benefits but don't have to deal with any of the garbage collection
> or balancing issues. After seeing the hell the Linux folks are going
> through, I'd much prefer to avoid having to deal with balancing.

The example from the Dynix allocator was "three elements per list, a
running allocation of 4-6 elements max" (this was with a target of 3).

It's really trivial to implement this:

/* something to cast about; allocations can be no smaller than this */
struct thing {
    struct thing *next;         /* linked list of 'target' */
    struct thing *tail;         /* tail of 'target' we are member of */
};

/* our system-wide freelist */
struct global_layer {
    mutex_t      gl_lock;       /* global layer lock */
    struct thing *global_list;  /* list of 'target' clusters */
    struct thing *bucket_list;  /* list of sub-'target' individuals */
    int          target;        /* our target size */
};

/* our per-CPU freelist */
struct per_cpu {
    struct thing *main_freelist;    /* unit pool */
    struct thing *aux_freelist;     /* cluster of unit pool */
    int          target;
};

The obvious, of course:

struct myobject *
alloc_from_per_cpu()
{
    struct thing *rv;

    rv = percpup->main_freelist;
    if( rv->next) {
        percpup->main_freelist = rv->next;
    } else {
        /*
         * Note: we could just mark this, and deal with it
         * later (e.g. if we are in an interrupt handler).
         */
        percpup->main_freelist = percpup->aux_freelist;
        LOCK(globalp->gl_lock);
        if( ( percpup->aux_freelist = globalp->global_list) == NULL) {
            /* we really are out of memory! */
            ...
        } else {
            /* unlink the first cluster; the next one follows its tail */
            globalp->global_list = percpup->aux_freelist->tail->next;
            /* terminate the cluster so it no longer runs into the global list */
            percpup->aux_freelist->tail->next = NULL;
            percpup->aux_freelist->tail = NULL;
        }
        UNLOCK(globalp->gl_lock);
    }
    /* maybe do some initialization, if needed; probably in caller... */
    return( (struct myobject *)rv);
}

Deallocation goes back to the main_freelist. When there are target
items already on the list, the list on aux_freelist is given back to
the system, as a cluster. Again, this could be delayed until the CPU
was idle, or whatever.
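For what it's worth, the deallocation path described in words above might look something like the following. It is a single-threaded sketch under assumptions, not the Dynix code and not a proposed patch: main_count and lock_count are hypothetical fields added for the illustration (lock_count stands in for taking gl_lock), and the tail is set with the plain linear traversal.

```c
#include <assert.h>
#include <stddef.h>

struct thing {
    struct thing *next;   /* next buffer on the list */
    struct thing *tail;   /* on a cluster head: last buffer of the cluster */
};

struct global_layer {
    struct thing *global_list;  /* clusters, chained through tail->next */
    int lock_count;             /* stand-in for gl_lock: counts acquisitions */
};

struct per_cpu {
    struct thing *main_freelist;
    struct thing *aux_freelist;
    int main_count;             /* assumption: length of main_freelist */
    int target;
};

static void
free_to_per_cpu(struct per_cpu *pc, struct global_layer *gl, struct thing *t)
{
    if (pc->main_count >= pc->target) {
        struct thing *head = pc->aux_freelist;

        if (head != NULL) {
            /* linear traversal to find and record the cluster's tail */
            struct thing *last = head;

            while (last->next != NULL)
                last = last->next;
            head->tail = last;

            gl->lock_count++;               /* LOCK(gl_lock) would go here */
            last->next = gl->global_list;   /* chain to the next cluster */
            gl->global_list = head;
            /* UNLOCK(gl_lock) would go here */
        }
        /* the full main list becomes the new aux cluster */
        pc->aux_freelist = pc->main_freelist;
        pc->main_freelist = NULL;
        pc->main_count = 0;
    }
    t->next = pc->main_freelist;
    t->tail = NULL;
    pc->main_freelist = t;
    pc->main_count++;
}
```

With a target of 3, the first six frees never touch the global layer; the seventh hands one three-element cluster back, so the lock is taken once for seven releases.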
The point is that the tail needs setting during the clustering
operation; right now, that's a linear traversal, though we could cheat,
and fill toward the tail, and pull off the head, which would let us
maintain the tail pointer at the head. The tradeoff is really based on
incremental vs. linear cost, and if we can really delay, so it's partly
dependent on the size of 'target'.

The coalescing of freed pages is a system task. Basically, the system
frees to a coalesce-to-page layer in the Dynix model; there's no reason
FreeBSD couldn't go on using the same model it uses now for doing this,
though it will have the same unbounded worst case performance FreeBSD
has today.

                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

From owner-freebsd-arch Tue Oct 31 8:49:34 2000
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id 8A2BD37B479; Tue, 31 Oct 2000 08:49:31 -0800 (PST)
Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id e9VGnOm18947; Tue, 31 Oct 2000 08:49:24 -0800 (PST) (envelope-from dillon)
Date: Tue, 31 Oct 2000 08:49:24 -0800 (PST)
From: Matt Dillon
Message-Id: <200010311649.e9VGnOm18947@earth.backplane.com>
To: Terry Lambert
Cc: dg@root.com (David Greenman), bmilekic@dsuper.net (Bosko Milekic), freebsd-net@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG
Subject: Re: MP: per-CPU mbuf allocation lists
References: <200010311052.DAA29909@usr02.primenet.com>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:
:It's really trivial to implement this:
:
:/* something to cast about; allocations can be no smaller than this*/
:struct thing {
:...
You've just described our zalloc[i]() memory subsystem with a few
simple changes.

-Matt

From owner-freebsd-arch Tue Oct 31 9:47:27 2000
Delivered-To: freebsd-arch@freebsd.org
Received: from rover.village.org (rover.village.org [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id D274137B4CF for ; Tue, 31 Oct 2000 09:47:17 -0800 (PST)
Received: from harmony.village.org (harmony.village.org [10.0.0.6]) by rover.village.org (8.11.0/8.11.0) with ESMTP id e9VHlFn25054 for ; Tue, 31 Oct 2000 10:47:15 -0700 (MST) (envelope-from imp@harmony.village.org)
Received: from harmony.village.org (localhost.village.org [127.0.0.1]) by harmony.village.org (8.9.3/8.8.3) with ESMTP id KAA80353 for ; Tue, 31 Oct 2000 10:47:14 -0700 (MST)
Message-Id: <200010311747.KAA80353@harmony.village.org>
To: arch@freebsd.org
Subject: Like to commit my diskprep
Date: Tue, 31 Oct 2000 10:47:14 -0700
From: Warner Losh
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

I'd like to commit my diskprep program and go fix all the cross
references in the man pages to include a pointer to it. The current
diskprep code is in

    http://people.freebsd.org/~imp/diskprep

I need to write a man page for it as well. However, I've had hundreds
of people report to me that this just works for them. This is on a
wide variety of versions from 4.0-stable through current. It requires
no kernel changes and writes out both the MBR and FreeBSD disk label so
modern BIOSes don't get confused.

usage:
    diskprep short-dev-name

where short-dev-name is something like ad8 or da4.

I'd put it in /usr/sbin rather than /sbin because it does require perl.
I'd also make it build only on i386 at this time because the alpha port
doesn't use MBR disks as the SRM consoles don't use MBR disks.
Other ports whose boot loader needs to use an MBR disk would likely
need to have it built for them (the ia64 and the mythical k64 come to
mind).

I did this as a perl script because disklabel and fdisk were hard to
use together and Bruce strongly objected to my changing disklabel. He
didn't want to mix the various layers of disk labeling into disklabel
itself.

Before anyone asks, the biggest difference between my diskprep and
Matt's recent changes are that diskprep doesn't introduce a new api
into the kernel and doesn't pollute disklabel with functions it
traditionally hasn't done. Matt's changes put functionality into
edisklabel and the kernel.

Warner

From owner-freebsd-arch Tue Oct 31 10: 1:51 2000
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id ABCB637B479 for ; Tue, 31 Oct 2000 10:01:48 -0800 (PST)
Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id e9VI1if19601; Tue, 31 Oct 2000 10:01:44 -0800 (PST) (envelope-from dillon)
Date: Tue, 31 Oct 2000 10:01:44 -0800 (PST)
From: Matt Dillon
Message-Id: <200010311801.e9VI1if19601@earth.backplane.com>
To: Warner Losh
Cc: arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
References: <200010311747.KAA80353@harmony.village.org>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

:I did this as a perl script because disklabel and fdisk were hard to
:use together and Bruce strongly objected to my changing disklabel. He
:didn't want to mix the various layers of disk labeling into disklabel
:itself.
:
:Before anyone asks, the biggest difference between my diskprep and
:Matt's recent changes are that diskprep doesn't introduce a new api
:into the kernel and doesn't pollute disklabel with functions it
:traditionally hasn't done.
Matt's changes put functionality into :edisklabel and the kernel. : :Warner I would welcome diskprep as a port, but it makes absolutely no sense to commit it to the main tree as a /usr/bin program when the functionality should properly be placed in the disklabel program. Why abandon disklabel when that's the program everyone already knows how to use, and when fixing it is so fraggin easy? -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 10:36:58 2000 Delivered-To: freebsd-arch@freebsd.org Received: from critter.freebsd.dk (flutter.freebsd.dk [212.242.40.147]) by hub.freebsd.org (Postfix) with ESMTP id 6411837B479 for ; Tue, 31 Oct 2000 10:36:56 -0800 (PST) Received: from critter (localhost [127.0.0.1]) by critter.freebsd.dk (8.11.1/8.9.3) with ESMTP id e9VIaou14194; Tue, 31 Oct 2000 19:36:50 +0100 (CET) (envelope-from phk@critter.freebsd.dk) To: Warner Losh Cc: arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep In-Reply-To: Your message of "Tue, 31 Oct 2000 10:47:14 MST." <200010311747.KAA80353@harmony.village.org> Date: Tue, 31 Oct 2000 19:36:50 +0100 Message-ID: <14192.973017410@critter> From: Poul-Henning Kamp Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG In message <200010311747.KAA80353@harmony.village.org>, Warner Losh writes: > >I'd like to commit my diskprep program and go fix all the cross >references in the man pages to include a pointer to it. The current >diskprep code is in > > http://people.freebsd.org/~imp/diskprep Yes! My only request is that you disallow the bogus diskprep ad0 syntax and require people to enter a full and decent path: diskprep /dev/ad0 -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 11:31:13 2000 Delivered-To: freebsd-arch@freebsd.org Received: from pike.osd.bsdi.com (pike.osd.bsdi.com [204.216.28.222]) by hub.freebsd.org (Postfix) with ESMTP id CD6E137B479 for ; Tue, 31 Oct 2000 11:31:11 -0800 (PST) Received: from laptop.baldwin.cx (john@dhcp241.osd.bsdi.com [204.216.28.241]) by pike.osd.bsdi.com (8.11.0/8.9.3) with ESMTP id e9VJUMf15024; Tue, 31 Oct 2000 11:30:22 -0800 (PST) (envelope-from jhb@FreeBSD.org) Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <200010311747.KAA80353@harmony.village.org> Date: Tue, 31 Oct 2000 11:31:23 -0800 (PST) From: John Baldwin To: Warner Losh Subject: RE: Like to commit my diskprep Cc: arch@FreeBSD.org Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On 31-Oct-00 Warner Losh wrote: > Before anyone asks, the biggest difference between my diskprep and > Matt's recent changes are that diskprep doesn't introduce a new api > into the kernel and doesn't pollute disklabel with functions it > traditionally hasn't done. Matt's changes put functionality into > edisklabel and the kernel. Actually, I would think that creating a virgin disklabel would be part of disklabel's job. After all, doesn't it make sense to use the disklabel program to create/edit disklabel's? > Warner -- John Baldwin -- http://www.FreeBSD.org/~jhb/ PGP Key: http://www.baldwin.cx/~john/pgpkey.asc "Power Users Use the Power to Serve!" 
- http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 11:33:51 2000 Delivered-To: freebsd-arch@freebsd.org Received: from pike.osd.bsdi.com (pike.osd.bsdi.com [204.216.28.222]) by hub.freebsd.org (Postfix) with ESMTP id A5CCB37B4C5 for ; Tue, 31 Oct 2000 11:33:48 -0800 (PST) Received: from laptop.baldwin.cx (john@dhcp241.osd.bsdi.com [204.216.28.241]) by pike.osd.bsdi.com (8.11.0/8.9.3) with ESMTP id e9VJUHf15019; Tue, 31 Oct 2000 11:30:17 -0800 (PST) (envelope-from jhb@FreeBSD.org) Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <14192.973017410@critter> Date: Tue, 31 Oct 2000 11:31:18 -0800 (PST) From: John Baldwin To: Poul-Henning Kamp Subject: Re: Like to commit my diskprep Cc: arch@FreeBSD.org, Warner Losh Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On 31-Oct-00 Poul-Henning Kamp wrote: > In message <200010311747.KAA80353@harmony.village.org>, Warner Losh writes: >> >>I'd like to commit my diskprep program and go fix all the cross >>references in the man pages to include a pointer to it. THe crrent >>diskprep code is in >> >> http://people.freebsd.org/~imp/diskprep > > Yes! > > My only request is that you disallow the bogus > > diskprep ad0 > > syntax and require people to enter a full and decent path: > > diskprep /dev/ad0 Why? Both fdisk(8) and disklabel(8) don't require the full path. Requiring it for diskprep would violate POLA IMO. -- John Baldwin -- http://www.FreeBSD.org/~jhb/ PGP Key: http://www.baldwin.cx/~john/pgpkey.asc "Power Users Use the Power to Serve!" 
- http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 11:39:42 2000 Delivered-To: freebsd-arch@freebsd.org Received: from critter.freebsd.dk (flutter.freebsd.dk [212.242.40.147]) by hub.freebsd.org (Postfix) with ESMTP id 1BED837B4C5; Tue, 31 Oct 2000 11:39:37 -0800 (PST) Received: from critter (localhost [127.0.0.1]) by critter.freebsd.dk (8.11.1/8.9.3) with ESMTP id e9VJdZu14677; Tue, 31 Oct 2000 20:39:35 +0100 (CET) (envelope-from phk@critter.freebsd.dk) To: John Baldwin Cc: arch@FreeBSD.ORG, Warner Losh Subject: Re: Like to commit my diskprep In-Reply-To: Your message of "Tue, 31 Oct 2000 11:31:18 PST." Date: Tue, 31 Oct 2000 20:39:35 +0100 Message-ID: <14675.973021175@critter> From: Poul-Henning Kamp Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG In message , John Baldwin writes: >> My only request is that you disallow the bogus >> >> diskprep ad0 >> >> syntax and require people to enter a full and decent path: >> >> diskprep /dev/ad0 > >Why? Both fdisk(8) and disklabel(8) don't require the full path. >Requiring it for diskprep would violate POLA IMO. For that to be consistent we would need to allow for example: mount ad0s1a /mnt That would be bogus. In unix we refer to devices by pathname, we should stick to that. The magic "Hmm, I'll try sticking /dev in front" DWIM code is bad IMO. I don't want us to propagate that mistake. Disks don't have their own private namespace, neither do tapes or ttys. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
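For readers following the argument, the two behaviours being compared can be sketched in C roughly as follows. This is only an illustration under assumptions: the function names is_full_device_path() and dwim_device_path() are hypothetical and are not taken from diskprep, fdisk(8) or disklabel(8), whose actual argument handling is not shown anywhere in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEV_PREFIX "/dev/"

/* Strict form: accept only a full pathname under /dev, e.g. "/dev/ad0". */
int
is_full_device_path(const char *arg)
{
	size_t plen = strlen(DEV_PREFIX);

	assert(arg != NULL);
	return (strncmp(arg, DEV_PREFIX, plen) == 0 && arg[plen] != '\0');
}

/*
 * DWIM form: if the argument has no leading '/', guess by prepending /dev/.
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
int
dwim_device_path(const char *arg, char *buf, size_t buflen)
{
	int n;

	assert(arg != NULL && buf != NULL);
	if (arg[0] == '/')
		n = snprintf(buf, buflen, "%s", arg);
	else
		n = snprintf(buf, buflen, DEV_PREFIX "%s", arg);
	return ((n < 0 || (size_t)n >= buflen) ? -1 : 0);
}
```

With the strict check a tool would reject "ad0" outright, while the DWIM helper silently turns it into "/dev/ad0". The disagreement in the thread is over which of those two a new tool should do.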
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 12:49:33 2000 Delivered-To: freebsd-arch@freebsd.org Received: from pike.osd.bsdi.com (pike.osd.bsdi.com [204.216.28.222]) by hub.freebsd.org (Postfix) with ESMTP id 22BAA37B4D7 for ; Tue, 31 Oct 2000 12:49:31 -0800 (PST) Received: from laptop.baldwin.cx (john@dhcp241.osd.bsdi.com [204.216.28.241]) by pike.osd.bsdi.com (8.11.0/8.9.3) with ESMTP id e9VKjuf18233; Tue, 31 Oct 2000 12:45:56 -0800 (PST) (envelope-from jhb@FreeBSD.org) Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <14675.973021175@critter> Date: Tue, 31 Oct 2000 12:46:57 -0800 (PST) From: John Baldwin To: Poul-Henning Kamp Subject: Re: Like to commit my diskprep Cc: Warner Losh , arch@FreeBSD.org Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On 31-Oct-00 Poul-Henning Kamp wrote: > In message , John Baldwin writes: > >>> My only request is that you disallow the bogus >>> >>> diskprep ad0 >>> >>> syntax and require people to enter a full and decent path: >>> >>> diskprep /dev/ad0 >> >>Why? Both fdisk(8) and disklabel(8) don't require the full path. >>Requiring it for diskprep would violate POLA IMO. > > For that to be consistent we would need to allow for example: > > mount ad0s1a /mnt > > That would be bogus. > > In unix we refer to devices by pathname, we should stick to that. > > The magic "Hmm, I'll try sticking /dev in front" DWIM code is > bad IMO. Fair enough. Should we then start changing our other tools to make the bogus form deprecated and warn the user with the intention of axeing it altogether in 6.0 or some such? -- John Baldwin -- http://www.FreeBSD.org/~jhb/ PGP Key: http://www.baldwin.cx/~john/pgpkey.asc "Power Users Use the Power to Serve!" 
- http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 12:57:43 2000 Delivered-To: freebsd-arch@freebsd.org Received: from critter.freebsd.dk (flutter.freebsd.dk [212.242.40.147]) by hub.freebsd.org (Postfix) with ESMTP id E283937B688; Tue, 31 Oct 2000 12:57:37 -0800 (PST) Received: from critter (localhost [127.0.0.1]) by critter.freebsd.dk (8.11.1/8.9.3) with ESMTP id e9VKvZu15118; Tue, 31 Oct 2000 21:57:35 +0100 (CET) (envelope-from phk@critter.freebsd.dk) To: John Baldwin Cc: Warner Losh , arch@FreeBSD.org Subject: Re: Like to commit my diskprep In-Reply-To: Your message of "Tue, 31 Oct 2000 12:46:57 PST." Date: Tue, 31 Oct 2000 21:57:35 +0100 Message-ID: <15116.973025855@critter> From: Poul-Henning Kamp Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG In message , John Baldwin writes: >Fair enough. Should we then start changing our other tools to >make the bogus form deprecated and warn the user with the intention >of axeing it altogether in 6.0 or some such? Something like that yes. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 13:29:59 2000 Delivered-To: freebsd-arch@freebsd.org Received: from dragon.nuxi.com (trang.nuxi.com [209.152.133.57]) by hub.freebsd.org (Postfix) with ESMTP id AD60837B4C5 for ; Tue, 31 Oct 2000 13:29:56 -0800 (PST) Received: (from obrien@localhost) by dragon.nuxi.com (8.9.3/8.9.1) id NAA28540; Tue, 31 Oct 2000 13:29:46 -0800 (PST) (envelope-from obrien) Date: Tue, 31 Oct 2000 13:29:46 -0800 From: "David O'Brien" To: Warner Losh Cc: arch@freebsd.org Subject: Re: Like to commit my diskprep Message-ID: <20001031132945.B28476@dragon.nuxi.com> Reply-To: obrien@freebsd.org References: <200010311747.KAA80353@harmony.village.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <200010311747.KAA80353@harmony.village.org>; from imp@village.org on Tue, Oct 31, 2000 at 10:47:14AM -0700 X-Operating-System: FreeBSD 5.0-CURRENT Organization: The NUXI BSD group X-Pgp-Rsa-Fingerprint: B7 4D 3E E9 11 39 5F A3 90 76 5D 69 58 D9 98 7A X-Pgp-Rsa-Keyid: 1024/34F9F9D5 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Tue, Oct 31, 2000 at 10:47:14AM -0700, Warner Losh wrote: > > I'd like to commit my diskprep program and go fix all the cross > references in the man pages to include a pointer to it. My reservation is do we need yet another official way to prepare disks? I'm thinking from a tech support stand point -- too many choice just causes confusion. Since Matt's bits do fix the basic problem, is diskprep OBE except maybe as a wrapper for fdisk & disklabel? 
-- -- David (obrien@FreeBSD.org) GNU is Not Unix / Linux Is Not UniX To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 14:20:15 2000 Delivered-To: freebsd-arch@freebsd.org Received: from field.videotron.net (field.videotron.net [205.151.222.108]) by hub.freebsd.org (Postfix) with ESMTP id 9796037B479 for ; Tue, 31 Oct 2000 14:20:14 -0800 (PST) Received: from modemcable213.3-201-24.mtl.mc.videotron.ca ([24.201.3.213]) by field.videotron.net (Sun Internet Mail Server sims.3.5.1999.12.14.10.29.p8) with ESMTP id <0G3B004N3FDP3C@field.videotron.net> for freebsd-arch@FreeBSD.ORG; Tue, 31 Oct 2000 17:20:13 -0500 (EST) Date: Tue, 31 Oct 2000 17:24:48 -0500 (EST) From: Bosko Milekic Subject: Re: MP: per-CPU mbuf allocation lists In-reply-to: <200010310731.e9V7VUJ16831@earth.backplane.com> X-Sender: bmilekic@jehovah.technokratis.com To: Matt Dillon Cc: freebsd-arch@FreeBSD.ORG Message-id: MIME-version: 1.0 Content-type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Mon, 30 Oct 2000, Matt Dillon wrote: > For example, to reap most of the benefit we could simply implement a > 5-10 slot 'quick cache' (also known as a working-set cache) in > MALLOC()/FREE() and zalloc[i]()/zfree(). It is not necessary to keep > big per-cpu pools. With small per-cpu pools and hystersis we reap most > of the benefits but don't have to deal with any of the garbage collection > or balancing issues. After seeing the hell the Linux folks are going > through, I'd much prefer to avoid having to deal with balancing. > > -Matt So, anyone planning to do some MALLOC() optimizing? 
:-) Cheers, Bosko Milekic bmilekic@technokratis.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 14:25:54 2000 Delivered-To: freebsd-arch@freebsd.org Received: from berserker.bsdi.com (berserker.twistedbit.com [199.79.183.1]) by hub.freebsd.org (Postfix) with ESMTP id 74FDF37B4C5 for ; Tue, 31 Oct 2000 14:25:51 -0800 (PST) Received: from berserker.bsdi.com (cp@localhost.bsdi.com [127.0.0.1]) by berserker.bsdi.com (8.9.3/8.9.3) with ESMTP id PAA04504; Tue, 31 Oct 2000 15:25:33 -0700 (MST) (envelope-from cp@berserker.bsdi.com) Message-Id: <200010312225.PAA04504@berserker.bsdi.com> To: Bosko Milekic Cc: Matt Dillon , freebsd-arch@FreeBSD.ORG Subject: Re: MP: per-CPU mbuf allocation lists In-reply-to: Your message of "Tue, 31 Oct 2000 17:24:48 EST." From: Chuck Paterson Date: Tue, 31 Oct 2000 15:25:33 -0700 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG I would really really like to encourage anyone who wants to do this type of work to please first help get more stuff out from under Giant so we can start getting this thing to be act more like a SMP system and less like a MP system that can't take interrupts in the kernel. Chuck } } So, anyone planning to do some MALLOC() optimizing? 
:-) } } Cheers, } Bosko Milekic } bmilekic@technokratis.com } To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 14:39:56 2000 Delivered-To: freebsd-arch@freebsd.org Received: from gw.nectar.com (gw.nectar.com [208.42.49.153]) by hub.freebsd.org (Postfix) with ESMTP id C849F37B4C5; Tue, 31 Oct 2000 14:39:54 -0800 (PST) Received: from hamlet.nectar.com (hamlet.nectar.com [10.0.1.102]) by gw.nectar.com (Postfix) with ESMTP id 2984E192A7; Tue, 31 Oct 2000 16:39:54 -0600 (CST) Received: (from nectar@localhost) by hamlet.nectar.com (8.11.1/8.9.3) id e9VMdsd19032; Tue, 31 Oct 2000 16:39:54 -0600 (CST) (envelope-from nectar@spawn.nectar.com) Date: Tue, 31 Oct 2000 16:39:54 -0600 From: "Jacques A. Vidrine" To: Poul-Henning Kamp Cc: John Baldwin , arch@FreeBSD.ORG, Warner Losh Subject: Re: Like to commit my diskprep Message-ID: <20001031163953.B18974@hamlet.nectar.com> References: <14675.973021175@critter> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <14675.973021175@critter>; from phk@critter.freebsd.dk on Tue, Oct 31, 2000 at 08:39:35PM +0100 X-Url: http://www.nectar.com/ Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Tue, Oct 31, 2000 at 08:39:35PM +0100, Poul-Henning Kamp wrote: > The magic "Hmm, I'll try sticking /dev in front" DWIM code is > bad IMO. I always had the impression that the utilities that took a bare name like da0s1 did so in part to save the user from figuring out whether /dev/da0s1 or /dev/rda0s1 was appropriate. That distinction is gone now. 
-- Jacques Vidrine / n@nectar.com / jvidrine@verio.net / nectar@FreeBSD.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 14:40: 9 2000 Delivered-To: freebsd-arch@freebsd.org Received: from critter.freebsd.dk (flutter.freebsd.dk [212.242.40.147]) by hub.freebsd.org (Postfix) with ESMTP id 2DF9337B663 for ; Tue, 31 Oct 2000 14:40:06 -0800 (PST) Received: from critter (localhost [127.0.0.1]) by critter.freebsd.dk (8.11.1/8.9.3) with ESMTP id e9VMdmu15780; Tue, 31 Oct 2000 23:39:48 +0100 (CET) (envelope-from phk@critter.freebsd.dk) To: Chuck Paterson Cc: Bosko Milekic , Matt Dillon , freebsd-arch@FreeBSD.ORG Subject: Re: MP: per-CPU mbuf allocation lists In-Reply-To: Your message of "Tue, 31 Oct 2000 15:25:33 MST." <200010312225.PAA04504@berserker.bsdi.com> Date: Tue, 31 Oct 2000 23:39:48 +0100 Message-ID: <15778.973031988@critter> From: Poul-Henning Kamp Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG In message <200010312225.PAA04504@berserker.bsdi.com>, Chuck Paterson writes: > I would really really like to encourage anyone who wants >to do this type of work to please first help get more stuff out >from under Giant so we can start getting this thing to be act >more like a SMP system and less like a MP system that can't take >interrupts in the kernel. I would tend to agree. This is not the time for line by line optimizations based on generated assembler code. This is the time to make sure things work, and to get them to work right. *then* we will optimize the hell out of it :-) -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk@FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 14:55:43 2000 Delivered-To: freebsd-arch@freebsd.org Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134]) by hub.freebsd.org (Postfix) with ESMTP id 5FAA337B4C5; Tue, 31 Oct 2000 14:55:40 -0800 (PST) Received: (from daemon@localhost) by smtp04.primenet.com (8.9.3/8.9.3) id PAA23030; Tue, 31 Oct 2000 15:52:20 -0700 (MST) Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp04.primenet.com, id smtpdAAAlXayRS; Tue Oct 31 15:52:09 2000 Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id PAA17563; Tue, 31 Oct 2000 15:55:12 -0700 (MST) From: Terry Lambert Message-Id: <200010312255.PAA17563@usr09.primenet.com> Subject: Re: MP: per-CPU mbuf allocation lists To: dillon@earth.backplane.com (Matt Dillon) Date: Tue, 31 Oct 2000 22:55:11 +0000 (GMT) Cc: tlambert@primenet.com (Terry Lambert), dg@root.com (David Greenman), bmilekic@dsuper.net (Bosko Milekic), freebsd-net@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG In-Reply-To: <200010311649.e9VGnOm18947@earth.backplane.com> from "Matt Dillon" at Oct 31, 2000 08:49:24 AM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > :It's really trivial to implement this: > : > :/* something to cast about; allocations can be no smaller than this*/ > :struct thing { > :... > > You've just described our zalloc[i]() memory subsystem with a few simple > changes. I know. I told you it was trivial... Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. 
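The struct quoted above is elided in the original and is not reconstructed here. As a separate illustration of the idea discussed earlier in this thread (a small per-CPU "quick cache" of 5-10 slots in front of the shared allocator), a minimal user-space sketch might look like the following. Everything in it is an assumption: the names qc_alloc() and qc_free(), the slot count, the CPU count and the object size are hypothetical, malloc() merely stands in for the locked global allocator, and the hysteresis mentioned in the thread is left out. It is not the zalloc[i]() implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define QC_SLOTS 8	/* within the "5-10 slot" range; 8 is arbitrary */
#define QC_NCPU  4	/* hypothetical CPU count for the sketch */
#define QC_OBJSZ 256	/* hypothetical fixed object size */

struct quick_cache {
	void *slot[QC_SLOTS];
	int   count;		/* number of cached objects in slot[] */
};

/* One small cache per CPU; zero-initialized, so every cache starts empty. */
struct quick_cache qc_percpu[QC_NCPU];

/* Take an object from this CPU's cache, or fall back to the shared pool. */
void *
qc_alloc(int cpu)
{
	struct quick_cache *qc;

	assert(cpu >= 0 && cpu < QC_NCPU);
	qc = &qc_percpu[cpu];
	if (qc->count > 0)
		return (qc->slot[--qc->count]);
	return (malloc(QC_OBJSZ));	/* stand-in for the global allocator */
}

/* Return an object to this CPU's cache; overflow goes to the shared pool. */
void
qc_free(int cpu, void *obj)
{
	struct quick_cache *qc;

	assert(cpu >= 0 && cpu < QC_NCPU);
	qc = &qc_percpu[cpu];
	if (qc->count < QC_SLOTS)
		qc->slot[qc->count++] = obj;
	else
		free(obj);
}
```

The sketch takes no lock on the per-CPU path, which is only safe if the caller cannot be interrupted or migrated to another CPU between reading and updating count. Whether that holds in the kernel is a question this sketch does not answer.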
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 15:15:32 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail.wgate.com (mail.wgate.com [38.219.83.4]) by hub.freebsd.org (Postfix) with ESMTP id 81D0837B4C5; Tue, 31 Oct 2000 15:15:28 -0800 (PST) Received: from jesup.eng.tvol.net ([10.32.2.26]) by mail.wgate.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id VT2YBCWW; Tue, 31 Oct 2000 18:15:29 -0500 Reply-To: Randell Jesup To: John Baldwin Cc: Warner Losh , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: From: Randell Jesup Date: 31 Oct 2000 18:19:04 -0500 In-Reply-To: John Baldwin's message of "Tue, 31 Oct 2000 11:31:23 -0800 (PST)" Message-ID: User-Agent: Gnus/5.0807 (Gnus v5.8.7) Emacs/20.7 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG John Baldwin writes: >> Before anyone asks, the biggest difference between my diskprep and >> Matt's recent changes are that diskprep doesn't introduce a new api >> into the kernel and doesn't pollute disklabel with functions it >> traditionally hasn't done. Matt's changes put functionality into >> edisklabel and the kernel. > >Actually, I would think that creating a virgin disklabel would be >part of disklabel's job. After all, doesn't it make sense to use >the disklabel program to create/edit disklabel's? Yes. And it should. Right now disklabel (and many of the other disk tools) are basically dusty decks - many of their defaults date back several eons. Look at the default partitioning that's set up (the values in /etc/disktab and the way it's used), etc. The last time an HD was added to disktab was 1993 and it was a 100MB Conner. The documentation for fdisk hasn't changed since 1996. 
Sysinstall sets the default newfs args to "-b 8192 -f 1024", even though there's a CVS note from 1999 that says The default is still "-b 8192 -f 1024" but my experiments show that "-b 16384 -f 4096 -c 100" is a more sensible value for modern disksizes. -b 8192 and -f 1024 (and -c 16) are still the defaults in newfs. You get my drift - most of these defaults are wildly out-of-whack for even vaguely semi-modern hardware. Admins need to tune all sorts of things in order to get good setups - the defaults are insane. Fsck has all sorts of brain damage when it comes to certain types of corruption requiring someone to intimately understand ufs inode structures and on top of that how to edit them in-place with crufty tools. I don't feel we should just add more and more layers on top of badly out-of-date tools. Fix the tools. Update the defaults. Add options that are needed. Don't create Yet Another Semi-known Way to do something. We have enough of those already. IMHO -- Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94) rjesup@wgate.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 15:44:28 2000 Delivered-To: freebsd-arch@freebsd.org Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134]) by hub.freebsd.org (Postfix) with ESMTP id 65D7937B4C5 for ; Tue, 31 Oct 2000 15:44:26 -0800 (PST) Received: (from daemon@localhost) by smtp04.primenet.com (8.9.3/8.9.3) id QAA15727; Tue, 31 Oct 2000 16:41:11 -0700 (MST) Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp04.primenet.com, id smtpdAAAIhayBE; Tue Oct 31 16:40:59 2000 Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id QAA18978; Tue, 31 Oct 2000 16:44:07 -0700 (MST) From: Terry Lambert Message-Id: <200010312344.QAA18978@usr09.primenet.com> Subject: Re: MP: per-CPU mbuf allocation lists To: cp@bsdi.com (Chuck Paterson) Date: Tue, 31 Oct 2000 23:44:06 
+0000 (GMT) Cc: bmilekic@dsuper.net (Bosko Milekic), dillon@earth.backplane.com (Matt Dillon), freebsd-arch@FreeBSD.ORG In-Reply-To: <200010312225.PAA04504@berserker.bsdi.com> from "Chuck Paterson" at Oct 31, 2000 03:25:33 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > I would really really like to encourage anyone who wants > to do this type of work to please first help get more stuff out > from under Giant so we can start getting this thing to be act > more like a SMP system and less like a MP system that can't take > interrupts in the kernel. I still have grave reservations about running Intel hardware in virtual wire mode, and/or using kernel threads to process the interrupts, as a means of getting more than one processor into the kernel at a time. No BSD or Linux approach demonstrated so far has been able to compete successfully with the NT approach of wiring network card interrupts to particular processors. I will once again point out the paper that discusses the communications latency problem in using kernel threads, and their concommitant introduction of a communication delay. Multiprocessor Scheduling with Communication Delays B. Veltman, B. J. Lageweg, J. K. Lenstra Parallel Computing, Volume 16, 1990, pages 173-182 -- On the other hand, we know that significant concurrency can be achieved, even with a single Big Giant Lock, by removing resources from the conflict domain, rather than moving them to private conflict domains. Per CPU resources simply do not need locking or mutexes or atomic_t or similar protection: they are inherently MP-safe. Sequent was able to run 32 processors at nearly full tilt, and they _had_ a Big Giant Lock. The push-down emphasis has to be on MP-safety, not on synchronization, except where it's absolutely necessary. 
I think the question we ought to be asking ourselves is what _can't_ be moved from the global conflict domain into the per CPU domain. I also think it's silly to object to people like Alfred picking low hanging fruit in the networking code, merely because it's low hanging: at least picking it gets it the heck out of our way. Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 15:48:57 2000 Delivered-To: freebsd-arch@freebsd.org Received: from berserker.bsdi.com (berserker.twistedbit.com [199.79.183.1]) by hub.freebsd.org (Postfix) with ESMTP id ED75637B479 for ; Tue, 31 Oct 2000 15:48:53 -0800 (PST) Received: from berserker.bsdi.com (cp@localhost.bsdi.com [127.0.0.1]) by berserker.bsdi.com (8.9.3/8.9.3) with ESMTP id QAA05579; Tue, 31 Oct 2000 16:48:34 -0700 (MST) (envelope-from cp@berserker.bsdi.com) Message-Id: <200010312348.QAA05579@berserker.bsdi.com> To: Terry Lambert Cc: bmilekic@dsuper.net (Bosko Milekic), dillon@earth.backplane.com (Matt Dillon), freebsd-arch@FreeBSD.ORG Subject: Re: MP: per-CPU mbuf allocation lists In-reply-to: Your message of "Tue, 31 Oct 2000 23:44:06 GMT." <200010312344.QAA18978@usr09.primenet.com> From: Chuck Paterson Date: Tue, 31 Oct 2000 16:48:34 -0700 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG I don't think we are going to be running the processors in virtual wire mode and we won't be really using kernel threads once we get the light weight context switch code going, they are just threads if a lock blocks. As to wiring an interrupt to a given processor, there is nothing in the general implementation that prohibits this. If you can steer the interrupt to a given processor it can be handled on that processor. 
Chuck Terry Lambert wrote on: Tue, 31 Oct 2000 23:44:06 GMT }> I would really really like to encourage anyone who wants }> to do this type of work to please first help get more stuff out }> from under Giant so we can start getting this thing to be act }> more like a SMP system and less like a MP system that can't take }> interrupts in the kernel. } }I still have grave reservations about running Intel hardware in }virtual wire mode, and/or using kernel threads to process the }interrupts, as a means of getting more than one processor into }the kernel at a time. No BSD or Linux approach demonstrated }so far has been able to compete successfully with the NT approach }of wiring network card interrupts to particular processors. } }I will once again point out the paper that discusses the }communications latency problem in using kernel threads, }and their concommitant introduction of a communication delay. } }Multiprocessor Scheduling with Communication Delays }B. Veltman, B. J. Lageweg, J. K. Lenstra }Parallel Computing, Volume 16, 1990, pages 173-182 } }-- } }On the other hand, we know that significant concurrency can }be achieved, even with a single Big Giant Lock, by removing }resources from the conflict domain, rather than moving them }to private conflict domains. Per CPU resources simply do not }need locking or mutexes or atomic_t or similar protection: }they are inherently MP-safe. } }Sequent was able to run 32 processors at nearly full tilt, }and they _had_ a Big Giant Lock. The push-down emphasis has }to be on MP-safety, not on synchronization, except where it's }absolutely necessary. } }I think the question we ought to be asking ourselves is what }_can't_ be moved from the global conflict domain into the }per CPU domain. I also think it's silly to object to people }like Alfred picking low hanging fruit in the networking code, }merely because it's low hanging: at least picking it gets it }the heck out of our way. 
} } } Terry Lambert } terry@lambert.org }--- }Any opinions in this posting are my own and not those of my present }or previous employers. } } }To Unsubscribe: send mail to majordomo@FreeBSD.org }with "unsubscribe freebsd-arch" in the body of the message To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 15:53:24 2000 Delivered-To: freebsd-arch@freebsd.org Received: from falla.videotron.net (falla.videotron.net [205.151.222.106]) by hub.freebsd.org (Postfix) with ESMTP id 95A4D37B4D7 for ; Tue, 31 Oct 2000 15:53:17 -0800 (PST) Received: from modemcable213.3-201-24.mtl.mc.videotron.ca ([24.201.3.213]) by falla.videotron.net (Sun Internet Mail Server sims.3.5.1999.12.14.10.29.p8) with ESMTP id <0G3B00743JOROY@falla.videotron.net> for freebsd-arch@FreeBSD.ORG; Tue, 31 Oct 2000 18:53:15 -0500 (EST) Date: Tue, 31 Oct 2000 18:57:50 -0500 (EST) From: Bosko Milekic Subject: Re: MP: per-CPU mbuf allocation lists In-reply-to: <200010312344.QAA18978@usr09.primenet.com> X-Sender: bmilekic@jehovah.technokratis.com To: Terry Lambert Cc: freebsd-arch@FreeBSD.ORG Message-id: MIME-version: 1.0 Content-type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Tue, 31 Oct 2000, Terry Lambert wrote: > On the other hand, we know that significant concurrency can > be achieved, even with a single Big Giant Lock, by removing > resources from the conflict domain, rather than moving them > to private conflict domains. Per CPU resources simply do not > need locking or mutexes or atomic_t or similar protection: > they are inherently MP-safe. Is this 100% accurate? Don't we still need to protect even the per-CPU lists with a lock just in case we get an interrupt and get rescheduled because of a higher priority thread that wants execution? Is this possible? 
If it isn't the case, then ignore the question, but if it is, I agree that it still makes sense to have per-CPU resources available, just because the lock contention is minimized. > > Terry Lambert > terry@lambert.org > --- > Any opinions in this posting are my own and not those of my present > or previous employers. Bosko Milekic bmilekic@technokratis.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 15:57:16 2000 Delivered-To: freebsd-arch@freebsd.org Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134]) by hub.freebsd.org (Postfix) with ESMTP id 37D0237B4C5; Tue, 31 Oct 2000 15:57:13 -0800 (PST) Received: (from daemon@localhost) by smtp04.primenet.com (8.9.3/8.9.3) id QAA22202; Tue, 31 Oct 2000 16:53:56 -0700 (MST) Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp04.primenet.com, id smtpdAAAvlaioR; Tue Oct 31 16:53:45 2000 Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id QAA19282; Tue, 31 Oct 2000 16:56:57 -0700 (MST) From: Terry Lambert Message-Id: <200010312356.QAA19282@usr09.primenet.com> Subject: Re: Like to commit my diskprep To: obrien@FreeBSD.ORG Date: Tue, 31 Oct 2000 23:56:57 +0000 (GMT) Cc: imp@village.org (Warner Losh), arch@FreeBSD.ORG In-Reply-To: <20001031132945.B28476@dragon.nuxi.com> from "David O'Brien" at Oct 31, 2000 01:29:46 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > > I'd like to commit my diskprep program and go fix all the cross > > references in the man pages to include a pointer to it. > > My reservation is do we need yet another official way to prepare disks? > I'm thinking from a tech support standpoint -- too many choices just > cause confusion.
>
> Since Matt's bits do fix the basic problem, is diskprep OBE except maybe
> as a wrapper for fdisk & disklabel?

You mean "yet another official way", like separate fdisk and
disklabel utilities?

Isn't the following obvious to everyone:

1)      fdisk and disklabel are tools for dividing a disk into N
        extents.

2)      The only difference a user gives a damn about is that for
        one of these tools, N=4, and for the other, N!=4, and the
        user doesn't even care about that, if you were to
        parameterize it.

3)      The kernel has to know about these structures in scope, or
        it couldn't read the things to externalize devices in the
        first place.

4)      It's pretty easy to just eyeball the interfaces, and come
        up with an abstract ioctl() that can read and write any
        partitioning scheme known to the kernel, by scheme ID.

5)      It'd be pretty easy to pull a list of permissible schemes
        out of the kernel by pulling it back across the user/kernel
        boundary, from the kernel code to which the schemes must
        be known anyway.

So why isn't there just one tool in user space that ioctl()'s
down to see what's allowed, ioctl()'s down to read what's there,
and ioctl()'s down to write out and/or delete stuff -- a single
program that is nothing more than a shell for doing ioctl()'s,
and knows how to do everything the kernel knows how to do?

To hell with disklabel and all these other nasty programs that
can get out of sync with the kernel's idea of the on disk
partitioning layout, and confuse the user with "it's a slice!",
"no! It's a partition!", "no, it's a DOS extended partition,
which is a slice because we used the name ``partition'' before
1981!".

Ugh.


                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
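
Points 1) through 5) above can be sketched roughly as follows. This is an illustration only, under loud assumptions: every name in it (`part_ioctl`, the `PARTIOC_*` commands, the `part_*` structures, the scheme IDs) is hypothetical and does not exist in FreeBSD, and the "kernel" is a small in-process stand-in so that the sketch runs in user space. The per-scheme limits (4 extents for the MBR, 8 for a BSD label) follow the N=4 versus N!=4 remark.

```c
#include <string.h>

/* HYPOTHETICAL interface: none of these names exist in FreeBSD. */
#define PART_MAXEXT 8

enum part_cmd { PARTIOC_GSCHEMES, PARTIOC_GTABLE, PARTIOC_STABLE };

struct part_scheme  { int id; int max_extents; const char *name; };
struct part_schemes { int count; struct part_scheme scheme[4]; };
struct part_extent  { unsigned long start, size; };      /* in sectors */
struct part_table {
    int scheme_id;
    int nextents;
    struct part_extent ext[PART_MAXEXT];
};

/* Stand-in for the kernel side: the schemes it knows, and the "disk". */
static const struct part_scheme known[] = {
    { 1, 4, "mbr" },        /* N = 4  */
    { 2, 8, "bsdlabel" },   /* N != 4 */
};
#define NKNOWN ((int)(sizeof(known) / sizeof(known[0])))
static struct part_table on_disk[NKNOWN];

static int scheme_index(int id)
{
    for (int i = 0; i < NKNOWN; i++)
        if (known[i].id == id)
            return i;
    return -1;
}

/* One abstract entry point: list schemes, read a table, write a table. */
int part_ioctl(enum part_cmd cmd, void *arg)
{
    int i;

    switch (cmd) {
    case PARTIOC_GSCHEMES: {
        struct part_schemes *s = arg;
        s->count = NKNOWN;
        memcpy(s->scheme, known, sizeof(known));
        return 0;
    }
    case PARTIOC_GTABLE: {
        struct part_table *t = arg;
        if ((i = scheme_index(t->scheme_id)) < 0)
            return -1;
        *t = on_disk[i];
        t->scheme_id = known[i].id;
        return 0;
    }
    case PARTIOC_STABLE: {
        const struct part_table *t = arg;
        if ((i = scheme_index(t->scheme_id)) < 0 ||
            t->nextents < 0 || t->nextents > known[i].max_extents)
            return -1;      /* the kernel, not the tool, enforces N */
        on_disk[i] = *t;
        return 0;
    }
    }
    return -1;
}

/* The single user-space tool: read, modify, write back, for any scheme. */
int part_add(int scheme_id, unsigned long start, unsigned long size)
{
    struct part_table t;

    memset(&t, 0, sizeof(t));
    t.scheme_id = scheme_id;
    if (part_ioctl(PARTIOC_GTABLE, &t) != 0 || t.nextents >= PART_MAXEXT)
        return -1;
    t.ext[t.nextents].start = start;
    t.ext[t.nextents].size = size;
    t.nextents++;
    return part_ioctl(PARTIOC_STABLE, &t);
}
```

In this sketch `part_add()` contains no scheme-specific knowledge: a fifth extent is rejected for the 4-extent scheme and accepted for the 8-extent one purely because the (simulated) kernel side says so, which is the property argued for above.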
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 16:11: 6 2000 Delivered-To: freebsd-arch@freebsd.org Received: from smtp05.primenet.com (smtp05.primenet.com [206.165.6.135]) by hub.freebsd.org (Postfix) with ESMTP id 6051337B479 for ; Tue, 31 Oct 2000 16:11:04 -0800 (PST) Received: (from daemon@localhost) by smtp05.primenet.com (8.9.3/8.9.3) id RAA02202; Tue, 31 Oct 2000 17:11:32 -0700 (MST) Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp05.primenet.com, id smtpdAAAFmaOke; Tue Oct 31 17:11:19 2000 Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id RAA19817; Tue, 31 Oct 2000 17:10:47 -0700 (MST) From: Terry Lambert Message-Id: <200011010010.RAA19817@usr09.primenet.com> Subject: Re: MP: per-CPU mbuf allocation lists To: bmilekic@dsuper.net (Bosko Milekic) Date: Wed, 1 Nov 2000 00:10:47 +0000 (GMT) Cc: tlambert@primenet.com (Terry Lambert), freebsd-arch@FreeBSD.ORG In-Reply-To: from "Bosko Milekic" at Oct 31, 2000 06:57:50 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > > On the other hand, we know that significant concurrency can > > be achieved, even with a single Big Giant Lock, by removing > > resources from the conflict domain, rather than moving them > > to private conflict domains. Per CPU resources simply do not > > need locking or mutexes or atomic_t or similar protection: > > they are inherently MP-safe. > > Is this 100% accurate? Don't we still need to protect even the > per-CPU lists with a lock just in case we get an interrupt and get > rescheduled because of a higher priority thread that wants execution? Is > this possible? Are you assuming kernel threads with kernel preemption in interrupt context? If so, why? 
Kernel threads make interrupt processing
times non-deterministic, so it's not like you could build a hard
real time system with them in there: HRT needs determinism, or you
can't do scheduling or deadlining.

If you don't have HRT requirements, why do you need kernel
preemption?

You also never do allocations while in interrupt context (unless
you're SVR4).

I really don't think that you can end up with self-contended
resources.

>  If it isn't the case, then ignore the question, but if it is, I agree
>  that it still makes sense to have per-CPU resources available, just
>  because the lock contention is minimized.

It's not just minimized, it's totally eliminated, except in
the cases where you have to go back to the well and drain or
fill a per CPU pool.  For that, you can grab the BGL.

The only contended resources are things which are intrinsically
stuck in the global space, like the directory vnode for /.  For
those, you mutex and reference count, as necessary (Dynix FS's
were non-reentrant for a long while; it was one of their biggest
warts: they didn't have the mutex capability on a per object
basis... that had to wait for SVR4 ES/MP [4.0.2]).


                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
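
The per-CPU pool scheme argued for above can be sketched roughly as follows. This is a user-space illustration only, not FreeBSD code, and all names (`pcpu_pool`, `buf_alloc`, `buf_free`, `BATCH`, `HIGH_WATER`, and so on) are hypothetical. It assumes the caller passes the index of the CPU it is running on and cannot be preempted or migrated while touching that CPU's list; a C11 `atomic_flag` spinlock named `giant` stands in for the Big Giant Lock and is taken only when a per-CPU list has to be filled from or drained to the global pool.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NCPU        4    /* hypothetical CPU count */
#define NBUF        64   /* buffers in the whole system */
#define BATCH       8    /* buffers moved per fill or drain */
#define HIGH_WATER  16   /* per-CPU length that triggers a drain */

struct buf { struct buf *next; char data[256]; };
struct pcpu_pool { struct buf *head; int count; };

static struct pcpu_pool pcpu[NCPU];     /* touched only by the owning CPU */
static struct buf *global_head;         /* shared, protected by giant */
static atomic_flag giant = ATOMIC_FLAG_INIT;   /* stand-in for the BGL */
static struct buf storage[NBUF];

static void giant_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&giant, memory_order_acquire))
        ;   /* spin */
}

static void giant_unlock(void)
{
    atomic_flag_clear_explicit(&giant, memory_order_release);
}

void pool_init(void)
{
    global_head = NULL;
    for (int i = 0; i < NCPU; i++) {
        pcpu[i].head = NULL;
        pcpu[i].count = 0;
    }
    for (int i = 0; i < NBUF; i++) {
        storage[i].next = global_head;
        global_head = &storage[i];
    }
}

/* "Go back to the well": the only paths that take the global lock. */
static void fill(struct pcpu_pool *p)
{
    giant_lock();
    for (int i = 0; i < BATCH && global_head != NULL; i++) {
        struct buf *b = global_head;
        global_head = b->next;
        b->next = p->head;
        p->head = b;
        p->count++;
    }
    giant_unlock();
}

static void drain(struct pcpu_pool *p)
{
    giant_lock();
    for (int i = 0; i < BATCH && p->head != NULL; i++) {
        struct buf *b = p->head;
        p->head = b->next;
        p->count--;
        b->next = global_head;
        global_head = b;
    }
    giant_unlock();
}

/* Common case: no lock at all, only this CPU's list is touched. */
struct buf *buf_alloc(int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    struct pcpu_pool *p = &pcpu[cpu];

    if (p->head == NULL)
        fill(p);
    struct buf *b = p->head;
    if (b != NULL) {
        p->head = b->next;
        p->count--;
    }
    return b;   /* NULL only if the global pool is exhausted too */
}

void buf_free(int cpu, struct buf *b)
{
    assert(cpu >= 0 && cpu < NCPU);
    struct pcpu_pool *p = &pcpu[cpu];

    b->next = p->head;
    p->head = b;
    p->count++;
    if (p->count > HIGH_WATER)
        drain(p);
}

int pcpu_count(int cpu)
{
    return pcpu[cpu].count;
}
```

If preemption on the owning CPU were possible after all, which is the case Bosko asks about, the unlocked fast path in `buf_alloc()` and `buf_free()` would need some per-CPU protection of its own; the sketch deliberately leaves that out.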
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 16:27: 5 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id BC0CD37B479 for ; Tue, 31 Oct 2000 16:27:02 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA10Qqg24508; Tue, 31 Oct 2000 16:26:52 -0800 (PST) (envelope-from dillon) Date: Tue, 31 Oct 2000 16:26:52 -0800 (PST) From: Matt Dillon Message-Id: <200011010026.eA10Qqg24508@earth.backplane.com> To: Terry Lambert Cc: bmilekic@dsuper.net (Bosko Milekic), tlambert@primenet.com (Terry Lambert), freebsd-arch@FreeBSD.ORG Subject: Re: MP: per-CPU mbuf allocation lists References: <200011010010.RAA19817@usr09.primenet.com> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG :> Is this 100% accurate? Don't we still need to protect even the :> per-CPU lists with a lock just in case we get an interrupt and get :> rescheduled because of a higher priority thread that wants execution? Is :> this possible? : :Are you assuming kernel threads with kernel preemption in interrupt :context? If so, why? Kernel threads make interrupt processing :times non-deterministic, so it's not like you could build a hard :real time system with them in there: HRT needs determinism, or you :can't do scheduling or deadlining. : :If you don't have HRT requirements, why do you need kernel :preemption? Keep in mind, guys, that the purpose is to avoid cache conflicts between cpu's. Having a per-cpu spinlock to protect a per-cpu list against preemption is perfectly acceptable, and very *VERY* fast not only because the cpu owns the cache line, but also because you don't have to bother with locked bus cycles (since the lock is per-cpu). 
-Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 16:27:12 2000 Delivered-To: freebsd-arch@freebsd.org Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20]) by hub.freebsd.org (Postfix) with ESMTP id D134837B4E5 for ; Tue, 31 Oct 2000 16:27:09 -0800 (PST) Received: (from bright@localhost) by fw.wintelcom.net (8.10.0/8.10.0) id eA10R4013441; Tue, 31 Oct 2000 16:27:04 -0800 (PST) Date: Tue, 31 Oct 2000 16:27:04 -0800 From: Alfred Perlstein To: Bosko Milekic Cc: Terry Lambert , freebsd-arch@FreeBSD.ORG Subject: Re: MP: per-CPU mbuf allocation lists Message-ID: <20001031162704.H22110@fw.wintelcom.net> References: <200010312344.QAA18978@usr09.primenet.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.4i In-Reply-To: ; from bmilekic@dsuper.net on Tue, Oct 31, 2000 at 06:57:50PM -0500 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG * Bosko Milekic [001031 15:53] wrote: > > On Tue, 31 Oct 2000, Terry Lambert wrote: > > > On the other hand, we know that significant concurrency can > > be achieved, even with a single Big Giant Lock, by removing > > resources from the conflict domain, rather than moving them > > to private conflict domains. Per CPU resources simply do not > > need locking or mutexes or atomic_t or similar protection: > > they are inherently MP-safe. > > Is this 100% accurate? Don't we still need to protect even the > per-CPU lists with a lock just in case we get an interrupt and get > rescheduled because of a higher priority thread that wants execution? Is > this possible? > If it isn't the case, then ignore the question, but if it is, I agree > that it still makes sense to have per-CPU resources available, just > because it the lock contention is minimized. 
Spinlocks that turn off interrupts on the CPU that has acquired the spinlock may be used for leaf locks held for a very short time. But that's getting way ahead of ourselves. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] "I have the heart of a child; I keep it in a jar on my desk." To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 19:32:25 2000 Delivered-To: freebsd-arch@freebsd.org Received: from rover.village.org (rover.village.org [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id CC06A37B4CF for ; Tue, 31 Oct 2000 19:32:22 -0800 (PST) Received: from billy-club.village.org (billy-club.village.org [10.0.0.3]) by rover.village.org (8.11.0/8.11.0) with ESMTP id eA13WGn27368; Tue, 31 Oct 2000 20:32:16 -0700 (MST) (envelope-from imp@billy-club.village.org) Received: from billy-club.village.org (localhost [127.0.0.1]) by billy-club.village.org (8.11.1/8.8.3) with ESMTP id eA13WVV41938; Tue, 31 Oct 2000 20:32:31 -0700 (MST) Message-Id: <200011010332.eA13WVV41938@billy-club.village.org> To: Matt Dillon Subject: Re: Like to commit my diskprep Cc: arch@FreeBSD.ORG In-reply-to: Your message of "Tue, 31 Oct 2000 10:01:44 PST." <200010311801.e9VI1if19601@earth.backplane.com> References: <200010311801.e9VI1if19601@earth.backplane.com> <200010311747.KAA80353@harmony.village.org> Date: Tue, 31 Oct 2000 20:32:31 -0700 From: Warner Losh Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG In message <200010311801.e9VI1if19601@earth.backplane.com> Matt Dillon writes: : I would welcome diskprep as a port, but it makes absolutely no sense : to commit it to the main tree as a /usr/bin program when the functionality : should properly be placed in the disklabel program. Why abandon : disklabel when that's the program everyone already knows how to use, : and when fixing it is so fraggin easy?
Diskprep uses disklabel(8) and fdisk(8) to do the right thing. It also allows one to easily build multiple disks that are mostly alike, but might have differing geometries (eg make / 50M, /usr 500M, /var 30M and the rest in /junk). Warner To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 19:34:42 2000 Delivered-To: freebsd-arch@freebsd.org Received: from rover.village.org (rover.village.org [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id 1132537B479; Tue, 31 Oct 2000 19:34:35 -0800 (PST) Received: from billy-club.village.org (billy-club.village.org [10.0.0.3]) by rover.village.org (8.11.0/8.11.0) with ESMTP id eA13YVn27377; Tue, 31 Oct 2000 20:34:31 -0700 (MST) (envelope-from imp@billy-club.village.org) Received: from billy-club.village.org (localhost [127.0.0.1]) by billy-club.village.org (8.11.1/8.8.3) with ESMTP id eA13YlV41951; Tue, 31 Oct 2000 20:34:47 -0700 (MST) Message-Id: <200011010334.eA13YlV41951@billy-club.village.org> To: John Baldwin Subject: Re: Like to commit my diskprep Cc: arch@FreeBSD.org In-reply-to: Your message of "Tue, 31 Oct 2000 11:31:23 PST." References: Date: Tue, 31 Oct 2000 20:34:47 -0700 From: Warner Losh Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG In message John Baldwin writes: : : On 31-Oct-00 Warner Losh wrote: : > Before anyone asks, the biggest difference between my diskprep and : > Matt's recent changes are that diskprep doesn't introduce a new api : > into the kernel and doesn't pollute disklabel with functions it : > traditionally hasn't done. Matt's changes put functionality into : > edisklabel and the kernel. : : Actually, I would think that creating a virgin disklabel would be : part of disklabel's job. After all, doesn't it make sense to use : the disklabel program to create/edit disklabel's? The problem is that FreeBSD/i386 needs to put two different labels on the disk. 
One is for the BIOS/legacy OSes that run on i386. The other is for FreeBSD. FreeBSD/alpha doesn't have this problem because it doesn't need the second set of labels. Other FreeBSD ports might need to have different things done in the future to make them co-exist with other native platforms. Warner To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 19:41: 9 2000 Delivered-To: freebsd-arch@freebsd.org Received: from rover.village.org (rover.village.org [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id 87E7537B4FE; Tue, 31 Oct 2000 19:40:58 -0800 (PST) Received: from billy-club.village.org (billy-club.village.org [10.0.0.3]) by rover.village.org (8.11.0/8.11.0) with ESMTP id eA13eun27426; Tue, 31 Oct 2000 20:40:56 -0700 (MST) (envelope-from imp@billy-club.village.org) Received: from billy-club.village.org (localhost [127.0.0.1]) by billy-club.village.org (8.11.1/8.8.3) with ESMTP id eA13fCV42009; Tue, 31 Oct 2000 20:41:12 -0700 (MST) Message-Id: <200011010341.eA13fCV42009@billy-club.village.org> To: obrien@freebsd.org Subject: Re: Like to commit my diskprep Cc: arch@freebsd.org In-reply-to: Your message of "Tue, 31 Oct 2000 13:29:46 PST." <20001031132945.B28476@dragon.nuxi.com> References: <20001031132945.B28476@dragon.nuxi.com> <200010311747.KAA80353@harmony.village.org> Date: Tue, 31 Oct 2000 20:41:11 -0700 From: Warner Losh Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG In message <20001031132945.B28476@dragon.nuxi.com> "David O'Brien" writes: : Since Matt's bits do fix the basic problem, is diskprep OBE except maybe : as a wrapper for fdisk & disklabel? But Matt is unwilling to bring them into current, so they are not a contender as far as I'm concerned. I don't care how wonderful they are. Without someone willing to champion them into -current they don't exist as far as I'm concerned.
They also violate the disk layering that we need to maintain as we move to new platforms. Also, diskprep allows one to "mass produce" disks in such a way that you have the same partitioning on all of them, except maybe one "hog" slice that picks up the extra bits that different geometries might require. Warner To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 19:48:38 2000 Delivered-To: freebsd-arch@freebsd.org Received: from pike.osd.bsdi.com (pike.osd.bsdi.com [204.216.28.222]) by hub.freebsd.org (Postfix) with ESMTP id A102F37B4CF for ; Tue, 31 Oct 2000 19:48:35 -0800 (PST) Received: from laptop.baldwin.cx (john@dhcp241.osd.bsdi.com [204.216.28.241]) by pike.osd.bsdi.com (8.11.0/8.9.3) with ESMTP id eA13lkf31975; Tue, 31 Oct 2000 19:47:46 -0800 (PST) (envelope-from jhb@FreeBSD.org) Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <200011010334.eA13YlV41951@billy-club.village.org> Date: Tue, 31 Oct 2000 19:48:48 -0800 (PST) From: John Baldwin To: Warner Losh Subject: Re: Like to commit my diskprep Cc: arch@FreeBSD.org Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On 01-Nov-00 Warner Losh wrote: > In message John Baldwin writes: >: >: On 31-Oct-00 Warner Losh wrote: >: > Before anyone asks, the biggest difference between my diskprep and >: > Matt's recent changes are that diskprep doesn't introduce a new api >: > into the kernel and doesn't pollute disklabel with functions it >: > traditionally hasn't done. Matt's changes put functionality into >: > edisklabel and the kernel. >: >: Actually, I would think that creating a virgin disklabel would be >: part of disklabel's job. After all, doesn't it make sense to use >: the disklabel program to create/edit disklabel's? 
>
> The problem is that FreeBSD/i386 needs to put two different labels on
> the disk.  One is for the BIOS/legacy OSes that run on i386.  The
> other is for FreeBSD.  FreeBSD/alpha doesn't have this problem because
> it doesn't need the second set of labels.  Other FreeBSD ports might
> need to have different things done in the future to make them co-exist
> with other native platforms.

The disklabel we stick in an x86 slice is almost identical to the normal
disklabel on an alpha disk.  The only differences that I can recall are some
minor layout tweaks that control where the boot code goes.  Both of these are
handled by the disklabel(8) program, which makes sense since they are nearly
identical.  The extra label that x86 uses is the MBR, which is managed by the
x86-specific fdisk(8) command.  All that Dillon's patch does is fix one broken
case: creating a brand new disklabel (the BSD type of disklabel) inside of a
new slice.  The old disklabel(8) could already edit an existing disklabel in
a slice, and some people managed to fake it along by telling it to use fd360
as the disktab entry for slices and other nasty stuff.  fdisk(8) should manage
the MBR that is present only on x86 and ia64, and disklabel(8) should manage
the BSD-style disklabel that is used in UFS/FFS.  I fail to see why this
doesn't make sense.

> Warner

--

John Baldwin -- http://www.FreeBSD.org/~jhb/
PGP Key: http://www.baldwin.cx/~john/pgpkey.asc
"Power Users Use the Power to Serve!"
- http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 19:50:51 2000 Delivered-To: freebsd-arch@freebsd.org Received: from sydney.worldwide.lemis.com (sydney.worldwide.lemis.com [192.109.197.167]) by hub.freebsd.org (Postfix) with ESMTP id 2427037B4C5; Tue, 31 Oct 2000 19:50:44 -0800 (PST) Received: (from grog@localhost) by wantadilla.lemis.com (8.11.0/8.9.3) id e973vqA28717; Sat, 7 Oct 2000 13:27:52 +0930 (CST) (envelope-from grog) Date: Sat, 7 Oct 2000 13:27:52 +0930 From: Greg Lehey To: Terry Lambert Cc: John Baldwin , Daniel Eischen , arch@FreeBSD.ORG, Alfred Perlstein , Mark Murray , Jake Burkholder , Boris Popov , freebsd-smp@FreeBSD.ORG Subject: Re: Mutexes and semaphores Message-ID: <20001007132752.A28665@wantadilla.lemis.com> References: <20001005113139.C27736@fw.wintelcom.net> <200010052142.OAA15421@usr05.primenet.com> <200009251938.MAA29311@usr02.primenet.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 1.0i In-Reply-To: <200009251938.MAA29311@usr02.primenet.com>; from tlambert@primenet.com on Mon, Sep 25, 2000 at 07:38:22PM +0000 Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia Phone: +61-8-8388-8286 Fax: +61-8-8388-8725 Mobile: +61-418-838-708 WWW-Home-Page: http://www.lemis.com/~grog X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF 13 24 52 F8 6D A4 95 EF Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Monday, 25 September 2000 at 19:38:22 +0000, Terry Lambert wrote: >>> If we are going to support recursive mutex, I think it would be >>> better to add separate calls/macros/data types to support them, >>> so the the mtx mutexes can be simplified. Calls to mtx_enter >>> with the recursive mutex type wouldn't even compile. >> >> Err, the recursive nature of the mutexes is very trivial. It >> doesn't affect the complexity of the mutexes at all. > > Yes, it does. 
Ownership precludes hand-off. Recursion support > implies permission and tacit approval. > > A mutex is not recursive. There are things you simply can not > implement when recursion is permitted for all of your primitives. > > The most obvious argument is still that a mutex is intended to > protect data, not code. Recursion is only required if the mutex > is actually protecting reentrancy of code, not access to data. On Thursday, 5 October 2000 at 21:42:28 +0000, Terry Lambert wrote: >>> There is another problem; printf's inside a kthread corrupt like >>> crazy. They look very unthreadsafe. >> >> do NOT use printf without Giant. > > This strikes me as being rather inane. > > If printf won't work without holding the lock, then it damn well > should acquire the lock if it isn't already held, and release it > if it acquired it, before returning. Make up your mind. Greg -- Finger grog@lemis.com for PGP public key See complete headers for address and phone numbers To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 19:52: 2 2000 Delivered-To: freebsd-arch@freebsd.org Received: from pike.osd.bsdi.com (pike.osd.bsdi.com [204.216.28.222]) by hub.freebsd.org (Postfix) with ESMTP id 4247337B4C5; Tue, 31 Oct 2000 19:52:00 -0800 (PST) Received: from laptop.baldwin.cx (john@dhcp241.osd.bsdi.com [204.216.28.241]) by pike.osd.bsdi.com (8.11.0/8.9.3) with ESMTP id eA13pBf32064; Tue, 31 Oct 2000 19:51:11 -0800 (PST) (envelope-from jhb@FreeBSD.org) Message-ID: X-Mailer: XFMail 1.4.0 on FreeBSD X-Priority: 3 (Normal) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 8bit MIME-Version: 1.0 In-Reply-To: <200011010341.eA13fCV42009@billy-club.village.org> Date: Tue, 31 Oct 2000 19:52:13 -0800 (PST) From: John Baldwin To: Warner Losh Subject: Re: Like to commit my diskprep Cc: arch@FreeBSD.org, obrien@FreeBSD.org Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: 
FreeBSD.ORG On 01-Nov-00 Warner Losh wrote: > In message <20001031132945.B28476@dragon.nuxi.com> "David O'Brien" writes: >: Since Matt's bits do fix the basic problem, is diskprep OBE except maybe >: as a wrapper for fdisk & disklabel? > > But Matt is unwilling to bring them into current, so they are not a > contender as far as I'm concerned. I don't care how wonderful they > are. Without someone willing to champion them into -current they > don't exist as far as I'm concerned. They also violate the disk > layering that we need to maintain as we move to new platforms. Actually, Jordan has already committed them to current. And they don't violate layering. The only layering violation we have is dangerously dedicated mode when done from disklabel via 'disklabel ad0 auto' instead of 'fdisk -I ad0 ; disklabel ad0s1 auto', which Matt's fixes actually let us do now. > Also, diskprep allows one to "mass produce" disks in such a way that > you have the same partitioning on all of them, except maybe one "hog" > slice that picks up the extra bits that different geometries might > require. It is a nice utility then, much like adduser is a nice utility. Whether or not it belongs in ports or the base system depends on which color of paint you prefer for your bikesheds.. > Warner -- John Baldwin -- http://www.FreeBSD.org/~jhb/ PGP Key: http://www.baldwin.cx/~john/pgpkey.asc "Power Users Use the Power to Serve!" 
- http://www.FreeBSD.org/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 20: 1:20 2000 Delivered-To: freebsd-arch@freebsd.org Received: from magnesium.net (toxic.magnesium.net [207.154.84.15]) by hub.freebsd.org (Postfix) with SMTP id B051937B4CF for ; Tue, 31 Oct 2000 20:01:18 -0800 (PST) Received: (qmail 68678 invoked by uid 1142); 1 Nov 2000 04:01:18 -0000 Date: 31 Oct 2000 20:01:18 -0800 Date: Tue, 31 Oct 2000 20:01:06 -0800 From: Jason Evans To: Chuck Paterson Cc: freebsd-arch@FreeBSD.ORG Subject: Re: MP: per-CPU mbuf allocation lists Message-ID: <20001031200106.L48771@canonware.com> References: <200010312225.PAA04504@berserker.bsdi.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: <200010312225.PAA04504@berserker.bsdi.com>; from cp@bsdi.com on Tue, Oct 31, 2000 at 03:25:33PM -0700 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Tue, Oct 31, 2000 at 03:25:33PM -0700, Chuck Paterson wrote: > I would really really like to encourage anyone who wants > to do this type of work to please first help get more stuff out > from under Giant so we can start getting this thing to be act > more like a SMP system and less like a MP system that can't take > interrupts in the kernel. I strongly agree with Chuck here. We have a lot of ground work to do before such optimizations are of any importance. This discussion is similar in nature to the long discussion of mutexes (recursive/non-recursive, APIs, yadda yadda) -- both are irrelevant to the current state of FreeBSD. What we really need right now is help in moving -current forward to the point that such discussions are relevant. This is just soaking up people's time and making the real work go slower. 
If you really want to make a difference, please consider how you can help solve the issues we need to address *right now*. Jason SMP project manager To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 20:14:54 2000 Delivered-To: freebsd-arch@freebsd.org Received: from rover.village.org (rover.village.org [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id 9179A37B4CF; Tue, 31 Oct 2000 20:14:37 -0800 (PST) Received: from billy-club.village.org (billy-club.village.org [10.0.0.3]) by rover.village.org (8.11.0/8.11.0) with ESMTP id eA14ETn27612; Tue, 31 Oct 2000 21:14:32 -0700 (MST) (envelope-from imp@billy-club.village.org) Received: from billy-club.village.org (localhost [127.0.0.1]) by billy-club.village.org (8.11.1/8.8.3) with ESMTP id eA14EiV42330; Tue, 31 Oct 2000 21:14:44 -0700 (MST) Message-Id: <200011010414.eA14EiV42330@billy-club.village.org> To: John Baldwin Subject: Re: Like to commit my diskprep Cc: arch@FreeBSD.org, obrien@FreeBSD.org In-reply-to: Your message of "Tue, 31 Oct 2000 19:52:13 PST." References: Date: Tue, 31 Oct 2000 21:14:43 -0700 From: Warner Losh Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG In message John Baldwin writes: : Actually, Jordan has already committed them to current. And they don't : violate layering. The only layering violation we have is dangerously : dedicated mode when done from disklabel via 'disklabel ad0 auto' instead : of 'fdisk -I ad0 ; disklabel ad0s1 auto', which Matt's fixes actually : let us do now. Now that I'm less mobile, I've been able to take a close look at the code. It looks fairly good and will make diskprep's life easier in many ways. The new ioctl isn't strictly necessary, but does make life easier for disklabel to figure things out. I was able to divine this information from fdisk's output... 
I also see why I thought Matt's code added an mbr on top of that, which was a pilot error on my part. Warner To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 22:23: 9 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id 7CD9937B4CF for ; Tue, 31 Oct 2000 22:23:07 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA16Mxw26242; Tue, 31 Oct 2000 22:22:59 -0800 (PST) (envelope-from dillon) Date: Tue, 31 Oct 2000 22:22:59 -0800 (PST) From: Matt Dillon Message-Id: <200011010622.eA16Mxw26242@earth.backplane.com> To: Warner Losh Cc: arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: <200010311801.e9VI1if19601@earth.backplane.com> <200010311747.KAA80353@harmony.village.org> <200011010332.eA13WVV41938@billy-club.village.org> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG : :In message <200010311801.e9VI1if19601@earth.backplane.com> Matt Dillon writes: :: I would welcome diskprep as a port, but it makes absolutely no sense :: to commit it to the main tree as a /usr/bin program when the functionality :: should properly be placed in the disklabel program. Why abandon :: disklabel when that's the program everyone already knows how to use, :: and when fixing it is so fraggin easy? : :Diskprep uses disklabel(8) and fdisk(8) to do the right thing. It :also allows one to easily build multiple disks that are mostly alike, :but might have differing geometries (eg make / 50M, /usr 500M, /var :30M and the rest in /junk). : :Warner And this is better than simply fixing disklabel because .... ? 
-Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Tue Oct 31 22:38:43 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id 62A1437B4CF; Tue, 31 Oct 2000 22:38:41 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA16cdU26323; Tue, 31 Oct 2000 22:38:39 -0800 (PST) (envelope-from dillon) Date: Tue, 31 Oct 2000 22:38:39 -0800 (PST) From: Matt Dillon Message-Id: <200011010638.eA16cdU26323@earth.backplane.com> To: Warner Losh Cc: John Baldwin , arch@FreeBSD.ORG, obrien@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: <200011010414.eA14EiV42330@billy-club.village.org> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG : :Now that I'm less mobile, I've been able to take a close look at the :code. It looks fairly good and will make diskprep's life easier in :many ways. The new ioctl isn't strictly necessary, but does make life :easier for disklabel to figure things out. I was able to divine this :information from fdisk's output... : :I also see why I thought Matt's code added an mbr on top of that, :which was a pilot error on my part. : :Warner Well, ok... now that someone has actually bothered to look at the patch :-(. The patch really isn't that big a deal. Keep in mind that the existing ioctl, DIOCGDINFO, was hacked to pieces years ago to pull double duty. For some (but not all devices), if there is no valid label the above function will return a virgin label. DIOCGDINFO also aliases the first FreeBSD slice to da0 if there is a valid slice (that isn't dangerously dedicated), god knows why. 
Even worse, on top of the massive confusion these hacks already cause, the dangerously dedicated label will tend to look like a real label to a DOS program or to fdisk showing FreeBSD on s4, but with a dangerously dedicated label you can't specify slice 4 in any of the FreeBSD disk handling commands, you can only specify the base disk. It makes a twisted sort of sense but the result is mass confusion. The original ioctl should never have been hacked to do double duty or to alias... there really should have been a DIOCGVIRGIN ioctl from the get-go and specifying 'da0' instead of 'da0s1' for a real slice should have never been hacked to work. The DIOCGDINFO ioctl should have been made to return an error if the requested disklabel did not exist rather then fake one up, and we should never have started aliasing da0s1 to da0 in the slice-case. I don't think we can unwind this mess easily, but we can at least make system programs like disklabel behave in a consistent and rational manner. -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 3: 3:41 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mailman.zeta.org.au (mailman.zeta.org.au [203.26.10.16]) by hub.freebsd.org (Postfix) with ESMTP id 8874637B4C5; Wed, 1 Nov 2000 03:03:37 -0800 (PST) Received: from bde.zeta.org.au (bde.zeta.org.au [203.2.228.102]) by mailman.zeta.org.au (8.8.7/8.8.7) with ESMTP id VAA26949; Wed, 1 Nov 2000 21:56:55 +1100 Date: Wed, 1 Nov 2000 21:56:42 +1100 (EST) From: Bruce Evans X-Sender: bde@besplex.bde.org To: "Jacques A. Vidrine" Cc: Poul-Henning Kamp , John Baldwin , arch@FreeBSD.ORG, Warner Losh Subject: Re: Like to commit my diskprep In-Reply-To: <20001031163953.B18974@hamlet.nectar.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Tue, 31 Oct 2000, Jacques A. 
Vidrine wrote: > I always had the impression that the utilities that took a bare name > like da0s1 did so in part to save the user from figuring out whether > /dev/da0s1 or /dev/rda0s1 was appropriate. That distinction is gone > now. It also saves them from figuring out whether the device name needs an 'a' or 'c' or 'd' partition suffix (or no suffix). Bruce To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 6:59: 3 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail-relay.eunet.no (mail-relay.eunet.no [193.71.71.242]) by hub.freebsd.org (Postfix) with ESMTP id BE94E37B4D7; Wed, 1 Nov 2000 06:59:00 -0800 (PST) Received: from login-1.eunet.no (login-1.eunet.no [193.75.110.2]) by mail-relay.eunet.no (8.9.3/8.9.3/GN) with ESMTP id PAA04996; Wed, 1 Nov 2000 15:58:52 +0100 (CET) (envelope-from mbendiks@eunet.no) Received: from localhost (mbendiks@localhost) by login-1.eunet.no (8.9.3/8.8.8) with ESMTP id PAA02796; Wed, 1 Nov 2000 15:58:52 +0100 (CET) (envelope-from mbendiks@eunet.no) X-Authentication-Warning: login-1.eunet.no: mbendiks owned process doing -bs Date: Wed, 1 Nov 2000 15:58:52 +0100 (CET) From: Marius Bendiksen To: "John W. De Boskey" Cc: Kris Kennaway , Wilko Bulte , Warner Losh , obrien@FreeBSD.ORG, arch@FreeBSD.ORG Subject: Re: cvs commit: src/release/scripts dokern.sh In-Reply-To: <20001025150314.A62263@bsdwins.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > > I think a lot of people use this feature. > scripted installs... man sysinstall I think he was referring to people using the interactive feature. 
Marius To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 7:32:47 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail-relay.eunet.no (mail-relay.eunet.no [193.71.71.242]) by hub.freebsd.org (Postfix) with ESMTP id 7CF0637B479; Wed, 1 Nov 2000 07:32:45 -0800 (PST) Received: from login-1.eunet.no (login-1.eunet.no [193.75.110.2]) by mail-relay.eunet.no (8.9.3/8.9.3/GN) with ESMTP id QAA09903; Wed, 1 Nov 2000 16:32:43 +0100 (CET) (envelope-from mbendiks@eunet.no) Received: from localhost (mbendiks@localhost) by login-1.eunet.no (8.9.3/8.8.8) with ESMTP id QAA02994; Wed, 1 Nov 2000 16:32:43 +0100 (CET) (envelope-from mbendiks@eunet.no) X-Authentication-Warning: login-1.eunet.no: mbendiks owned process doing -bs Date: Wed, 1 Nov 2000 16:32:43 +0100 (CET) From: Marius Bendiksen To: John Baldwin Cc: arch@FreeBSD.org Subject: Re: Like to commit my diskprep In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > Fair enough. Should we then start changing our other tools to > make the bogus form deprecated and warn the user with the intention > of axeing it altogether in 6.0 or some such? This sounds like a good idea to me, though I'd personally prefer to axe it in -CURRENT straight off (as the people using current are tracking the relevant lists and will see the heads-up), and let it stay for 5.0-R. Just put the warnings about deprecation into -STABLE. 
Marius To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 9:29:43 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail.wgate.com (mail.wgate.com [38.219.83.4]) by hub.freebsd.org (Postfix) with ESMTP id 79EB937B4CF; Wed, 1 Nov 2000 09:29:40 -0800 (PST) Received: from jesup.eng.tvol.net ([10.32.2.26]) by mail.wgate.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id VT2YBMVL; Wed, 1 Nov 2000 12:29:40 -0500 Reply-To: Randell Jesup To: Warner Losh Cc: obrien@FreeBSD.ORG, arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: <20001031132945.B28476@dragon.nuxi.com> <200010311747.KAA80353@harmony.village.org> <200011010341.eA13fCV42009@billy-club.village.org> From: Randell Jesup Date: 01 Nov 2000 12:33:20 -0500 In-Reply-To: Warner Losh's message of "Tue, 31 Oct 2000 20:41:11 -0700" Message-ID: User-Agent: Gnus/5.0807 (Gnus v5.8.7) Emacs/20.7 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Warner Losh writes: >Also, diskprep allows one to "mass produce" disks in such a way that >you have the same partitioning on all of them, except maybe one "hog" >slice that picks up the extra bits that different geometries might >require. IMHO disklabel should have always been able to do that anyways. Fix disklabel. If you still need a better UI/skin/X interface on top of disklabel, fine, but fix disklabel (and fdisk, newfs, etc) first. 
-- Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94) rjesup@wgate.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 9:35:12 2000 Delivered-To: freebsd-arch@freebsd.org Received: from rover.village.org (rover.village.org [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id B40AE37B4CF; Wed, 1 Nov 2000 09:35:07 -0800 (PST) Received: from harmony.village.org (harmony.village.org [10.0.0.6]) by rover.village.org (8.11.0/8.11.0) with ESMTP id eA1HZ4n31744; Wed, 1 Nov 2000 10:35:05 -0700 (MST) (envelope-from imp@harmony.village.org) Received: from harmony.village.org (localhost.village.org [127.0.0.1]) by harmony.village.org (8.9.3/8.8.3) with ESMTP id KAA54300; Wed, 1 Nov 2000 10:35:03 -0700 (MST) Message-Id: <200011011735.KAA54300@harmony.village.org> To: Randell Jesup Subject: Re: Like to commit my diskprep Cc: obrien@FreeBSD.ORG, arch@FreeBSD.ORG In-reply-to: Your message of "01 Nov 2000 12:33:20 EST." References: <20001031132945.B28476@dragon.nuxi.com> <200010311747.KAA80353@harmony.village.org> <200011010341.eA13fCV42009@billy-club.village.org> Date: Wed, 01 Nov 2000 10:35:03 -0700 From: Warner Losh Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG In message Randell Jesup writes: : Warner Losh writes: : >Also, diskprep allows one to "mass produce" disks in such a way that : >you have the same partitioning on all of them, except maybe one "hog" : >slice that picks up the extra bits that different geometries might : >require. : : IMHO disklabel should have always been able to do that anyways. : Fix disklabel. If you still need a better UI/skin/X interface on top of : disklabel, fine, but fix disklabel (and fdisk, newfs, etc) first. How should I fix disklabel? 
There's currently no syntax for the concept of a "hog" partition in this disklabel, or any other one that I've seen (except for Solbourne's interactive one, but it didn't have a non-interactive way to do that). This functionality would arguably be a big wart on disklabel, but then again disklabel isn't going to win any beauty contests anytime soon. OpenBSD does have a disklabel -E which is akin to the Solboune interactive disk label program. Warner To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 9:53:29 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id CA82E37B479; Wed, 1 Nov 2000 09:53:26 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA1HrCV29435; Wed, 1 Nov 2000 09:53:12 -0800 (PST) (envelope-from dillon) Date: Wed, 1 Nov 2000 09:53:12 -0800 (PST) From: Matt Dillon Message-Id: <200011011753.eA1HrCV29435@earth.backplane.com> To: Warner Losh Cc: Randell Jesup , obrien@FreeBSD.ORG, arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: <20001031132945.B28476@dragon.nuxi.com> <200010311747.KAA80353@harmony.village.org> <200011010341.eA13fCV42009@billy-club.village.org> <200011011735.KAA54300@harmony.village.org> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG :: :: IMHO disklabel should have always been able to do that anyways. :: Fix disklabel. If you still need a better UI/skin/X interface on top of :: disklabel, fine, but fix disklabel (and fdisk, newfs, etc) first. : :How should I fix disklabel? There's currently no syntax for the :concept of a "hog" partition in this disklabel, or any other one that :I've seen (except for Solbourne's interactive one, but it didn't have :a non-interactive way to do that). 
This functionality would arguably
:be a big wart on disklabel, but then again disklabel isn't going to
:win any beauty contests anytime soon. OpenBSD does have a disklabel
:-E which is akin to the Solbourne interactive disk label program.
:
:Warner

I don't see there being an issue here. We have fdisk, which creates a hog slice just fine. And we have disklabel, which applies a FreeBSD label to a disk or a slice (with my patch) just fine. Two programs, two functions.

I suppose if we wanted to give disklabel an option to create a hog partition, it could simply exec 'fdisk -BI' and then auto-label the resulting slice. There is no particular need to put a create-hog-partition function directly into disklabel. Disklabel currently only messes around with the DOS partition if you tell it to create a dangerously dedicated disklabel. The only real problem here is that the user doesn't necessarily realize he just did that because the command line looks roughly the same as the command line for creating a label in a bootable slice.
-Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 10:12:45 2000 Delivered-To: freebsd-arch@freebsd.org Received: from rover.village.org (rover.village.org [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id CAC0437B4CF; Wed, 1 Nov 2000 10:12:41 -0800 (PST) Received: from harmony.village.org (harmony.village.org [10.0.0.6]) by rover.village.org (8.11.0/8.11.0) with ESMTP id eA1ICbn31942; Wed, 1 Nov 2000 11:12:38 -0700 (MST) (envelope-from imp@harmony.village.org) Received: from harmony.village.org (localhost.village.org [127.0.0.1]) by harmony.village.org (8.9.3/8.8.3) with ESMTP id LAA97417; Wed, 1 Nov 2000 11:12:37 -0700 (MST) Message-Id: <200011011812.LAA97417@harmony.village.org> To: Matt Dillon Subject: Re: Like to commit my diskprep Cc: Randell Jesup , obrien@FreeBSD.ORG, arch@FreeBSD.ORG In-reply-to: Your message of "Wed, 01 Nov 2000 09:53:12 PST." <200011011753.eA1HrCV29435@earth.backplane.com> References: <200011011753.eA1HrCV29435@earth.backplane.com> <20001031132945.B28476@dragon.nuxi.com> <200010311747.KAA80353@harmony.village.org> <200011010341.eA13fCV42009@billy-club.village.org> <200011011735.KAA54300@harmony.village.org> Date: Wed, 01 Nov 2000 11:12:37 -0700 From: Warner Losh Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG In message <200011011753.eA1HrCV29435@earth.backplane.com> Matt Dillon writes: : I don't see there being an issue here. We have fdisk, which creates : a hog slice just fine. And we have disklabel, which applies a : FreeBSD label to a disk or a slice (with my patch) just fine. Two : programs, two functions. : : I suppose if we wanted to give disklabel an option to create a hog : partition, it could simply exec 'fdisk -BI' and then auto-label the : resulting slice. There is no particular need to put a : create-hog-parttion function directly into disklabel. 
Disklabel
: currently only messes around with the DOS partition if you tell it
: to create a dangerously dedicated disklabel. The only real problem
: here is that the user doesn't necessarily realize he just did that
: because the command line looks roughly the same as the command line
: for creating a label in a bootable slice.

A "hog" partition is one that soaks up all the rest of the slice after other partitions are carved out. That's different than what you are describing. In the Solbourne tool, I'd tell it, approximately:

    Create partition a
        how big? 32M
        type? fs
        mount point? /
    Create partition b
        how big? 64m
        type? swap
    Create partition d
        how big? hog
        mount point? /usr
    Create partition e
        how big? 32M
        type? fs
        mount point? /var
    Commit!

and the tool would create these partitions, giving all the disk space that was left after the non-hog partitions to the hog partition. If you had a 300M disk, then partition d would be 300M-128M = 172M. This is useful if you are also producing disks that are 290M or 310M in size, as you can imagine.

This is different than what you are describing, which is the ability to have disklabel create a slice that covers the entire disk.
Warner To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 11:56:29 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id 2230437B4CF; Wed, 1 Nov 2000 11:56:28 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA1JuJO30728; Wed, 1 Nov 2000 11:56:19 -0800 (PST) (envelope-from dillon) Date: Wed, 1 Nov 2000 11:56:19 -0800 (PST) From: Matt Dillon Message-Id: <200011011956.eA1JuJO30728@earth.backplane.com> To: Warner Losh Cc: Randell Jesup , obrien@FreeBSD.ORG, arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: <200011011753.eA1HrCV29435@earth.backplane.com> <20001031132945.B28476@dragon.nuxi.com> <200010311747.KAA80353@harmony.village.org> <200011010341.eA13fCV42009@billy-club.village.org> <200011011735.KAA54300@harmony.village.org> <200011011812.LAA97417@harmony.village.org> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG :and the tool would create these partitions, giving all the disk space :that was left after the non-hog partitions to the hog partition. If :you had a 300M disk, then partition d would be 300M-128M= 172M. This :is useful if you are also producing disks that are 290M or 310M in :size, as you can imagine. : :This is different than what you are describing, which is the ability :to have disklabel create a slice that covers the entire disk. : :Warner All disklabel does is install a disklabel on a slice. If you want to automatically allocate the remaining free space in an existing dos partition table into a new slice, that sounds like a relatively trivial job for fdisk to accomplish. Disklabel could then be used to label the new slice. 
-Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 12:13: 8 2000 Delivered-To: freebsd-arch@freebsd.org Received: from rover.village.org (rover.village.org [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id 8A5F237B479; Wed, 1 Nov 2000 12:12:56 -0800 (PST) Received: from harmony.village.org (harmony.village.org [10.0.0.6]) by rover.village.org (8.11.0/8.11.0) with ESMTP id eA1KCtn32550; Wed, 1 Nov 2000 13:12:55 -0700 (MST) (envelope-from imp@harmony.village.org) Received: from harmony.village.org (localhost.village.org [127.0.0.1]) by harmony.village.org (8.9.3/8.8.3) with ESMTP id NAA98466; Wed, 1 Nov 2000 13:12:55 -0700 (MST) Message-Id: <200011012012.NAA98466@harmony.village.org> To: Matt Dillon Subject: Re: Like to commit my diskprep as a port Cc: Randell Jesup , obrien@FreeBSD.ORG, arch@FreeBSD.ORG In-reply-to: Your message of "Wed, 01 Nov 2000 11:56:19 PST." <200011011956.eA1JuJO30728@earth.backplane.com> References: <200011011956.eA1JuJO30728@earth.backplane.com> <200011011753.eA1HrCV29435@earth.backplane.com> <20001031132945.B28476@dragon.nuxi.com> <200010311747.KAA80353@harmony.village.org> <200011010341.eA13fCV42009@billy-club.village.org> <200011011735.KAA54300@harmony.village.org> <200011011812.LAA97417@harmony.village.org> Date: Wed, 01 Nov 2000 13:12:54 -0700 From: Warner Losh Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG [[ just fyi, i've desided to make diskprep a port for now ]] In message <200011011956.eA1JuJO30728@earth.backplane.com> Matt Dillon writes: : automatically allocate the remaining free space in an existing dos : partition table into a new slice, that sounds like a relatively trivial : job for fdisk to accomplish. Disklabel could then be used to label the : new slice. The hog concept has nothing to do with DOS slices. 
It is a flexible BSD partition that grows or shrinks as other partitions take place. Its final size isn't known until all the other partitions are sized. There can be only one hog partition. In the previous example:

    disk size:          301M    291M    311M
            slice
    /       da0s1a       32M     32M     32M
    swap    da0s1b       64M     64M     64M
    all     da0s1c      300M    290M    310M
    /usr    da0s1d      172M    162M    182M
    /var    da0s1e       32M     32M     32M

fdisk -I da0 would be used in all cases to create da0s1 of 300M, 290M and 310M respectively since 1M would be reserved for the MBR, just for the sake of easy math.

diskprep lets you create config files that would look like:

    $hog_part = 'd';
    $part{a}{type} = "4.2BSD";
    $part{a}{size} = 32*1024;
    $part{b}{type} = "swap";
    $part{b}{size} = 64*1024;
    $part{d}{type} = "4.2BSD";
    $part{e}{type} = "4.2BSD";
    $part{e}{size} = 32*1024;

It will also do a newfs on these partitions (if 4.2BSD) and has some hooks to do minor fs tuning (minfree and the like). It is a more complete tool for preparing a disk for use than disklabel, especially in an environment where you have to deal with lots of different geometries and disk sizes. We have 10 different CF parts in production right now with 8 different sizes. Some of the sizes are different by a few blocks, while others are different by several megabytes (we have 32M, 45M, 48M and 64M CF parts from 5 different MFGs and only two pairs of them have the same exact size).

One big drawback is that it isn't interactive, nor does it have a nice gui. However, one could layer a simple curses or X interface on top of it very easily. The user interacts with the pretty pictures, hits "do it". The UI writes out a file like above and calls diskprep with it. At the present time, I've not had a need for a UI beyond the config file, so I've just kept it simple.
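
The hog arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not diskprep's code: the function name `layout`, the `FIXED` table and the use of whole megabytes are all hypothetical, and the 1M reserved for the MBR is the same easy-math simplification used in the message.

```python
# Hypothetical sketch of the "hog" calculation; names and units are
# illustrative and are not taken from diskprep itself.

def layout(slice_mb, fixed_mb, hog):
    """Return {partition letter: size in MB}.

    fixed_mb maps partition letters to fixed sizes; the single hog
    partition receives whatever is left in the slice.
    """
    used = sum(fixed_mb.values())
    if used > slice_mb:
        raise ValueError("fixed partitions do not fit in the slice")
    sizes = dict(fixed_mb)
    sizes[hog] = slice_mb - used
    return sizes


# Fixed partitions from the example: a = /, b = swap, e = /var.
FIXED = {"a": 32, "b": 64, "e": 32}

if __name__ == "__main__":
    for disk_mb in (301, 291, 311):
        slice_mb = disk_mb - 1  # 1M set aside for the MBR, as in the example
        print(disk_mb, layout(slice_mb, FIXED, hog="d"))
```

The same fixed sizes give a different hog size on each disk, which is the point of the concept: one description covers parts whose capacities differ slightly.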
Warner To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 12:21: 9 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail.wgate.com (mail.wgate.com [38.219.83.4]) by hub.freebsd.org (Postfix) with ESMTP id E4AEB37B479; Wed, 1 Nov 2000 12:21:06 -0800 (PST) Received: from jesup.eng.tvol.net ([10.32.2.26]) by mail.wgate.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id VT2YB3DA; Wed, 1 Nov 2000 15:21:12 -0500 Reply-To: Randell Jesup To: Warner Losh Cc: Matt Dillon , Randell Jesup , obrien@FreeBSD.ORG, arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: <200011011753.eA1HrCV29435@earth.backplane.com> <20001031132945.B28476@dragon.nuxi.com> <200010311747.KAA80353@harmony.village.org> <200011010341.eA13fCV42009@billy-club.village.org> <200011011735.KAA54300@harmony.village.org> <200011011812.LAA97417@harmony.village.org> From: Randell Jesup Date: 01 Nov 2000 15:24:52 -0500 In-Reply-To: Warner Losh's message of "Wed, 01 Nov 2000 11:12:37 -0700" Message-ID: User-Agent: Gnus/5.0807 (Gnus v5.8.7) Emacs/20.7 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Warner Losh writes: >A "hog" partition is one that soaks up all the rest of the slice after >other partitions are carved out. That's different than what you are >describing. [snip description of solbourne tool w/ hog] >This is different than what you are describing, which is the ability >to have disklabel create a slice that covers the entire disk. I want the equivalent of the solbourne tool (along with Matt's mods), though from your example its UI still leaves a lot to be desired. I'd like to be able to specify this sort of thing using disklabel. I'd like to dump disktab and make disklabel smarter about laying out partitions on "unknown" (read all but floppy nowadays) disks. 
Not to mention that (as best I can tell from the manpage) the 'auto' option to disklabel can only be used if you want it to be written to the disk - you can't just see what disklabel would do unless you have a partition to trash. Perhaps there's a way to avoid that by ^C'ing it before it writes - but I DON'T trust it not to write a new label, at least not from the man page. I'm not against better user-level tools. I'm against not fixing problems/holes in the lower-level tools. -- Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94) rjesup@wgate.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 12:38:36 2000 Delivered-To: freebsd-arch@freebsd.org Received: from rover.village.org (rover.village.org [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id 0F69F37B4C5; Wed, 1 Nov 2000 12:38:33 -0800 (PST) Received: from harmony.village.org (harmony.village.org [10.0.0.6]) by rover.village.org (8.11.0/8.11.0) with ESMTP id eA1KcVn32698; Wed, 1 Nov 2000 13:38:31 -0700 (MST) (envelope-from imp@harmony.village.org) Received: from harmony.village.org (localhost.village.org [127.0.0.1]) by harmony.village.org (8.9.3/8.8.3) with ESMTP id NAA98734; Wed, 1 Nov 2000 13:38:30 -0700 (MST) Message-Id: <200011012038.NAA98734@harmony.village.org> To: Randell Jesup Subject: Re: Like to commit my diskprep Cc: Matt Dillon , obrien@FreeBSD.ORG, arch@FreeBSD.ORG In-reply-to: Your message of "01 Nov 2000 15:24:52 EST." 
References: <200011011753.eA1HrCV29435@earth.backplane.com> <20001031132945.B28476@dragon.nuxi.com> <200010311747.KAA80353@harmony.village.org> <200011010341.eA13fCV42009@billy-club.village.org> <200011011735.KAA54300@harmony.village.org> <200011011812.LAA97417@harmony.village.org> Date: Wed, 01 Nov 2000 13:38:30 -0700 From: Warner Losh Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG In message Randell Jesup writes: : I want the equivalent of the solbourne tool (along with Matt's : mods), though from your example its UI still leaves a lot to be desired. : I'd like to be able to specify this sort of thing using disklabel. I'd : like to dump disktab and make disklabel smarter about laying out : partitions on "unknown" (read all but floppy nowadays) disks. The UI in the Solbourne tool was good. My example was way stripped down. : Not to mention that (as best I can tell from the manpage) the : 'auto' option to disklabel can only be used if you want it to be written : to the disk - you can't just see what disklabel would do unless you have : a partition to trash. Perhaps there's a way to avoid that by ^C'ing it : before it writes - but I DON'T trust it not to write a new label, at least : not from the man page. Looking at the changes, all this seems to do is to make it possible to get the hoked up disk label. disklabel da0s1 should already report that. : I'm not against better user-level tools. I'm against not fixing : problems/holes in the lower-level tools. The problem is making sure that things get fixed in the right way. Matt's changes are a little gross, but get things fixed for now. The grossness comes from the hardwiring some defaults and from not using other sources of information about the disk, but it does have the advantage of being in -current and working as far as I can tell. One could just as easily construct this using DIOCGSLICEINFO and digging around in there for the whole disk slice. 
What concerns me about the patch is that it uses the whole disk slice rather than the slice of the disk requested. fdisk -I makes a slice that is, typically, 63 sectors smaller than the whole disk. What also concerns me about his patch is that he didn't use the already extant clone_label to make sure that all of the fields were filled in correctly. The only "problem" with using it is that it mallocs the new label, but a free in the code would solve that problem. It looks like there might be a bug with clone_label at the moment in that it doesn't set the p_offset for the lp1->d_partitions[RAW_PART] and assumes that it is set to 0. This should be true almost always, which is likely why no one else has seen the problem. :-) Now, where did I put that current box for testing... Warner To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 12:55:37 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id 7F53137B4C5; Wed, 1 Nov 2000 12:55:35 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA1KtUO31302; Wed, 1 Nov 2000 12:55:30 -0800 (PST) (envelope-from dillon) Date: Wed, 1 Nov 2000 12:55:30 -0800 (PST) From: Matt Dillon Message-Id: <200011012055.eA1KtUO31302@earth.backplane.com> To: Warner Losh Cc: Randell Jesup , obrien@FreeBSD.ORG, arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: <200011011753.eA1HrCV29435@earth.backplane.com> <20001031132945.B28476@dragon.nuxi.com> <200010311747.KAA80353@harmony.village.org> <200011010341.eA13fCV42009@billy-club.village.org> <200011011735.KAA54300@harmony.village.org> <200011011812.LAA97417@harmony.village.org> <200011012038.NAA98734@harmony.village.org> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG :What concerns me about the patch is
that it uses the whole disk slice
:rather than the slice of the disk requested. fdisk -I makes a slice
:that is, typically, 63 sectors smaller than whole disk. What also

Huh? It uses the whole-disk partition OUT OF THE SLICE structure, i.e. for da0s1 (as an example), not out of the structure for da0. So the 63 sectors is already accounted for. I think you are confusing your structures.

Use fdisk to create two FreeBSD slices rather than just one. Then disklabel them:

    disklabel -w -r da0s1 auto
    disklabel -w -r da0s2 auto

You will note that disklabel produces the correct partition sizes.

-Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 16:46:54 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail.wgate.com (mail.wgate.com [38.219.83.4]) by hub.freebsd.org (Postfix) with ESMTP id 8A07537B4CF; Wed, 1 Nov 2000 16:46:50 -0800 (PST) Received: from jesup.eng.tvol.net ([10.32.2.26]) by mail.wgate.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id VT2YBRFG; Wed, 1 Nov 2000 19:46:56 -0500 Reply-To: Randell Jesup To: Warner Losh Cc: obrien@FreeBSD.ORG, arch@FreeBSD.ORG, Matt Dillon Subject: Re: Like to commit my diskprep References: <20001031132945.B28476@dragon.nuxi.com> <200010311747.KAA80353@harmony.village.org> <200011010341.eA13fCV42009@billy-club.village.org> <200011011735.KAA54300@harmony.village.org> From: Randell Jesup Date: 01 Nov 2000 19:50:37 -0500 In-Reply-To: Warner Losh's message of "Wed, 01 Nov 2000 10:35:03 -0700" Message-ID: User-Agent: Gnus/5.0807 (Gnus v5.8.7) Emacs/20.7 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Warner Losh writes: >: IMHO disklabel should have always been able to do that anyways. >: Fix disklabel.
If you still need a better UI/skin/X interface on top of
>: disklabel, fine, but fix disklabel (and fdisk, newfs, etc) first.
>
>How should I fix disklabel? There's currently no syntax for the
>concept of a "hog" partition in this disklabel, or any other one that
>I've seen (except for Solbourne's interactive one, but it didn't have
>a non-interactive way to do that). This functionality would arguably
>be a big wart on disklabel, but then again disklabel isn't going to
>win any beauty contests anytime soon. OpenBSD does have a disklabel
>-E which is akin to the Solbourne interactive disk label program.

Well -E would be a darn good start (not knowing any more about it than you mentioned here). Adding 'hog' to the syntax when editing/using a label would be nice. Totally revising the syntax would be nice. Perhaps something like:

    #        size   offset    fstype   [fsize bsize bps/cpg]
      a:      50M        0    4.2BSD     1024  8192    16
      b:     100M        *      swap
      c:        *        0    unused
      f:     500M        *    4.2BSD
      h:        *        *     vinum

(or maybe h: remaining * for a hog partition, or "hog", etc)

instead of:

    #        size   offset    fstype   [fsize bsize bps/cpg]
      a:    81920        0    4.2BSD     1024  8192    16   # (Cyl.    0 - 84*)
      b:   160000    81920      swap                        # (Cyl.   84* - 218*)
      c:  1173930        0    unused        0     0         # (Cyl.    0 - 1211*)
      f:  2000000   211920    4.2BSD     1024  8192    16   #
      h:   962010  2211920     vinum                        #

Basically, allow "*" to mean "use the appropriate default", and allow sizes to be specified in (blocks)/K/M/G.

I have a patch for this that fixes all sorts of other weaknesses in disklabel in the offing (fully written, not yet fully tested). I got annoyed at merely discussing this. It also improves error reporting, and checks for overlapping partitions (excluding 'c' of course).
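
One way to read the proposed syntax is sketched below in Python. It is not the patch mentioned above, and every rule in it is an assumption made for illustration: sectors are taken to be 512 bytes, a "*" offset is taken to mean "start right after the previous partition", a "*" size is taken to mean "everything up to the end of the slice", and 'c' always covers the whole slice. The names `parse_size`, `resolve` and `has_overlap` are hypothetical.

```python
# Hypothetical interpretation of the proposed label syntax; the rules for
# "*" and the 512-byte sector size are assumptions, not disklabel behaviour.

SECTOR = 512
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Return a sector count, or None when the field is '*'."""
    if text == "*":
        return None
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix] // SECTOR
    return int(text)  # a bare number is already a sector count


def resolve(entries, total):
    """entries: (name, size, offset) strings in label order.

    Returns {name: (offset, size)} in sectors for a slice of `total` sectors.
    """
    result = {}
    cursor = 0  # first sector after the previous non-'c' partition
    for name, size, offset in entries:
        if name == "c":
            result[name] = (0, total)
            continue
        start = parse_size(offset)
        if start is None:
            start = cursor
        length = parse_size(size)
        if length is None:
            length = total - start
        if length < 0 or start + length > total:
            raise ValueError("partition %s does not fit" % name)
        result[name] = (start, length)
        cursor = start + length
    return result


def has_overlap(label):
    """True if any two partitions other than 'c' share a sector."""
    spans = sorted(v for k, v in label.items() if k != "c")
    return any(a_start + a_len > b_start
               for (a_start, a_len), (b_start, _) in zip(spans, spans[1:]))


if __name__ == "__main__":
    proposed = [("a", "50M", "0"), ("b", "100M", "*"), ("c", "*", "0"),
                ("f", "500M", "*"), ("h", "*", "*")]
    label = resolve(proposed, total=2097152)  # a 1G slice, chosen arbitrarily
    for name in sorted(label):
        print(name, label[name])
    print("overlap:", has_overlap(label))
```

Under these assumptions the trailing "h: * *" line behaves as a hog partition without needing a separate keyword, which is one of the alternatives floated in the message.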
-- Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94) rjesup@wgate.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Wed Nov 1 22:49:54 2000 Delivered-To: freebsd-arch@freebsd.org Received: from homer.softweyr.com (bsdconspiracy.net [208.187.122.220]) by hub.freebsd.org (Postfix) with ESMTP id 7AFDF37B479; Wed, 1 Nov 2000 22:49:52 -0800 (PST) Received: from [127.0.0.1] (helo=softweyr.com ident=Fools trust ident!) by homer.softweyr.com with esmtp (Exim 3.16 #1) id 13rE9W-0000Gp-00; Wed, 01 Nov 2000 23:47:10 -0700 Message-ID: <3A010DEE.83323DF@softweyr.com> Date: Wed, 01 Nov 2000 23:47:10 -0700 From: Wes Peters Organization: Softweyr LLC X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.2.12 i386) X-Accept-Language: en MIME-Version: 1.0 To: Poul-Henning Kamp Cc: John Baldwin , Warner Losh , arch@FreeBSD.org Subject: Re: Like to commit my diskprep References: <15116.973025855@critter> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Poul-Henning Kamp wrote: > > In message , John Baldwin writes: > > >Fair enough. Should we then start changing our other tools to > >make the bogus form deprecated and warn the user with the intention > >of axeing it altogether in 6.0 or some such? > > Something like that yes. It's an anachronism that was better solved by filename completion in the shell. -- "Where am I, and what am I doing in this handbasket?" 
Wes Peters Softweyr LLC wes@softweyr.com http://softweyr.com/ To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 7: 5:59 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail-relay.eunet.no (mail-relay.eunet.no [193.71.71.242]) by hub.freebsd.org (Postfix) with ESMTP id 1F87C37B666 for ; Thu, 2 Nov 2000 07:05:56 -0800 (PST) Received: from login-1.eunet.no (login-1.eunet.no [193.75.110.2]) by mail-relay.eunet.no (8.9.3/8.9.3/GN) with ESMTP id QAA66758; Thu, 2 Nov 2000 16:05:54 +0100 (CET) (envelope-from mbendiks@eunet.no) Received: from localhost (mbendiks@localhost) by login-1.eunet.no (8.9.3/8.8.8) with ESMTP id QAA10293; Thu, 2 Nov 2000 16:05:53 +0100 (CET) (envelope-from mbendiks@eunet.no) X-Authentication-Warning: login-1.eunet.no: mbendiks owned process doing -bs Date: Thu, 2 Nov 2000 16:05:53 +0100 (CET) From: Marius Bendiksen To: Randell Jesup Cc: arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Not to bring out the paint early, but I have a suggestion, should the concept of hog partitions be introduced (regardless of whether you stick them in disklabel, diskpart, or yadisklabel3): make it possible to define multiple variable-sized partitions, with percentile ratio to use from the hog-space, ie. / 64m /var 128m /usr 50% /home 50% That would yield more flexibility, at a (hopefully) low additional cost in code. 
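As a rough sketch of how such a percentage split might work (an illustration only; `split_hog` and its arguments are hypothetical and belong to no existing tool): subtract the fixed-size partitions from the disk, then give each percentage partition its share of what is left. Sizes are in sectors, and any rounding remainder simply stays unallocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Divide the space left over after the fixed-size partitions among
 * the percentage-sized ones.  Returns 0 on success, -1 if the fixed
 * sizes do not fit on the disk or the percentages add up to more
 * than 100.
 */
static int
split_hog(uint64_t disk, const uint64_t *fixed, size_t nfixed,
    const unsigned *pct, size_t npct, uint64_t *out)
{
	uint64_t used = 0, hog;
	unsigned total_pct = 0;
	size_t i;

	assert(out != NULL || npct == 0);
	for (i = 0; i < nfixed; i++) {
		if (fixed[i] > disk - used)
			return (-1);	/* fixed partitions overflow the disk */
		used += fixed[i];
	}
	for (i = 0; i < npct; i++) {
		if (pct[i] > 100)
			return (-1);
		total_pct += pct[i];
	}
	if (total_pct > 100)
		return (-1);

	hog = disk - used;
	for (i = 0; i < npct; i++) {
		/* floor(hog * pct / 100) without overflowing 64 bits */
		out[i] = hog / 100 * pct[i] + hog % 100 * pct[i] / 100;
	}
	return (0);
}
```

For the layout above on a hypothetical 2 GB disk (4194304 sectors), the 64m and 128m partitions take 393216 sectors, and /usr and /home would each get 1900544.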
Marius To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 8:32:45 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id E9AFA37B479 for ; Thu, 2 Nov 2000 08:32:42 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA2GWZ138286; Thu, 2 Nov 2000 08:32:35 -0800 (PST) (envelope-from dillon) Date: Thu, 2 Nov 2000 08:32:35 -0800 (PST) From: Matt Dillon Message-Id: <200011021632.eA2GWZ138286@earth.backplane.com> To: Marius Bendiksen Cc: Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG : :Not to bring out the paint early, but I have a suggestion, should the :concept of hog partitions be introduced (regardless of whether you stick :them in disklabel, diskpart, or yadisklabel3): make it possible to define :multiple variable-sized partitions, with percentile ratio to use from the :hog-space, ie. : :/ 64m :/var 128m :/usr 50% :/home 50% : :That would yield more flexibility, at a (hopefully) low additional cost in :code. : :Marius : With the size of hard disks today I'm not sure there would be much need, since generally you will want to specify fixed size partitions for all but the last one. For me: / 128M (so I can have a bunch of debug kernels) swap 2G /var 128M (bigger if this is a mail machine) /var/tmp 128M /usr 2G /data1 (remainder) (/home placed in /, /usr, or /data1 depending) e.g. there wouldn't be a whole lot of need for a 10G /usr. Once hard drives got big enough I just left it at 2G. One thing I am finding myself doing a lot these days is increasing the block size for things like /data1 - that will often have fewer larger files. 
FreeBSD4 reserves 16K of VM per struct buf no matter what, so increasing the block size from 8K to 16K is a breeze. Larger block sizes will put more pressure on the buffer cache and may still have heavy-load deadlock situations , but should also generally work. -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 8:58:35 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail.wgate.com (mail.wgate.com [38.219.83.4]) by hub.freebsd.org (Postfix) with ESMTP id 6CBD737B479 for ; Thu, 2 Nov 2000 08:58:31 -0800 (PST) Received: from jesup.eng.tvol.net ([10.32.2.26]) by mail.wgate.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id VT2YBZ74; Thu, 2 Nov 2000 11:58:28 -0500 Reply-To: Randell Jesup To: Matt Dillon Cc: Marius Bendiksen , Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: <200011021632.eA2GWZ138286@earth.backplane.com> From: Randell Jesup Date: 02 Nov 2000 12:02:13 -0500 In-Reply-To: Matt Dillon's message of "Thu, 2 Nov 2000 08:32:35 -0800 (PST)" Message-ID: User-Agent: Gnus/5.0807 (Gnus v5.8.7) Emacs/20.7 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Matt Dillon writes: >:/ 64m >:/var 128m >:/usr 50% >:/home 50% >: >:That would yield more flexibility, at a (hopefully) low additional cost in >:code. Not that hard. I'll look into it in my mods. > With the size of hard disks today I'm not sure there would be much > need, since generally you will want to specify fixed size partitions > for all but the last one. For me: Perhaps. Everyone has their own preference and own situation. For you all-but-one fixed works. I might want all of them defined as percentages perhaps, or all but /. 
> One thing I am finding myself doing a lot these days is increasing the > block size for things like /data1 - that will often have fewer larger > files. FreeBSD4 reserves 16K of VM per struct buf no matter what, so > increasing the block size from 8K to 16K is a breeze. Larger block > sizes will put more pressure on the buffer cache and may still have > heavy-load deadlock situations, but should also generally work. The defaults for -b and -f and -c for newfs/etc are WOEFULLY out-of-date. See the sysinstall checkin comment I referenced. I use 16K myself. It's possible larger might be better, especially for large partitions - perhaps make it variable on partition size.... And 16 for cpg is truly criminal (can you say thousands of spare root blocks? And very slow newfs?) -- Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94) rjesup@wgate.com To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 9:25:47 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id E545F37B4CF for ; Thu, 2 Nov 2000 09:25:41 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA2HPeM38718; Thu, 2 Nov 2000 09:25:40 -0800 (PST) (envelope-from dillon) Date: Thu, 2 Nov 2000 09:25:40 -0800 (PST) From: Matt Dillon Message-Id: <200011021725.eA2HPeM38718@earth.backplane.com> To: Randell Jesup Cc: Marius Bendiksen , Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: <200011021632.eA2GWZ138286@earth.backplane.com> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG : The defaults for -b and -f and -c for newfs/etc are WOEFULLY :out-of-date. See the sysinstall checkin comment I referenced. I use 16K :myself.
It's possible larger might be better, especially for large
:partitions - perhaps make it variable on partition size....  And 16 for cpg
:is truly criminal (can you say thousands of spare root blocks?  And very
:slow newfs?)
:
:--
:Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94)
:rjesup@wgate.com

    Not to mention the bytes/inode (-i).  If you want fsck to go fast on a
    big filesystem, reducing the number of inodes helps a lot.  I find myself
    using -i 32768 or -i 65536 or even higher numbers on partitions which
    hold big database files.

    da1s1d:

    newfs /dev/da1s1d
    mount
    df -i
    Filesystem  1K-blocks     Used    Avail Capacity iused   ifree  %iused
    /dev/da1s1d  70491666        2 64852332     0%       1 8962045     0%

    time fsck /dev/da1s1d
    1 files, 1 used, 35245832 free (16 frags, 4405727 blocks, 0.0% fragmentation)
    14.681u 2.100s 1:05.92 25.4% 247+6201k 0+1io 0pf+0w

    newfs -i 32768 /dev/da1s1d
    mount
    df -i
    Filesystem  1K-blocks     Used    Avail Capacity iused   ifree  %iused
    /dev/da1s1d  71331858        2 65625308     0%       1 2240509     0%

    time fsck /dev/da1s1d
    1 files, 1 used, 35665928 free (16 frags, 4458239 blocks, 0.0% fragmentation)
    7.480u 0.693s 0:38.34 21.3% 241+6018k 0+1io 0pf+0w

    Combining it with your -c suggestion (my god, the -c default is
    ridiculously low!  Can we change the default?)

    newfs -i 32768 -c 100 /dev/da1s1d
    mount
    df -i
    Filesystem  1K-blocks     Used    Avail Capacity iused   ifree  %iused
    /dev/da1s1d  71389936        2 65678740     0%       1 2246397     0%

    time fsck /dev/da1s1d
    1 files, 1 used, 35694967 free (15 frags, 4461869 blocks, 0.0% fragmentation)
    7.584u 0.556s 0:18.82 43.1% 248+6172k 0+1io 0pf+0w
                  ^^^^ 18 seconds vs 66 seconds in the default case.
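The inode counts in these runs follow roughly from dividing the filesystem size by the bytes-per-inode density. The sketch below is only a back-of-the-envelope estimate: `estimate_inodes` is a hypothetical helper, the real newfs rounds per cylinder group, and the 8192 figure for the default run is an assumption inferred from the 4:1 ratio of the reported ifree values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Approximate number of inodes newfs would create for a filesystem
 * of fs_kbytes kilobytes at the given density (the -i argument).
 * Ignores per-cylinder-group rounding and metadata overhead.
 */
static uint64_t
estimate_inodes(uint64_t fs_kbytes, uint32_t bytes_per_inode)
{
	assert(bytes_per_inode != 0);
	return (fs_kbytes * 1024 / bytes_per_inode);
}
```

Fed the 1K-blocks figures from df above, this lands within about two percent of the reported ifree values for both the default run and the -i 32768 run, which is the fourfold reduction in inodes that fsck no longer has to scan.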
-Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 11:55:24 2000 Delivered-To: freebsd-arch@freebsd.org Received: from smtp.med.und.nodak.edu (smtp.med.und.NoDak.edu [134.129.166.20]) by hub.freebsd.org (Postfix) with ESMTP id 627FE37B4C5 for ; Thu, 2 Nov 2000 11:55:22 -0800 (PST) Received: from geo.med.und.nodak.edu ([134.129.166.11] helo=medicine.nodak.edu) by smtp.med.und.nodak.edu with esmtp (Exim 3.16 #1) id 13rQSG-000I44-00 for arch@FreeBSD.ORG; Thu, 02 Nov 2000 13:55:20 -0600 Message-ID: <3A01C6A3.23C4CD61@medicine.nodak.edu> Date: Thu, 02 Nov 2000 13:55:15 -0600 From: Barry Pederson X-Mailer: Mozilla 4.75 [en] (WinNT; U) X-Accept-Language: en MIME-Version: 1.0 To: arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: <200011021632.eA2GWZ138286@earth.backplane.com> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Randell Jesup wrote: > > > The defaults for -b and -f and -c for newfs/etc are WOEFULLY > out-of-date. See the sysinstall checkin comment I referenced. I use 16K > myself. It's possible larger might be better, especially for large > partitions - perhaps make it variable on partition size.... And 16 for cpg > is truly criminal (can you say thousands of spare root blocks? And very > slow newfs?) The man page for newfs says: --------- BUGS The boot code of FreeBSD assumes that the file system that carries the kernel has blocks of 8 kilobytes and fragments of 1 kilobyte. You will not be able to boot from a file system that uses another size. --------- So I'd assume you have to be careful to leave the root at the current defaults? (or make the boot code smarter?)
Barry To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 13:14:11 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail-relay.eunet.no (mail-relay.eunet.no [193.71.71.242]) by hub.freebsd.org (Postfix) with ESMTP id 16B1E37B4C5 for ; Thu, 2 Nov 2000 13:14:09 -0800 (PST) Received: from login-1.eunet.no (login-1.eunet.no [193.75.110.2]) by mail-relay.eunet.no (8.9.3/8.9.3/GN) with ESMTP id WAA94090; Thu, 2 Nov 2000 22:14:00 +0100 (CET) (envelope-from mbendiks@eunet.no) Received: from localhost (mbendiks@localhost) by login-1.eunet.no (8.9.3/8.8.8) with ESMTP id WAA13266; Thu, 2 Nov 2000 22:14:00 +0100 (CET) (envelope-from mbendiks@eunet.no) X-Authentication-Warning: login-1.eunet.no: mbendiks owned process doing -bs Date: Thu, 2 Nov 2000 22:14:00 +0100 (CET) From: Marius Bendiksen To: Matt Dillon Cc: Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep In-Reply-To: <200011021632.eA2GWZ138286@earth.backplane.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > e.g. there wouldn't be a whole lot of need for a 10G /usr. Once hard > drives got big enough I just left it at 2G. This is a matter of preference (hence the reference to paint) and also the use intended for the system in question. However, the code is not going to be significantly more complex due to this, and I think it's a much better, ie cleaner, way of doing it. > One thing I am finding myself doing a lot these days is increasing the > block size for things like /data1 - that will often have fewer larger > files. FreeBSD4 reserves 16K of VM per struct buf no matter what, so > increasing the block size from 8K to 16K is a breeze. Larger block > sizes will put more pressure on the buffer cache and may still have > heavy-load deadlock situations , but should also generally work. 
I tend to use a block size of 16K to get the number of cylinder groups down to semi-sane levels. Marius To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 13:19:13 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail-relay.eunet.no (mail-relay.eunet.no [193.71.71.242]) by hub.freebsd.org (Postfix) with ESMTP id 9782C37B4C5 for ; Thu, 2 Nov 2000 13:19:10 -0800 (PST) Received: from login-1.eunet.no (login-1.eunet.no [193.75.110.2]) by mail-relay.eunet.no (8.9.3/8.9.3/GN) with ESMTP id WAA94286; Thu, 2 Nov 2000 22:19:09 +0100 (CET) (envelope-from mbendiks@eunet.no) Received: from localhost (mbendiks@localhost) by login-1.eunet.no (8.9.3/8.8.8) with ESMTP id WAA13280; Thu, 2 Nov 2000 22:19:09 +0100 (CET) (envelope-from mbendiks@eunet.no) X-Authentication-Warning: login-1.eunet.no: mbendiks owned process doing -bs Date: Thu, 2 Nov 2000 22:19:09 +0100 (CET) From: Marius Bendiksen To: Matt Dillon Cc: Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep In-Reply-To: <200011021725.eA2HPeM38718@earth.backplane.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > Not to mention the bytes/inode (-i) If you want fsck to go fast on a > big filesystem, reducing the number of inodes helps a lot. I find myself > using -i 32768 or -i 65536 or even higher numbers on partitions which > hold big database files. FFS is woefully inadequate at handling databases, due to the block indirection, but e.g. Oracle will allow you to run directly on top of a device. As to the cylinder group count, couldn't we, rather than changing the default, have a new flag to newfs which takes it to the maximum possible, and preferably another one to try to keep the number of cylinder groups sane by pushing the block size etc up automatically?
Marius To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 13:21:47 2000 Delivered-To: freebsd-arch@freebsd.org Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20]) by hub.freebsd.org (Postfix) with ESMTP id A96DE37B4C5 for ; Thu, 2 Nov 2000 13:21:44 -0800 (PST) Received: (from bright@localhost) by fw.wintelcom.net (8.10.0/8.10.0) id eA2LLef12044; Thu, 2 Nov 2000 13:21:40 -0800 (PST) Date: Thu, 2 Nov 2000 13:21:40 -0800 From: Alfred Perlstein To: Marius Bendiksen Cc: Matt Dillon , Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep Message-ID: <20001102132140.W20567@fw.wintelcom.net> References: <200011021725.eA2HPeM38718@earth.backplane.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.4i In-Reply-To: ; from mbendiks@eunet.no on Thu, Nov 02, 2000 at 10:19:09PM +0100 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG * Marius Bendiksen [001102 13:19] wrote: > > Not to mention the bytes/inode (-i) If you want fsck to go fast on a > > big filesystem, reducing the number of inodes helps a lot. I find myself > > using -i 32768 or -i 65536 or even higher numbers on partitions which > > hold big database files. > > FFS is woefully inadequate at handling databases, due to the block > indirection, but e.g. Oracle will allow you to run directly on top > of a device. Block indirection could be optimized by attempting to allocate indirect blocks in the same area as either the inode or datablocks that the indirect blocks address. > As to the cylinder group count, couldn't we, rather than changing the > default, have a new flag to newfs which takes it to the maximum possible, > and preferably another one to try to keep the number of cylinder groups > sane by pushing the block size etc up automatically? Yes, patches would be nice.
:) -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] "I have the heart of a child; I keep it in a jar on my desk." To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 13:35:20 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id 64BC537B6E4 for ; Thu, 2 Nov 2000 13:35:17 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA2LZA740940; Thu, 2 Nov 2000 13:35:10 -0800 (PST) (envelope-from dillon) Date: Thu, 2 Nov 2000 13:35:10 -0800 (PST) From: Matt Dillon Message-Id: <200011022135.eA2LZA740940@earth.backplane.com> To: Alfred Perlstein Cc: Marius Bendiksen , Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: <200011021725.eA2HPeM38718@earth.backplane.com> <20001102132140.W20567@fw.wintelcom.net> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG : :* Marius Bendiksen [001102 13:19] wrote: :> > Not to mention the bytes/inode (-i) If you want fsck to go fast on a :> > big filesystem, reducing the number of inodes helps a lot. I find myself :> > using -i 32768 or -i 65536 or even higher numbers on partitions which :> > hold big database files. :> :> FFS is woefully inadequate at handling databases, due to the block :> indirection, but e.g. Oracle will allow you to run directly on top :> of a device. : :Block indirection could be optimized by attempting to allocate :indirect blocks in the same area as either the inode or datablocks :that the indirect blocks address. Indirect blocks aren't relevant if you are using a large block size, because there are few enough of them the OS has no problem caching them. 
Consider a 32 GB table file:

    Block size        Bytes required to store leaf indirect blocks for a 32GB file
    ---------------   -----
    8K block size     16MB
    32K block size    4MB

    calculation: filesize / blocksize * 4 (bytes per block pointer) = # of bytes worth of leaf indirect blocks required to reference the file. (higher level indirect blocks are inconsequential)

It becomes somewhat more of an issue for a terabyte-sized database, but still no biggy considering the memory you can get these days. A raw device will still be better, but not by much. -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 13:58: 2 2000 Delivered-To: freebsd-arch@freebsd.org Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20]) by hub.freebsd.org (Postfix) with ESMTP id E005837B4CF for ; Thu, 2 Nov 2000 13:57:58 -0800 (PST) Received: (from bright@localhost) by fw.wintelcom.net (8.10.0/8.10.0) id eA2Lvs814424; Thu, 2 Nov 2000 13:57:54 -0800 (PST) Date: Thu, 2 Nov 2000 13:57:54 -0800 From: Alfred Perlstein To: Matt Dillon Cc: Marius Bendiksen , Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep Message-ID: <20001102135754.Y20567@fw.wintelcom.net> References: <200011021725.eA2HPeM38718@earth.backplane.com> <20001102132140.W20567@fw.wintelcom.net> <200011022135.eA2LZA740940@earth.backplane.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.4i In-Reply-To: <200011022135.eA2LZA740940@earth.backplane.com>; from dillon@earth.backplane.com on Thu, Nov 02, 2000 at 01:35:10PM -0800 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG * Matt Dillon [001102 13:36] wrote: > > : > :* Marius Bendiksen [001102 13:19] wrote: > :> > Not to mention the bytes/inode (-i) If you want fsck to go fast on a > :> > big filesystem, reducing the number of inodes helps a lot.
I find myself > :> > using -i 32768 or -i 65536 or even higher numbers on partitions which > :> > hold big database files. > :> > :> FFS is woefully inadequate at handling databases, due to the block > :> indirection, but e.g. Oracle will allow you to run directly on top > :> of a device. > : > :Block indirection could be optimized by attempting to allocate > :indirect blocks in the same area as either the inode or datablocks > :that the indirect blocks address. > > Indirect blocks aren't relevant if you are using a large block size, > because there are few enough of them the OS has no problem caching > them. The problem isn't caching them; it's fsyncing them during appends, which causes additional disk seeks. But that's not exactly a deadly problem, just a little suboptimal. It's also fsyncs on newly created files that can cause problems. -- -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] "I have the heart of a child; I keep it in a jar on my desk." To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 14:29: 8 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail-relay.eunet.no (mail-relay.eunet.no [193.71.71.242]) by hub.freebsd.org (Postfix) with ESMTP id A61EB37B479 for ; Thu, 2 Nov 2000 14:29:05 -0800 (PST) Received: from login-1.eunet.no (login-1.eunet.no [193.75.110.2]) by mail-relay.eunet.no (8.9.3/8.9.3/GN) with ESMTP id XAA97354; Thu, 2 Nov 2000 23:29:03 +0100 (CET) (envelope-from mbendiks@eunet.no) Received: from localhost (mbendiks@localhost) by login-1.eunet.no (8.9.3/8.8.8) with ESMTP id XAA13548; Thu, 2 Nov 2000 23:29:03 +0100 (CET) (envelope-from mbendiks@eunet.no) X-Authentication-Warning: login-1.eunet.no: mbendiks owned process doing -bs Date: Thu, 2 Nov 2000 23:29:03 +0100 (CET) From: Marius Bendiksen To: Alfred Perlstein Cc: Matt Dillon , Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep In-Reply-To:
<20001102132140.W20567@fw.wintelcom.net> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > Block indirection could be optimized by attempting to allocate > indirect blocks in the same area as either the inode or datablocks > that the indirect blocks address. Actually, block indirection could be fixed by raping the code to support the notion of extents. As to allocating in the general locality of the inode or datablock, that would require you to be within a distance of 1 track, and on a system with better things to spend its cache on than indirect blocks, you'll lose some when you hit double or triple indirect, especially with random access. As a side note, I've thought about abusing the actual inodes themselves to hold single indirect blocks. Opinions, apart from the general evilness of abusing the structures in such a fashion? > Yes, patches would be nice. :) Patches cannot be formed until a general consensus exists on how the patches should do things if and when an enterprising soul made them. Otherwise, they stand a good chance at being rejected based on some, possibly relevant, objection to how they work. Also, such patches are likely best formed by the same people that are currently suggesting doing a variety of other things for disklabel and friends. 
Marius To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 14:32:44 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail.wgate.com (mail.wgate.com [38.219.83.4]) by hub.freebsd.org (Postfix) with ESMTP id 88D1337B4C5 for ; Thu, 2 Nov 2000 14:32:23 -0800 (PST) Received: from jesup.eng.tvol.net ([10.32.2.26]) by mail.wgate.com with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) id VT2YB77J; Thu, 2 Nov 2000 17:32:29 -0500 Reply-To: Randell Jesup To: Matt Dillon Cc: Randell Jesup , Marius Bendiksen , arch@FreeBSD.ORG, Warner Losh Subject: Re: Like to commit my diskprep References: <200011021632.eA2GWZ138286@earth.backplane.com> <200011021725.eA2HPeM38718@earth.backplane.com> From: Randell Jesup Date: 02 Nov 2000 17:36:15 -0500 In-Reply-To: Matt Dillon's message of "Thu, 2 Nov 2000 09:25:40 -0800 (PST)" Message-ID: User-Agent: Gnus/5.0807 (Gnus v5.8.7) Emacs/20.7 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Following are changes to disklabel.c to change it to be much more user- friendly (and kick out more warnings for things like overlapped partitions). It supports the syntax I mentioned before: [snipped to keep things smaller] # size offset fstype [fsize bsize bps/cpg] a: 819200 0 4.2BSD 4096 16384 75 # (Cyl. 0 - 812*) b: 800M 819200 swap c: * 0 unused 0 0 e: 10% * 4.2BSD f: 5g * 4.2BSD g: * * 4.2BSD Note that I support both % and * for hog partitions. % is calculated as a %age of space remaining after all fixed-size partitions are dealt with, and * for size (hog) gets anything left after that. * for offset means "calculate it your damn self, disklabel, I don't care". :-) If fsize/etc aren't specified, it uses defaults. This might want to be improved (better defaults, as per other conversation). 
I also added a -n option to suppress writing labels to the slice. Handy for testing!!! If -n is used, it'll dump what it would have written to stdout.

WARNING: Playing with disklabel CAN frag your disk!!!!!

Todo:
        add warnings about incompatibilities between old and new labels.
        update manpage
        change newfs defaults
        don't require "0 0" after an unused partition.

-- Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94) rjesup@wgate.com

*** /usr/src/sbin/disklabel/disklabel.c Sat Jul 1 02:47:46 2000 --- disklabel.c Thu Nov 2 15:29:04 2000 *************** *** 80,85 **** --- 80,91 ---- #define BBSIZE 8192 /* size of boot area, with label */ #endif + /* FIX! These are too low, but are traditional */ + #define DEFAULT_NEWFS_BLOCK 8192U + #define DEFAULT_NEWFS_FRAG 1024U + #define DEFAULT_NEWFS_CPG 16U + + #ifdef tahoe #define NUMBOOT 0 #else *************** *** 119,124 **** --- 125,138 ---- struct disklabel lab; char bootarea[BBSIZE]; + /* partition 'c' is the full disk and is special */ + #define FULL_DISK_PART 2 + #define MAX_PART ('z') + #define MAX_NUM_PARTS (1 + MAX_PART-'a') + char part_size_type[MAX_NUM_PARTS]; + char part_offset_type[MAX_NUM_PARTS]; + int part_set[MAX_NUM_PARTS]; + #if NUMBOOT > 0 int installboot; /* non-zero if we should install a boot program */ char *bootbuf; /* pointer to buffer with remainder of boot prog */ *************** *** 134,145 **** } op = UNSPEC; int rflag; #ifdef DEBUG int debug; ! #define OPTIONS "BNRWb:ders:w" #else ! #define OPTIONS "BNRWb:ers:w" #endif int --- 148,160 ---- } op = UNSPEC; int rflag; + int disable_write; /* set to disable writing to disk label */ #ifdef DEBUG int debug; ! #define OPTIONS "BNRWb:denrs:w" #else !
#define OPTIONS "BNRWb:enrs:w" #endif int *************** *** 152,157 **** --- 167,174 ---- int ch, f = 0, flag, error = 0; char *name = 0; + disable_write = 0; /* paranoia */ + while ((ch = getopt(argc, argv, OPTIONS)) != -1) switch (ch) { #if NUMBOOT > 0 *************** *** 172,177 **** --- 189,197 ---- usage(); op = NOWRITE; break; + case 'n': + disable_write = 1; + break; case 'R': if (op != UNSPEC) usage(); *************** *** 398,473 **** register int i; #endif ! setbootflag(lp); ! lp->d_magic = DISKMAGIC; ! lp->d_magic2 = DISKMAGIC; ! lp->d_checksum = 0; ! lp->d_checksum = dkcksum(lp); ! if (rflag) { ! /* ! * First set the kernel disk label, ! * then write a label to the raw disk. ! * If the SDINFO ioctl fails because it is unimplemented, ! * keep going; otherwise, the kernel consistency checks ! * may prevent us from changing the current (in-core) ! * label. ! */ ! if (ioctl(f, DIOCSDINFO, lp) < 0 && ! errno != ENODEV && errno != ENOTTY) { ! l_perror("ioctl DIOCSDINFO"); ! return (1); ! } ! (void)lseek(f, (off_t)0, SEEK_SET); ! #ifdef __alpha__ ! /* ! * Generate the bootblock checksum for the SRM console. ! */ ! for (p = (u_long *)boot, i = 0, sum = 0; i < 63; i++) ! sum += p[i]; ! p[63] = sum; #endif ! ! /* ! * write enable label sector before write (if necessary), ! * disable after writing. ! */ ! flag = 1; ! if (ioctl(f, DIOCWLABEL, &flag) < 0) ! warn("ioctl DIOCWLABEL"); ! if (write(f, boot, lp->d_bbsize) != lp->d_bbsize) { ! warn("write"); ! return (1); ! } #if NUMBOOT > 0 ! /* ! * Output the remainder of the disklabel ! */ ! if (bootbuf && write(f, bootbuf, bootsize) != bootsize) { ! warn("write"); ! return(1); ! } #endif ! flag = 0; ! (void) ioctl(f, DIOCWLABEL, &flag); ! } else if (ioctl(f, DIOCWDINFO, lp) < 0) { ! l_perror("ioctl DIOCWDINFO"); ! return (1); ! } #ifdef vax ! if (lp->d_type == DTYPE_SMD && lp->d_flags & D_BADSECT) { ! daddr_t alt; ! ! alt = lp->d_ncylinders * lp->d_secpercyl - lp->d_nsectors; ! 
for (i = 1; i < 11 && i < lp->d_nsectors; i += 2) { ! (void)lseek(f, (off_t)((alt + i) * lp->d_secsize), ! SEEK_SET); ! if (write(f, boot, lp->d_secsize) < lp->d_secsize) ! warn("alternate label %d write", i/2); } - } #endif ! return (0); } void --- 418,502 ---- register int i; #endif ! if (disable_write) ! { ! Warning("write to disk label supressed - label was as follows:"); ! display(stdout,lp); ! return 0; ! } ! else ! { ! setbootflag(lp); ! lp->d_magic = DISKMAGIC; ! lp->d_magic2 = DISKMAGIC; ! lp->d_checksum = 0; ! lp->d_checksum = dkcksum(lp); ! if (rflag) { ! /* ! * First set the kernel disk label, ! * then write a label to the raw disk. ! * If the SDINFO ioctl fails because it is unimplemented, ! * keep going; otherwise, the kernel consistency checks ! * may prevent us from changing the current (in-core) ! * label. ! */ ! if (ioctl(f, DIOCSDINFO, lp) < 0 && ! errno != ENODEV && errno != ENOTTY) { ! l_perror("ioctl DIOCSDINFO"); ! return (1); ! } ! (void)lseek(f, (off_t)0, SEEK_SET); ! #ifdef __alpha__ ! /* ! * Generate the bootblock checksum for the SRM console. ! */ ! for (p = (u_long *)boot, i = 0, sum = 0; i < 63; i++) ! sum += p[i]; ! p[63] = sum; #endif ! ! /* ! * write enable label sector before write (if necessary), ! * disable after writing. ! */ ! flag = 1; ! if (ioctl(f, DIOCWLABEL, &flag) < 0) ! warn("ioctl DIOCWLABEL"); ! if (write(f, boot, lp->d_bbsize) != lp->d_bbsize) { ! warn("write"); ! return (1); ! } #if NUMBOOT > 0 ! /* ! * Output the remainder of the disklabel ! */ ! if (bootbuf && write(f, bootbuf, bootsize) != bootsize) { ! warn("write"); ! return(1); ! } #endif ! flag = 0; ! (void) ioctl(f, DIOCWLABEL, &flag); ! } else if (ioctl(f, DIOCWDINFO, lp) < 0) { ! l_perror("ioctl DIOCWDINFO"); ! return (1); ! } #ifdef vax ! if (lp->d_type == DTYPE_SMD && lp->d_flags & D_BADSECT) { ! daddr_t alt; ! ! alt = lp->d_ncylinders * lp->d_secpercyl - lp->d_nsectors; ! for (i = 1; i < 11 && i < lp->d_nsectors; i += 2) { ! 
(void)lseek(f, (off_t)((alt + i) * lp->d_secsize), ! SEEK_SET); ! if (write(f, boot, lp->d_secsize) < lp->d_secsize) ! warn("alternate label %d write", i/2); ! } } #endif ! return (0); ! } } void *************** *** 934,942 **** --- 963,980 ---- { register char **cpp, *cp; register struct partition *pp; + register int i; char *tp, *s, line[BUFSIZ]; int v, lineno = 0, errors = 0; + for (i = 0; i < MAX_NUM_PARTS; i++) + { + /* paranoia */ + part_set[i] = 0; + part_size_type[i] = '\0'; + part_offset_type[i] = '\0'; + } + lp->d_bbsize = BBSIZE; /* XXX */ lp->d_sbsize = SBSIZE; /* XXX */ while (fgets(line, sizeof(line) - 1, f)) { *************** *** 1138,1153 **** lp->d_trkseek = v; continue; } ! if ('a' <= *cp && *cp <= 'z' && cp[1] == '\0') { unsigned part = *cp - 'a'; if (part > lp->d_npartitions) { fprintf(stderr, ! "line %d: bad partition name\n", lineno); errors++; continue; } pp = &lp->d_partitions[part]; #define NXTNUM(n) { \ if (tp == NULL) { \ fprintf(stderr, "line %d: too few numeric fields\n", lineno); \ --- 1176,1194 ---- lp->d_trkseek = v; continue; } ! /* the ':' was removed above */ ! if ('a' <= *cp && *cp <= MAX_PART && cp[1] == '\0') { unsigned part = *cp - 'a'; if (part > lp->d_npartitions) { fprintf(stderr, ! "line %d: partition name out of range a-%c: %s\n", ! lineno,'a' + lp->d_npartitions - 1,cp); errors++; continue; } pp = &lp->d_partitions[part]; + part_set[part] = 1; #define NXTNUM(n) { \ if (tp == NULL) { \ fprintf(stderr, "line %d: too few numeric fields\n", lineno); \ *************** *** 1160,1231 **** (n) = atoi(cp); \ } \ } ! NXTNUM(v); ! 
if (v < 0) { fprintf(stderr, "line %d: %s: bad partition size\n", lineno, cp); errors++; - } else - pp->p_size = v; - NXTNUM(v); - if (v < 0) { - fprintf(stderr, - "line %d: %s: bad partition offset\n", - lineno, cp); - errors++; - } else - pp->p_offset = v; - cp = tp, tp = word(cp); - cpp = fstypenames; - for (; cpp < &fstypenames[FSMAXTYPES]; cpp++) - if ((s = *cpp) && streq(s, cp)) { - pp->p_fstype = cpp - fstypenames; - goto gottype; - } - if (isdigit(*cp)) - v = atoi(cp); - else - v = FSMAXTYPES; - if ((unsigned)v >= FSMAXTYPES) { - fprintf(stderr, "line %d: %s %s\n", lineno, - "Warning, unknown filesystem type", cp); - v = FS_UNUSED; - } - pp->p_fstype = v; - gottype: - - switch (pp->p_fstype) { - - case FS_UNUSED: /* XXX */ - NXTNUM(pp->p_fsize); - if (pp->p_fsize == 0) - break; - NXTNUM(v); - pp->p_frag = v / pp->p_fsize; break; ! case FS_BSDFFS: ! NXTNUM(pp->p_fsize); ! if (pp->p_fsize == 0) break; ! NXTNUM(v); ! pp->p_frag = v / pp->p_fsize; ! NXTNUM(pp->p_cpg); ! break; ! case FS_BSDLFS: ! NXTNUM(pp->p_fsize); ! if (pp->p_fsize == 0) ! break; ! NXTNUM(v); ! pp->p_frag = v / pp->p_fsize; ! NXTNUM(pp->p_cpg); ! break; ! default: ! break; } continue; } --- 1201,1315 ---- (n) = atoi(cp); \ } \ } + /* retain 1 character following number */ + #define NXTWORD(w,n) { \ + if (tp == NULL) { \ + fprintf(stderr, "line %d: too few numeric fields\n", lineno); \ + errors++; \ + break; \ + } else { \ + char *tmp; \ + cp = tp, tp = word(cp); \ + if (tp == NULL) \ + tp = cp; \ + (n) = strtol(cp,&tmp,10); \ + if (tmp) (w) = *tmp; \ + } \ + } ! v = 0; ! NXTWORD(part_size_type[part],v); ! if (v < 0 || ! (v == 0 && ! part_size_type[part] != '*')) ! { fprintf(stderr, "line %d: %s: bad partition size\n", lineno, cp); errors++; break; + } + else + { + pp->p_size = v; ! v = 0; ! NXTWORD(part_offset_type[part],v); ! if (v < 0 || ! (v == 0 && ! part_offset_type[part] != '*' && ! part_offset_type[part] != '\0')) ! { ! fprintf(stderr, ! "line %d: %s: bad partition offset\n", ! 
lineno, cp); ! errors++; break; ! } ! else ! { ! pp->p_offset = v; ! ! cp = tp, tp = word(cp); ! cpp = fstypenames; ! for (; cpp < &fstypenames[FSMAXTYPES]; cpp++) ! if ((s = *cpp) && streq(s, cp)) { ! pp->p_fstype = cpp - fstypenames; ! goto gottype; ! } ! if (isdigit(*cp)) ! v = atoi(cp); ! else ! v = FSMAXTYPES; ! if ((unsigned)v >= FSMAXTYPES) { ! fprintf(stderr, "line %d: %s %s\n", lineno, ! "Warning, unknown filesystem type", cp); ! v = FS_UNUSED; ! } ! pp->p_fstype = v; ! gottype: ! /* Note: NXTNUM will break us out of the switch only */ ! switch (pp->p_fstype) { ! case FS_UNUSED: /* XXX */ ! /* are these really used for anything? */ ! NXTNUM(pp->p_fsize); ! if (pp->p_fsize == 0) ! break; ! NXTNUM(v); ! pp->p_frag = v / pp->p_fsize; ! break; ! ! /* These happen to be the same */ ! case FS_BSDFFS: ! case FS_BSDLFS: ! /* allow us to accept defaults for fsize/frag/cpg */ ! if (tp) ! { ! NXTNUM(pp->p_fsize); ! if (pp->p_fsize == 0) ! break; ! NXTNUM(v); ! pp->p_frag = v / pp->p_fsize; ! NXTNUM(pp->p_cpg); ! } ! else ! { ! /* FIX! These are too low, but are traditional */ ! pp->p_fsize = DEFAULT_NEWFS_BLOCK; ! pp->p_frag = (unsigned char) DEFAULT_NEWFS_FRAG; ! pp->p_cpg = DEFAULT_NEWFS_CPG; ! /* FIX! we should make these adaptive */ ! } ! break; ! ! default: ! break; ! } ! /* note: we may not have gotten all the entries for the fs */ ! /* though if we didn't, errors will be set. */ ! 
} } continue; } *************** *** 1250,1255 **** --- 1334,1342 ---- register struct partition *pp; int i, errors = 0; char part; + unsigned long total_size,total_percent,current_offset; + int seen_default_offset; + int hog_part; if (lp->d_secsize == 0) { fprintf(stderr, "sector size 0\n"); *************** *** 1286,1291 **** --- 1373,1547 ---- if (lp->d_npartitions > MAXPARTITIONS) Warning("number of partitions (%lu) > MAXPARTITIONS (%d)", (u_long)lp->d_npartitions, MAXPARTITIONS); + + /* first allocate space to the partitions, then offsets */ + total_size = 0; /* in sectors */ + total_percent = 0; /* in percent */ + hog_part = -1; + /* find all fixed partitions */ + for (i = 0; i < lp->d_npartitions; i++) { + pp = &lp->d_partitions[i]; + if (part_set[i]) + { + if (part_size_type[i] == '*') + { + /* partition 2 ('c') is special */ + if (i == FULL_DISK_PART) + pp->p_size = lp->d_secperunit; + else { + if (hog_part != -1) + Warning("Too many '*'-sized partitions (%c and %c)", + hog_part+'a',i+'a'); + else + hog_part = i; + } + } + else + { + char *type; + unsigned long size; + + size = pp->p_size; + switch (part_size_type[i]) + { + case '%': + total_percent += size; + break; + + case 'k': + case 'K': + size *= 1024UL; + break; + + case 'm': + case 'M': + size *= ((unsigned long) 1024*1024); + break; + + case 'g': + case 'G': + size *= ((unsigned long) 1024*1024*1024); + break; + + case '\0': + break; + + default: + Warning("unknown size specifier '%c' (K/M/G are valid)",part_size_type[i]); + break; + } + /* don't count %'s yet */ + if (part_size_type[i] != '%') + { + /* for all not in sectors, convert to sectors */ + if (part_size_type[i] != '\0') + { + if (size % lp->d_secsize != 0) + Warning("partition %c not an integer number of sectors", + i + 'a'); + size /= lp->d_secsize; + pp->p_size = size; + } + /* else already in sectors */ + /* partition 2 ('c') is special */ + if (i != FULL_DISK_PART) + total_size += size; + } + } + } + } + /* handle % partitions - note 
%'s don't need to add up to 100! */ + if (total_percent != 0) + { + long free_space = lp->d_secperunit - total_size; + + if (total_percent > 100) + { + fprintf(stderr,"total percentage %d is greater than 100\n", + total_percent); + errors++; + } + + if (free_space > 0) + { + for (i = 0; i < lp->d_npartitions; i++) { + pp = &lp->d_partitions[i]; + if (part_set[i] && part_size_type[i] == '%') + { + unsigned long old_size = pp->p_size; + + /* careful of overflows! and integer roundoff */ + pp->p_size = ((double)pp->p_size/100) * free_space; + total_size += pp->p_size; + + /* FIX we can lose a sector or so due to roundoff per + partition. A more complex algorithm could avoid that */ + } + } + } + else + { + fprintf(stderr, + "%ld sectors available to give to '*' and '%' partitions\n", + free_space); + errors++; + /* fix? set all % partitions to size 0? */ + } + } + /* give anything remaining to the hog partition */ + if (hog_part != -1) + { + lp->d_partitions[hog_part].p_size = lp->d_secperunit - total_size; + total_size = lp->d_secperunit; + } + + /* Now set the offsets for each partition */ + current_offset = 0; /* in sectors */ + seen_default_offset = 0; + for (i = 0; i < lp->d_npartitions; i++) { + part = 'a' + i; + pp = &lp->d_partitions[i]; + if (part_set[i]) + { + if (part_offset_type[i] == '*') + { + /* partition 2 ('c') is special */ + if (i == FULL_DISK_PART) + pp->p_offset = 0; + else + { + pp->p_offset = current_offset; + seen_default_offset = 1; + } + } + else + { + /* allow them to be out of order for old-style tables */ + /* partition 2 ('c') is special */ + if (pp->p_offset < current_offset && seen_default_offset && + i != FULL_DISK_PART) + { + fprintf(stderr, + "Offset %ld for partition %c overlaps previous partition which ends at %ld\n", + pp->p_offset,i+'a',current_offset); + fprintf(stderr, + "Labels with any *'s for offset must be in ascending order by sector\n"); + errors++; + } + else if (pp->p_offset != current_offset && + i != FULL_DISK_PART && 
seen_default_offset) + { + /* this may give unneeded warnings if partitions are out-of-order */ + Warning("Offset %ld for partition %c doesn't match expected value %ld", + pp->p_offset,i+'a',current_offset); + } + } + /* partition 2 ('c') is special */ + if (i != FULL_DISK_PART) + current_offset = pp->p_offset + pp->p_size; + } + } + for (i = 0; i < lp->d_npartitions; i++) { part = 'a' + i; pp = &lp->d_partitions[i]; *************** *** 1311,1316 **** --- 1567,1596 ---- part); errors++; } + + /* check for overlaps */ + { + int j; + register struct partition *pp2; + + /* this will check for all possible overlaps once and only once */ + for (j = 0; j < i; j++) { + /* partition 2 ('c') is special */ + if (j != FULL_DISK_PART && i != FULL_DISK_PART && + part_set[i] && part_set[j]) + { + pp2 = &lp->d_partitions[j]; + if (pp2->p_offset < pp->p_offset + pp->p_size && + (pp2->p_offset + pp2->p_size > pp->p_offset || + pp2->p_offset >= pp->p_offset)) + { + fprintf(stderr,"partitions %c and %c overlap!\n", + j+'a', i+'a'); + errors++; + } + } + } + } } for (; i < MAXPARTITIONS; i++) { part = 'a' + i; *************** *** 1421,1445 **** fprintf(stderr, "%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n", "usage: disklabel [-r] disk", "\t\t(to read label)", ! " disklabel -w [-r] disk type [ packid ]", "\t\t(to write label with existing boot program)", ! " disklabel -e [-r] disk", "\t\t(to edit label)", ! " disklabel -R [-r] disk protofile", "\t\t(to restore label with existing boot program)", #if NUMBOOT > 1 ! " disklabel -B [ -b boot1 [ -s boot2 ] ] disk [ type ]", "\t\t(to install boot program with existing label)", ! " disklabel -w -B [ -b boot1 [ -s boot2 ] ] disk type [ packid ]", "\t\t(to write label and boot program)", ! " disklabel -R -B [ -b boot1 [ -s boot2 ] ] disk protofile [ type ]", "\t\t(to restore label and boot program)", #else ! " disklabel -B [ -b bootprog ] disk [ type ]", "\t\t(to install boot program with existing on-disk label)", ! 
" disklabel -w -B [ -b bootprog ] disk type [ packid ]", "\t\t(to write label and install boot program)", ! " disklabel -R -B [ -b bootprog ] disk protofile [ type ]", "\t\t(to restore label and install boot program)", #endif " disklabel [-NW] disk", --- 1701,1725 ---- fprintf(stderr, "%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n", "usage: disklabel [-r] disk", "\t\t(to read label)", ! " disklabel -w [-r] [-n] disk type [ packid ]", "\t\t(to write label with existing boot program)", ! " disklabel -e [-r] [-n] disk", "\t\t(to edit label)", ! " disklabel -R [-r] [-n] disk protofile", "\t\t(to restore label with existing boot program)", #if NUMBOOT > 1 ! " disklabel -B [-n] [ -b boot1 [ -s boot2 ] ] disk [ type ]", "\t\t(to install boot program with existing label)", ! " disklabel -w -B [-n] [ -b boot1 [ -s boot2 ] ] disk type [ packid ]", "\t\t(to write label and boot program)", ! " disklabel -R -B [-n] [ -b boot1 [ -s boot2 ] ] disk protofile [ type ]", "\t\t(to restore label and boot program)", #else ! " disklabel -B [-n] [ -b bootprog ] disk [ type ]", "\t\t(to install boot program with existing on-disk label)", ! " disklabel -w -B [-n] [ -b bootprog ] disk type [ packid ]", "\t\t(to write label and install boot program)", ! " disklabel -R -B [-n] [ -b bootprog ] disk protofile [ type ]", "\t\t(to restore label and install boot program)", #endif " disklabel [-NW] disk", *************** *** 1447,1457 **** #else fprintf(stderr, "%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n", "usage: disklabel [-r] disk", "(to read label)", ! " disklabel -w [-r] disk type [ packid ]", "\t\t(to write label)", ! " disklabel -e [-r] disk", "\t\t(to edit label)", ! " disklabel -R [-r] disk protofile", "\t\t(to restore label)", " disklabel [-NW] disk", "\t\t(to write disable/enable label)"); --- 1727,1737 ---- #else fprintf(stderr, "%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n%s\n", "usage: disklabel [-r] disk", "(to read label)", ! 
" disklabel -w [-r] [-n] disk type [ packid ]", "\t\t(to write label)", ! " disklabel -e [-r] [-n] disk", "\t\t(to edit label)", ! " disklabel -R [-r] [-n] disk protofile", "\t\t(to restore label)", " disklabel [-NW] disk", "\t\t(to write disable/enable label)"); To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 15: 1:26 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id AA70837B4D7 for ; Thu, 2 Nov 2000 15:01:23 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA2N1H441744; Thu, 2 Nov 2000 15:01:17 -0800 (PST) (envelope-from dillon) Date: Thu, 2 Nov 2000 15:01:17 -0800 (PST) From: Matt Dillon Message-Id: <200011022301.eA2N1H441744@earth.backplane.com> To: Alfred Perlstein Cc: Marius Bendiksen , Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: <200011021725.eA2HPeM38718@earth.backplane.com> <20001102132140.W20567@fw.wintelcom.net> <200011022135.eA2LZA740940@earth.backplane.com> <20001102135754.Y20567@fw.wintelcom.net> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG :the problem isn't caching them, it's fsyncing them during appends :that cause additinal disk seeks. But that's not exactly a deadly :problem, just a little suboptimal. : :it's also fsyncs on newly created files that can cause problems. : :-- :-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] This shouldn't be an issue. I don't know about Oracle, but with my database I pre-extend the file (e.g. in 1 MB increments, by writing zero's to the file rather then ftruncate()ing), and you don't have to fsync() every block when pre-extending a file. The database appends into space already allocated from the file extension and so no additional seeking occurs. 
-Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 15: 6:47 2000 Delivered-To: freebsd-arch@freebsd.org Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134]) by hub.freebsd.org (Postfix) with ESMTP id 261A237B4E5 for ; Thu, 2 Nov 2000 15:06:40 -0800 (PST) Received: (from daemon@localhost) by smtp04.primenet.com (8.9.3/8.9.3) id QAA05141; Thu, 2 Nov 2000 16:03:22 -0700 (MST) Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp04.primenet.com, id smtpdAAAvraaSj; Thu Nov 2 16:03:05 2000 Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5) id QAA20908; Thu, 2 Nov 2000 16:06:16 -0700 (MST) From: Terry Lambert Message-Id: <200011022306.QAA20908@usr09.primenet.com> Subject: Re: Like to commit my diskprep To: mbendiks@eunet.no (Marius Bendiksen) Date: Thu, 2 Nov 2000 23:06:16 +0000 (GMT) Cc: dillon@earth.backplane.com (Matt Dillon), rjesup@wgate.com (Randell Jesup), arch@FreeBSD.ORG In-Reply-To: from "Marius Bendiksen" at Nov 02, 2000 10:19:09 PM X-Mailer: ELM [version 2.5 PL2] MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > FFS is woefully inadequate at handling databases, due to the block > indirection, but e.g. Oracle will allow you to run directly on top > of a device. Wrong. The problem is in not exporting a transaction interface to user space, which means that the databases have to resort to multistage commit tactics, which can seriously damage performance. Even worse, there is no such thing as a "region sync" (at least in FreeBSD: other OSs have it), so you have to sync out all data via a linear traversal of the dirty block list, using fsync() to do the job, which further degrades performance. 
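To make the multistage commit concrete, here is a minimal sketch of what a database layered on a plain file might do, assuming POSIX write() and fsync(). The function names `write_all` and `commit_record` and the one-byte commit marker are hypothetical, for illustration only. Note that each fsync() flushes every dirty block of the file, not just the region belonging to this transaction:

```c
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes, retrying on short writes; 0 on success, -1 on error. */
static int
write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t w = write(fd, p, len);

		if (w <= 0)
			return (-1);
		p += w;
		len -= (size_t)w;
	}
	return (0);
}

/*
 * Hypothetical two-stage commit of one record, appended at the current
 * file offset. Returns 0 on success, -1 on error.
 */
static int
commit_record(int fd, const void *rec, size_t len)
{
	const char marker = 'C';

	/* Stage 1: the record body must reach disk before the marker. */
	if (write_all(fd, rec, len) != 0 || fsync(fd) != 0)
		return (-1);
	/* Stage 2: the marker makes the record valid; sync again. */
	if (write_all(fd, &marker, 1) != 0 || fsync(fd) != 0)
		return (-1);
	return (0);
}
```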
This is greatly exacerbated by the fact that you really only want to commit one transaction at a time, not all transactions, and each transaction is going to do the same thing, not knowing about the others (this is where the block indirection hurts you, but it's not because block indirection is bad, it's because what has to be done on top of the FS semantics to get transaction guarantees interacts badly with it, if the API is deficient). All of this is further exacerbated out of all proportion by the fact that this means that concurrent transactions can not proceed concurrently across an fsync(2), since by its nature, you can't fsync(2) unless all data to be fsync(2)'ed represents completed transactions. So you must implement phase concurrency, to ensure that the second phase of a two phase commit isn't started with the first phase of a two phase commit outstanding, or you must serialize operations through a turnstile algorithm. Either way, all of this turns fsync(2) into a phase stall barrier (at best) or a full on stalling barrier (at worst). For an O(2) taste of this O(6) (O(4)*O(2)) problem, compare normal write operations between an FS mounted with soft updates, and the same FS mounted synchronous. And now you see the database problem. It should be obvious that transactions could be implemented as two node single edges, in the context of soft updates, and the resulting transaction interface exported to user space, and used by a database application, to turn it back into an O(1) problem: which is what a database vendor does when they use a raw disk. In other words, the FFS _is_ a database, with some important semantics not being exported for use by database software which may be layered on top of it.
This is also a good place to see that implementing soft updates as a graph of precomputed node relationships was probably not the wisest move, since mount time computation of the relationships (accompanied by node-node edge [dependency] resolving code), would have let you implement a transaction layer as a stacking layer, not to mention that it would let you apply soft updates to other layered FSs, even statically configured stacks (like, say, EXT2FS). NB: obviously O(1) is the base order, since dependent transactions will increase the natural order by the number of dependents, but for an N of 3, would you rather have an O(1)*O(3)=O(3) event, or an O(2)*O(4)*O(3)=O(9) event... and that with no concurrency of other independent operations hitting the database at the same time? For 1024 records, that's 1024**3 (10**9) or 1024**9 (1.2*(10**27)), for the worst case. Terry Lambert terry@lambert.org --- Any opinions in this posting are my own and not those of my present or previous employers. To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 15:10:46 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id 2DC5637B681 for ; Thu, 2 Nov 2000 15:10:42 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA2N9BC41849; Thu, 2 Nov 2000 15:09:11 -0800 (PST) (envelope-from dillon) Date: Thu, 2 Nov 2000 15:09:11 -0800 (PST) From: Matt Dillon Message-Id: <200011022309.eA2N9BC41849@earth.backplane.com> To: Randell Jesup Cc: Randell Jesup , Marius Bendiksen , arch@FreeBSD.ORG, Warner Losh Subject: Re: Like to commit my diskprep References: <200011021632.eA2GWZ138286@earth.backplane.com> <200011021725.eA2HPeM38718@earth.backplane.com> Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG : :Following are changes
to disklabel.c to change it to be much more user- :friendly (and kick out more warnings for things like overlapped :partitions). :.. Damn that's cool! I'd say commit it to -current now. If you want I can add documentation to disklabel.8 to describe your new features but if you want to do it yourself please use the documentation base already in current (which has my documentation additions from the slice fix), so we don't have to do a monster merge. -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 15:23:19 2000 Delivered-To: freebsd-arch@freebsd.org Received: from dragon.nuxi.com (trang.nuxi.com [209.152.133.57]) by hub.freebsd.org (Postfix) with ESMTP id C4FCA37B479 for ; Thu, 2 Nov 2000 15:23:16 -0800 (PST) Received: (from obrien@localhost) by dragon.nuxi.com (8.9.3/8.9.1) id PAA18153; Thu, 2 Nov 2000 15:23:13 -0800 (PST) (envelope-from obrien) Date: Thu, 2 Nov 2000 15:23:13 -0800 From: "David O'Brien" To: Randell Jesup Cc: arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep Message-ID: <20001102152313.B17358@dragon.nuxi.com> Reply-To: arch@FreeBSD.ORG References: <200011021632.eA2GWZ138286@earth.backplane.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2.5i In-Reply-To: ; from rjesup@wgate.com on Thu, Nov 02, 2000 at 12:02:13PM -0500 X-Operating-System: FreeBSD 5.0-CURRENT Organization: The NUXI BSD group X-Pgp-Rsa-Fingerprint: B7 4D 3E E9 11 39 5F A3 90 76 5D 69 58 D9 98 7A X-Pgp-Rsa-Keyid: 1024/34F9F9D5 Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG On Thu, Nov 02, 2000 at 12:02:13PM -0500, Randell Jesup wrote: > The defaults for -b and -f and -c for newfs/etc are WOEFULLY > out-of-date. Perhaps we should discuss what the defaults should be updated to. 
(no theory please, only tested values) To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 17:12:52 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail-relay.eunet.no (mail-relay.eunet.no [193.71.71.242]) by hub.freebsd.org (Postfix) with ESMTP id E170437B4CF for ; Thu, 2 Nov 2000 17:12:49 -0800 (PST) Received: from login-1.eunet.no (login-1.eunet.no [193.75.110.2]) by mail-relay.eunet.no (8.9.3/8.9.3/GN) with ESMTP id CAA03051; Fri, 3 Nov 2000 02:12:47 +0100 (CET) (envelope-from mbendiks@eunet.no) Received: from localhost (mbendiks@localhost) by login-1.eunet.no (8.9.3/8.8.8) with ESMTP id CAA14097; Fri, 3 Nov 2000 02:12:47 +0100 (CET) (envelope-from mbendiks@eunet.no) X-Authentication-Warning: login-1.eunet.no: mbendiks owned process doing -bs Date: Fri, 3 Nov 2000 02:12:46 +0100 (CET) From: Marius Bendiksen To: Matt Dillon Cc: Alfred Perlstein , Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep In-Reply-To: <200011022135.eA2LZA740940@earth.backplane.com> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG > Indirect blocks aren't relevant if you are using a large block size, > because there are few enough of them the OS has no problem caching > them. The problem is related to highly random access, as the indirect blocks will tend to get pushed out of the cache on occasion, requiring multiple seeks when the file is being accessed. Using extents will solve this. > 32K block size 4MB Note that these 4MB are better spent on caching real data than they are on compensating for the absence of extents in the FFS inode. > It becomes somewhat more of an issue for a terrabyte-sized database, > but still no biggy considering the memory you can get these days. I reiterate the above point. 
The kind of memory in question here is really way over the top, compared to the 8/16 bytes required to hold an extent reference and the bit to indicate that the inode uses such. > A raw device will still be better, but not by much. "But not by much" depends on the actor operating upon it, amongst other things. Marius To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 17:19:14 2000 Delivered-To: freebsd-arch@freebsd.org Received: from mail-relay.eunet.no (mail-relay.eunet.no [193.71.71.242]) by hub.freebsd.org (Postfix) with ESMTP id 38DCC37B4CF for ; Thu, 2 Nov 2000 17:19:11 -0800 (PST) Received: from login-1.eunet.no (login-1.eunet.no [193.75.110.2]) by mail-relay.eunet.no (8.9.3/8.9.3/GN) with ESMTP id CAA03153; Fri, 3 Nov 2000 02:19:10 +0100 (CET) (envelope-from mbendiks@eunet.no) Received: from localhost (mbendiks@localhost) by login-1.eunet.no (8.9.3/8.8.8) with ESMTP id CAA14117; Fri, 3 Nov 2000 02:19:10 +0100 (CET) (envelope-from mbendiks@eunet.no) X-Authentication-Warning: login-1.eunet.no: mbendiks owned process doing -bs Date: Fri, 3 Nov 2000 02:19:10 +0100 (CET) From: Marius Bendiksen To: Randell Jesup Cc: arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG This sounds just lovely to me. Please commit it to current ASAP. 
=) --- Marius Bendiksen To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 19:39:19 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id 220D837B4C5 for ; Thu, 2 Nov 2000 19:39:17 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA33d9D43976; Thu, 2 Nov 2000 19:39:09 -0800 (PST) (envelope-from dillon) Date: Thu, 2 Nov 2000 19:39:09 -0800 (PST) From: Matt Dillon Message-Id: <200011030339.eA33d9D43976@earth.backplane.com> To: Marius Bendiksen Cc: Alfred Perlstein , Randell Jesup , arch@freebsd.org Subject: Re: Like to commit my diskprep References: Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG : :> Indirect blocks aren't relevant if you are using a large block size, :> because there are few enough of them the OS has no problem caching :> them. : :The problem is related to highly random access, as the indirect blocks :will tend to get pushed out of the cache on occasion, requiring multiple :seeks when the file is being accessed. Using extents will solve this. : :> 32K block size 4MB : :Note that these 4MB are better spent on caching real data than they are on :compensating for the absence of extents in the FFS inode. :.. : :> It becomes somewhat more of an issue for a terabyte-sized database, :> but still no biggy considering the memory you can get these days. : :I reiterate the above point. The kind of memory in question here is really :way over the top, compared to the 8/16 bytes required to hold an extent :reference and the bit to indicate that the inode uses such. : :Marius Let's put things into perspective here. You have a multi-gig or terabyte database, and that pretty much means you have to have at least a gig of ram in the machine that's going to be accessing it.
Otherwise why bother with caching at all? If you have a machine with a gig of ram, losing 4MB is REALLY not a big deal. -Matt To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Thu Nov 2 21:20:31 2000 Delivered-To: freebsd-arch@freebsd.org Received: from panzer.kdm.org (panzer.kdm.org [216.160.178.169]) by hub.freebsd.org (Postfix) with ESMTP id AC4D137B4C5; Thu, 2 Nov 2000 21:20:25 -0800 (PST) Received: (from ken@localhost) by panzer.kdm.org (8.9.3/8.9.1) id WAA13439; Thu, 2 Nov 2000 22:20:20 -0700 (MST) (envelope-from ken) Date: Thu, 2 Nov 2000 22:20:19 -0700 From: "Kenneth D. Merry" To: net@FreeBSD.org Subject: new zero copy sockets and NFS snapshot Message-ID: <20001102222019.A13422@panzer.kdm.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.2i Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG [ -arch and -current BCC'ed for wider coverage, please direct followups to -net and/or me ] I have put a new copy of the zero copy sockets and NFS patches, against -current as of early October 30th, 2000, here: http://people.FreeBSD.ORG/~ken/zero_copy/ Questions, comments and feedback are welcome. Besides being generated against a newer version of -current, the following things have changed in the new patches posted above: - Robert Picco's zero copy send code has been removed. It was never fixed to eliminate a data corruption problem, and it is likely that Drew Gallatin's code will make it into -current instead. - Bring the major number used in the ti(4) driver in line with the one we have reserved in sys/conf/majors. - Make sure calls to ti_hdr_split() are only made inside #ifdef TI_JUMBO_HDRSPLIT. - Convert the non-stock portions of the ti(4) driver from spls to mutexes. - Get rid of an extra make_dev(), and make sure the one in ti_attach() comes before we return. 
For those of you who missed the previous messages about this code (that went out to -net, -arch and -current), here's a quick list of what is included in the code: - Zero copy send and receive code, written by Drew Gallatin . - Zero copy NFS code, written by Drew Gallatin. - Header splitting firmware for Alteon's Tigon II boards (written by me), based on version 12.4.11 of their firmware. This is used in combination with the zero copy receive code to guarantee that the payload of TCP or UDP packet is placed into a page-aligned buffer. - Alteon firmware debugging ioctls and supporting routines for the Tigon driver (also written by me). This will help anyone who is doing firmware development under FreeBSD for the Tigon boards. The Alteon header splitting and debugging code was written for Pluto Technologies (www.plutotech.com), which kindly agreed to let me release the code. Ken -- Kenneth Merry ken@kdm.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Fri Nov 3 1:59:37 2000 Delivered-To: freebsd-arch@freebsd.org Received: from flood.ping.uio.no (flood.ping.uio.no [129.240.78.31]) by hub.freebsd.org (Postfix) with ESMTP id B88A037B4D7 for ; Fri, 3 Nov 2000 01:59:34 -0800 (PST) Received: (from des@localhost) by flood.ping.uio.no (8.9.3/8.9.3) id KAA38781; Fri, 3 Nov 2000 10:59:32 +0100 (CET) (envelope-from des@ofug.org) X-URL: http://www.ofug.org/~des/ X-Disclaimer: The views expressed in this message do not necessarily coincide with those of any organisation or company with which I am or have been affiliated. 
To: Marius Bendiksen Cc: Matt Dillon , Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: From: Dag-Erling Smorgrav Date: 03 Nov 2000 10:59:32 +0100 In-Reply-To: Marius Bendiksen's message of "Thu, 2 Nov 2000 22:14:00 +0100 (CET)" Message-ID: Lines: 17 User-Agent: Gnus/5.0802 (Gnus v5.8.2) Emacs/20.4 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG Marius Bendiksen writes: > > e.g. there wouldn't be a whole lot of need for a 10G /usr. Once hard > > drives got big enough I just left it at 2G. > > This is a matter of preference (hence the reference to paint) and also the > use intended for the system in question. However, the code is not going to > be significantly more complex due to this, and I think it's a much better, > ie cleaner, way of doing it. One question that probably interests many of us is, can tuning those numbers reduce fsck time? Is fsck time strictly proportional to disk size, or does the number of inodes and/or cylinder groups affect it? 
DES -- Dag-Erling Smorgrav - des@ofug.org To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-arch" in the body of the message From owner-freebsd-arch Fri Nov 3 2:15:31 2000 Delivered-To: freebsd-arch@freebsd.org Received: from earth.backplane.com (placeholder-dcat-1076843399.broadbandoffice.net [64.47.83.135]) by hub.freebsd.org (Postfix) with ESMTP id EB45737B4C5 for ; Fri, 3 Nov 2000 02:15:29 -0800 (PST) Received: (from dillon@localhost) by earth.backplane.com (8.11.1/8.9.3) id eA3AEpV45562; Fri, 3 Nov 2000 02:14:51 -0800 (PST) (envelope-from dillon) Date: Fri, 3 Nov 2000 02:14:51 -0800 (PST) From: Matt Dillon Message-Id: <200011031014.eA3AEpV45562@earth.backplane.com> To: Dag-Erling Smorgrav Cc: Marius Bendiksen , Randell Jesup , arch@FreeBSD.ORG Subject: Re: Like to commit my diskprep References: Sender: owner-freebsd-arch@FreeBSD.ORG Precedence: bulk X-Loop: FreeBSD.ORG :> This is a matter of preference (hence the reference to paint) and also the :> use intended for the system in question. However, the code is not going to :> be significantly more complex due to this, and I think it's a much better, :> ie cleaner, way of doing it. : :One question that probably interests many of us is, can tuning those :numbers reduce fsck time? Is fsck time strictly proportional to disk :size, or does the number of inodes and/or cylinder groups affect it? : :DES :-- :Dag-Erling Smorgrav - des@ofug.org Yes. Increasing the number of bytes per inode will reduce the number of inodes and thus reduce fsck time. Increasing the number of cylinders in a group will localize inodes into bigger chunks, reducing seeking and also thus reduce fsck time. 
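As a rough illustration of the first point, the inode count scales inversely with the bytes-per-inode setting. The sketch below is only an approximation under stated assumptions: the function name `estimate_inodes` is hypothetical, and a real newfs rounds the count per cylinder group, so actual figures will differ somewhat:

```c
#include <stdint.h>

/*
 * Hypothetical estimate: one inode is allocated for every
 * bytes_per_inode bytes of filesystem space. Returns 0 if
 * bytes_per_inode is 0 to avoid dividing by zero.
 */
static uint64_t
estimate_inodes(uint64_t fs_bytes, uint64_t bytes_per_inode)
{
	if (bytes_per_inode == 0)
		return (0);
	return (fs_bytes / bytes_per_inode);
}
```

For a 2 GB filesystem, going from 4096 to 16384 bytes per inode cuts the estimate from 524288 inodes to 131072, a quarter as many for fsck to examine.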

                                        -Matt

From owner-freebsd-arch Fri Nov 3 2:28:58 2000
To: Matt Dillon
Cc: Marius Bendiksen, Randell Jesup, arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
References: <200011031014.eA3AEpV45562@earth.backplane.com>
From: Dag-Erling Smorgrav
Date: 03 Nov 2000 11:28:41 +0100
In-Reply-To: Matt Dillon's message of "Fri, 3 Nov 2000 02:14:51 -0800 (PST)"

Matt Dillon writes:
> Yes. Increasing the number of bytes per inode will reduce the number
> of inodes and thus reduce fsck time. Increasing the number of cylinders
> in a group will localize inodes into bigger chunks, reducing seeking
> and also thus reduce fsck time.

That was what I hoped - thanks!

DES
--
Dag-Erling Smorgrav - des@ofug.org

From owner-freebsd-arch Fri Nov 3 2:31:14 2000
To: Dag-Erling Smorgrav
Cc: Matt Dillon, Marius Bendiksen, Randell Jesup, arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
In-Reply-To: Your message of "03 Nov 2000 11:28:41 +0100."
Date: Fri, 03 Nov 2000 11:31:02 +0100
Message-ID: <4559.973247462@critter>
From: Poul-Henning Kamp

Dag-Erling Smorgrav writes:
>Matt Dillon writes:
>> Yes. Increasing the number of bytes per inode will reduce the number
>> of inodes and thus reduce fsck time. Increasing the number of cylinders
>> in a group will localize inodes into bigger chunks, reducing seeking
>> and also thus reduce fsck time.
>
>That was what I hoped - thanks!

...and setting your fragment size to 4k has a surprisingly small positive
effect on your performance.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch Fri Nov 3 6:42:45 2000
Message-Id: <200011031440.eA3Eebp39614@cwsys.cwsent.com>
Reply-To: Cy Schubert - ITSD Open Systems Group
From: Cy Schubert - ITSD Open Systems Group
To: Matt Dillon
Cc: Dag-Erling Smorgrav, Marius Bendiksen, Randell Jesup, arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
In-reply-to: Your message of "Fri, 03 Nov 2000 02:14:51 PST."
    <200011031014.eA3AEpV45562@earth.backplane.com>
Date: Fri, 03 Nov 2000 06:39:47 -0800

In message <200011031014.eA3AEpV45562@earth.backplane.com>, Matt Dillon writes:
>
> :> This is a matter of preference (hence the reference to paint) and also the
> :> use intended for the system in question. However, the code is not going to
> :> be significantly more complex due to this, and I think it's a much better,
> :> ie cleaner, way of doing it.
> :
> :One question that probably interests many of us is, can tuning those
> :numbers reduce fsck time? Is fsck time strictly proportional to disk
> :size, or does the number of inodes and/or cylinder groups affect it?
> :
> :DES
> :--
> :Dag-Erling Smorgrav - des@ofug.org
>
> Yes. Increasing the number of bytes per inode will reduce the number
> of inodes and thus reduce fsck time. Increasing the number of cylinders
> in a group will localize inodes into bigger chunks, reducing seeking
> and also thus reduce fsck time.

Wouldn't that tend to generally reduce day-to-day performance as well?
I suspect that Kirk and co. at CSRG had a good reason for choosing the
defaults they did.

Regards,                       Phone:  (250)387-8437
Cy Schubert                      Fax:  (250)387-5766
Team Leader, Sun/DEC Team   Internet:  Cy.Schubert@osg.gov.bc.ca
Open Systems Group, ITSD, ISTA
Province of BC

From owner-freebsd-arch Fri Nov 3 6:49:55 2000
To: Cy Schubert - ITSD Open Systems Group
Cc: Matt Dillon, Marius Bendiksen, Randell Jesup, arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
References: <200011031440.eA3Eebp39614@cwsys.cwsent.com>
From: Dag-Erling Smorgrav
Date: 03 Nov 2000 15:49:45 +0100
In-Reply-To: Cy Schubert - ITSD Open Systems Group's message of "Fri, 03 Nov 2000 06:39:47 -0800"

Cy Schubert - ITSD Open Systems Group writes:
> Wouldn't that tend to generally reduce day-to-day performance as well?
> I suspect that Kirk and co. at CSRG had a good reason for choosing the
> defaults they did.

Certainly, but I believe these defaults were chosen nearly ten years
ago (if not more) on hardware which we today charitably describe as
"antiquated" :)

DES
--
Dag-Erling Smorgrav - des@ofug.org

From owner-freebsd-arch Fri Nov 3 6:50:35 2000
Message-ID: <3A02D061.225A66CA@sun.com>
Date: Fri, 03 Nov 2000 15:49:05 +0100
From: Michael Schuster - Sun Germany
To: Cy Schubert - ITSD Open Systems Group
Cc: Matt Dillon, Dag-Erling Smorgrav, Marius Bendiksen, Randell Jesup, arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
References: <200011031440.eA3Eebp39614@cwsys.cwsent.com>

Cy Schubert - ITSD Open Systems Group wrote:
> > Yes. Increasing the number of bytes per inode will reduce the number
> > of inodes and thus reduce fsck time. Increasing the number of cylinders
> > in a group will localize inodes into bigger chunks, reducing seeking
> > and also thus reduce fsck time.
>
> Wouldn't that tend to generally reduce day-to-day performance as well?
> I suspect that Kirk and co. at CSRG had a good reason for choosing the
> defaults they did.

I don't think you can generalise. It very much depends on what you're
doing with your filesystem. E.g. an application writing only log-like data
will exercise the FS quite differently from one where many small files are
constantly being changed in a random manner, and it will differ again if
you use your FS mostly read-only. Finally, this also depends on how the
data is organised on disk (I'll only say "big directories").

cheers
Michael
--
Michael Schuster / Michael.Schuster@sun.com
Sun Microsystems GmbH / Sonnenallee 1, D-85551 Heimstetten
(+49 89) 46008-2974 / x62974

Recursion, n.: see 'Recursion'

From owner-freebsd-arch Fri Nov 3 6:58:21 2000
Message-Id: <200011031457.eA3Evdr39709@cwsys.cwsent.com>
Reply-To: Cy Schubert - ITSD Open Systems Group
From: Cy Schubert - ITSD Open Systems Group
To:
    Matt Dillon
Cc: Marius Bendiksen, Alfred Perlstein, Randell Jesup, arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
In-reply-to: Your message of "Thu, 02 Nov 2000 19:39:09 PST."
    <200011030339.eA33d9D43976@earth.backplane.com>
Date: Fri, 03 Nov 2000 06:56:53 -0800

In message <200011030339.eA33d9D43976@earth.backplane.com>, Matt Dillon writes:
>
> :
> :> Indirect blocks aren't relevant if you are using a large block size,
> :> because there are few enough of them the OS has no problem caching
> :> them.
> :
> :The problem is related to highly random access, as the indirect blocks
> :will tend to get pushed out of the cache on occasion, requiring multiple
> :seeks when the file is being accessed. Using extents will solve this.
> :
> :> 32K block size 4MB
> :
> :Note that these 4MB are better spent on caching real data than they are on
> :compensating for the absence of extents in the FFS inode.
> :..
> :
> :> It becomes somewhat more of an issue for a terabyte-sized database,
> :> but still no biggy considering the memory you can get these days.
> :
> :I reiterate the above point. The kind of memory in question here is really
> :way over the top, compared to the 8/16 bytes required to hold an extent
> :reference and the bit to indicate that the inode uses such.
> :
> :Marius
>
> Let's put things into perspective here. You have a multi-gig or terabyte
> database, and that pretty much means you have to have at least a gig of ram
> in the machine that's going to be accessing it. Otherwise why bother
> with caching at all?
>
> If you have a machine with a gig of ram, losing 4MB is REALLY not a big
> deal.

Assuming an Oracle database of that size, you probably want at least
12-16 GB of memory with 80% of it used for SGA. I would think that most
other DBMSs would have the same features and requirements.

The Veritas filesystem (vxfs) has an option to turn off caching for a
specified filesystem. Otherwise Oracle would cache in the SGA and the
filesystem would cache as well. Prior to this, performance-wise, you'd
be better off using raw slices.

The ability to turn off caching of data, while still caching metadata,
would be a good mount or tunefs option for applications that perform
their own caching, e.g. Oracle.

Regards,                       Phone:  (250)387-8437
Cy Schubert                      Fax:  (250)387-5766
Team Leader, Sun/DEC Team   Internet:  Cy.Schubert@osg.gov.bc.ca
Open Systems Group, ITSD, ISTA
Province of BC

From owner-freebsd-arch Fri Nov 3 8:10:28 2000
Reply-To: Randell Jesup
To: Dag-Erling Smorgrav
Cc: Cy Schubert - ITSD Open Systems Group, Matt Dillon, Marius Bendiksen, Randell Jesup, arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
References: <200011031440.eA3Eebp39614@cwsys.cwsent.com>
From: Randell Jesup
Date: 03 Nov 2000 11:14:21 -0500
In-Reply-To: Dag-Erling Smorgrav's message of "03 Nov 2000 15:49:45 +0100"

Dag-Erling Smorgrav writes:
>Cy Schubert - ITSD Open Systems Group writes:
>> Wouldn't that tend to generally reduce day-to-day performance as well?
>> I suspect that Kirk and co. at CSRG had a good reason for choosing the
>> defaults they did.
>
>Certainly, but I believe these defaults were chosen nearly ten years
>ago (if not more) on hardware which we today charitably describe as
>"antiquated" :)

        At least 10 years; perhaps more, with primary drives that were
probably in the 40-100MB range, no on-drive cache, 100ms seek times, and
max throughputs of well under 1MB per second.

        Average file size may not have changed dramatically (though I'm
sure it's gone up), but the distribution has changed, especially through a
thickening of its tail (files over a few hundred K or a MB).

--
Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94)
rjesup@wgate.com

From owner-freebsd-arch Fri Nov 3 8:50:54 2000
Date: Fri, 03 Nov 2000 08:50:38 -0800 (PST)
From: "Duane H. Hesser"
To: Dag-Erling Smorgrav
Subject: Re: Like to commit my diskprep
Cc: arch@FreeBSD.ORG, Randell Jesup, Marius Bendiksen, Matt Dillon, Cy Schubert - ITSD Open Systems Group

On 03-Nov-00 Dag-Erling Smorgrav wrote:
...
>
> Certainly, but I believe these defaults were chosen nearly ten years
> ago (if not more) on hardware which we today charitably describe as
> "antiquated" :)
>
> DES
> --
> Dag-Erling Smorgrav - des@ofug.org
...

You are too optimistic when you say "nearly ten years".
McKusick, et al's paper "A Fast Filesystem for Unix", which describes the
design of the 4.2BSD FFS, and some of the testing upon which it was based,
is marked as "Revised July 27, 1983" in my copy of the 4.2BSD manuals
printed by Usenix for 4.2BSD. The copy in /usr/share/doc/smm/05.fastfs/
is "Revised February 18, 1984".

How time flies when you're having fun.

From the paper, disk tests were done on a Vax 750 with Unibus and Massbus
controllers and an "Ampex Capricorn 330 Megabyte Winchester disk", with a
"bandwidth" on the order of 500 KB/sec. "Antiquated" may be a fair
description.

That was a big disk in those days. An 8K blocksize gave an order of
magnitude transfer speed improvement over a 1K blocksize filesystem, and
used almost all of the cpu (with a Unibus controller). No "smart" disks,
remember. I don't recall even seeing test results for 16K or larger, and
certainly 8K was the largest size which could be configured in 4.2BSD.

Perhaps it *is* time to rethink defaults. Probably should be done at
least once every millennium.

--------------
Duane H.
Hesser
dhh@androcles.com

From owner-freebsd-arch Fri Nov 3 8:54: 4 2000
Reply-To: Randell Jesup
To: Marius Bendiksen
Cc: Alfred Perlstein, Matt Dillon, Randell Jesup, arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
From: Randell Jesup
Date: 03 Nov 2000 11:57:58 -0500
In-Reply-To: Marius Bendiksen's message of "Thu, 2 Nov 2000 23:29:03 +0100 (CET)"

Marius Bendiksen writes:
>Actually, block indirection could be fixed by raping the code to support
>the notion of extents. As to allocating in the general locality of the
>inode or datablock, that would require you to be within a distance of 1
>track, and on a system with better things to spend its cache on than
>indirect blocks, you'll lose some when you hit double or triple indirect,
>especially with random access.

        1 track? Not really. Modern drives have internal caches and
generally aggressively read-ahead. There was an interesting paper in SIGOS
(I think) around a year ago about inode locality, forward placement, and
storing small files in the inode, and how all of this interacted with
modern drives.

        Also, what is a "track" on a modern drive? ;-)

>As a side note, I've thought about abusing the actual inodes themselves to
>hold single indirect blocks.
>Opinions, apart from the general evilness of
>abusing the structures in such a fashion?

        That sounds good.

>> Yes, patches would be nice. :)
>
>Patches cannot be formed until a general consensus exists on how the
>patches should do things if and when an enterprising soul made them.
>Otherwise, they stand a good chance at being rejected based on some,
>possibly relevant, objection to how they work.
>
>Also, such patches are likely best formed by the same people that are
>currently suggesting doing a variety of other things for disklabel and
>friends.

        I'm willing to help on this, though my time may be limited. I
have _extensive_ FS experience from my Amiga days, and also was the primary
disk-driver person and SCSI expert, and also did "archive" filesystems for
Scala. I've never hacked the internals of ufs, however, but I do know the
issues.

--
Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94)
rjesup@wgate.com

From owner-freebsd-arch Fri Nov 3 10:23:32 2000
Reply-To: Randell Jesup
To: Barry Pederson
Cc: arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
References: <200011021632.eA2GWZ138286@earth.backplane.com> <3A01C6A3.23C4CD61@medicine.nodak.edu>
From: Randell Jesup
Date: 03 Nov 2000 13:27:26 -0500
In-Reply-To: Barry Pederson's message of "Thu, 02 Nov 2000 13:55:15 -0600"

Barry Pederson writes:
>> The defaults for -b and -f and -c for newfs/etc are WOEFULLY
>> out-of-date. See the sysinstall checkin comment I referenced. I use 16K
>> myself. It's possible larger might be better, especially for large
>> partitions - perhaps make it variable on partition size.... And 16 for cpg
>> is truly criminal (can you say thousands of spare root blocks? And very
>> slow newfs?)
>
>The man page for newfs says:
>
>---------
>BUGS
>     The boot code of FreeBSD assumes that the file system that carries the
>     kernel has blocks of 8 kilobytes and fragments of 1 kilobyte. You will
>     not be able to boot from a file system that uses another size.
>---------
>
>So I'd assume you have to be careful to leave the root at the current
>defaults? (or make the boot code smarter?)

        From my system:

  a:   819200        0    4.2BSD     4096 16384    75   # (Cyl.    0 - 812*)

        So I think that documentation is buggy, not the code. Anyone else
care to confirm?

--
Randell Jesup, Worldgate Communications, ex-Scala, ex-Amiga OS team ('88-94)
rjesup@wgate.com

From owner-freebsd-arch Fri Nov 3 10:27:28 2000
Date: Fri, 3 Nov 2000 10:27:18 -0800 (PST)
From: Matt Dillon
Message-Id: <200011031827.eA3IRIM47704@earth.backplane.com>
To: Cy Schubert - ITSD Open Systems Group
Cc: Dag-Erling Smorgrav, Marius Bendiksen, Randell Jesup, arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
References: <200011031440.eA3Eebp39614@cwsys.cwsent.com>

:>
:> Yes. Increasing the number of bytes per inode will reduce the number
:> of inodes and thus reduce fsck time. Increasing the number of cylinders
:> in a group will localize inodes into bigger chunks, reducing seeking
:> and also thus reduce fsck time.
:
:Wouldn't that tend to generally reduce day-to-day performance as well?
:I suspect that Kirk and co. at CSRG had a good reason for choosing the
:defaults they did.
:
:Regards,                       Phone:  (250)387-8437
:Cy Schubert                      Fax:  (250)387-5766

    Increasing the bytes per inode simply reduces the number of inodes
    available on the filesystem -- that is, reduces the number of files
    you can create. There is no performance penalty but, of course, it
    makes it more likely that you will run out of inodes. Setting the
    number of bytes per inode depends heavily on what you intend to use
    a filesystem for.

    Increasing the number of cylinders in a cylinder group has the effect
    of breaking the overall disk into fewer, larger chunks rather than
    many smaller chunks. It has no effect on the number of inodes.

    Ask yourself whether it makes sense to break a modern drive up into
    2000 chunks, which is what newfs currently does in this example:

    serv05:/home/dillon# newfs /dev/da1s1d
    Warning: 116 sector(s) in last cylinder unallocated
    /dev/da1s1d:    143363980 sectors in 35001 cylinders of 1 tracks, 4096 sectors
            70001.9MB in 2188 cyl groups (16 c/g, 32.00MB/g, 4096 i/g)

    Or whether it makes sense to break it up into, say, 400 chunks, which
    is what you get with:

    serv05:/home/dillon# newfs -c 89 /dev/da1s1d
    Warning: 116 sector(s) in last cylinder unallocated
    /dev/da1s1d:    143363980 sectors in 35001 cylinders of 1 tracks, 4096 sectors
            70001.9MB in 394 cyl groups (89 c/g, 178.00MB/g, 22400 i/g)
    super-block backups (for fsck -b #) at:

    The answer should be self evident. It is true that the bitmaps are
    larger, but it can also easily be argued that the bitmaps you get with
    the default are far too small. Modern drives cache data as do filesystems.
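
[Editorial sketch, not part of the original message.] The group counts and
group sizes in the two newfs runs quoted above can be reproduced from the
reported geometry. The Python below is only a back-of-the-envelope check,
not anything newfs itself runs. It assumes 512-byte sectors, which the
newfs output does not state, and the helper name cyl_group_layout is made
up for illustration.

```python
import math

SECTOR_BYTES = 512  # assumption: sector size is not shown in the newfs output


def cyl_group_layout(cylinders, sectors_per_cyl, cyl_per_group):
    """Return (number of cylinder groups, megabytes per group)."""
    # A partial group at the end of the disk still counts as a group.
    groups = math.ceil(cylinders / cyl_per_group)
    mb_per_group = cyl_per_group * sectors_per_cyl * SECTOR_BYTES / 2**20
    return groups, mb_per_group


# Geometry reported by newfs: 35001 cylinders of 4096 sectors each.
default_layout = cyl_group_layout(35001, 4096, 16)   # newfs default, 16 c/g
tuned_layout = cyl_group_layout(35001, 4096, 89)     # newfs -c 89

print("default:", default_layout)
print("-c 89:  ", tuned_layout)
```

Under that assumption the numbers agree with the "cyl groups" and "MB/g"
figures printed by newfs in both runs.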
    On a modern system it is not necessary to have the inodes be as close
    to the data blocks as it once was.

                                        -Matt

From owner-freebsd-arch Fri Nov 3 10:49:45 2000
From: Terry Lambert
Message-Id: <200011031849.LAA20781@usr07.primenet.com>
Subject: Re: Like to commit my diskprep
To: dhh@androcles.com (Duane H. Hesser)
Date: Fri, 3 Nov 2000 18:49:14 +0000 (GMT)
Cc: des@ofug.org (Dag-Erling Smorgrav), arch@FreeBSD.ORG,
    rjesup@wgate.com (Randell Jesup), mbendiks@eunet.no (Marius Bendiksen),
    dillon@earth.backplane.com (Matt Dillon),
    Cy.Schubert@uumail.gov.bc.ca (Cy Schubert - ITSD Open Systems Group)
In-Reply-To: from "Duane H. Hesser" at Nov 03, 2000 08:50:38 AM

> You are too optimistic, when you say "nearly ten years". McKusick,
> et al's paper "A Fast Filesystem for Unix", which describes the
> design of the 4.2BSD FFS, and some of the testing upon which it
> was based, is marked as "Revised July 27, 1983" in my copy of the
> 4.2BSD manuals printed by Usenix for 4.2BSD. The copy in
> /usr/share/doc/smm/05.fastfs/ is "Revised February 18, 1984".

[ ... ]

> Perhaps it *is* time to rethink defaults.
> Probably should be done
> at least once every millennium.

The defaults were rethought once. The fictional geometry that
FreeBSD uses today ignores the track-to-track seek times. This
was changed in the mid 1990's to account for disks that lied
about their geometry.

Using the fictional geometry, all of the optimizations related
to seek reduction, one of the primary foci of the FFS paper, are
disabled.

The block/cluster issue is one of fragmentation, not really of
optimization. The ability to do clusters effectively prevents
fragmentation, taking it down to 50% of a frag size, on average;
for a 4k block size FS, the frag size is 512b, and for an 8k, it's
1k, yielding unused frag averages of 256b and 512b, respectively.

The clustering code is meant to ensure relative locality of much
data within a single cylinder, while not penalizing the multiple
process locality case with too much seeking or rotational latency.
Some of the assumptions there have changed, such as inverted track
recording order, to ensure sequential reads are in the cache
(basically, prefetched by starting to read wherever you seek to,
and returning data once the sector you had asked for has been
read), and the number of sectors in a track.

One potential performance benefit for large files would be to
increase the number of sectors in a cylinder group. This is not
necessarily as big a win as you might think, since most DB access
is random. The only thing that would change this is if the average
data object was larger than one cluster in size; even then, the
actual optimal cluster size would really depend; for fixed size
records, it would be "exactly one record". If the records weren't
stored on at least 512b boundaries, this would turn into a loss,
since, given random I/O (poor locality), you will still span a
cluster at the start and end an average probability of:

        oddness = (cluster_size % record_size) ? 1 : 0
        r_per_c = (cluster_size / record_size)
        P = r_per_c : oddness + .5

...
the probability of a record spanning any given boundary.

Generally, a DB that valued speed would frag storage by never
spanning a physical media blocksize boundary (though for small
records, it would probably put more than one per block, if it had
a reasonable confidence that it wouldn't have to move them during
a record expansion later).

If you guys want to experiment with log and block structured FSs,
by all means, do so, but I don't think that you'll end up optimizing
things, unless your average object size is larger than 1/2 of a
cluster in size, which to my mind, is a very large object indeed
(18k/36k for 4k/8k block size @ 9 blocks per cluster).

                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

From owner-freebsd-arch Fri Nov 3 10:57:32 2000
To: Joerg Micheel
Cc: Zhiui Zhang, freebsd-hackers@freebsd.org, arch@freebsd.org
Subject: Re: granularity of gettimeofday()
In-Reply-To: Your message of "Sat, 04 Nov 2000 07:27:39 +1300."
    <20001104072739.K26626@cs.waikato.ac.nz>
Date: Fri, 03 Nov 2000 19:56:55 +0100
Message-ID: <2017.973277815@critter>
From: Poul-Henning Kamp

Cc'ed to arch because of its hopefully informative contents.
In message <20001104072739.K26626@cs.waikato.ac.nz>, Joerg Micheel writes:
>On Fri, Nov 03, 2000 at 07:21:21PM +0100, Poul-Henning Kamp wrote:
>> The long answer is: FreeBSD can deliver time with a resolution of
>> 1/2^32 nanosecond = 232.8E-21 seconds. The actual resolution is
>> much worse, because the hardware usually doesn't provide any better
>> than about a nanosecond at best these days. On certain hardware
>> only about a microsecond of actual resolution is available.
>
>Why couldn't gettimeofday be tuned to provide nanosecond resolution
>in reality, with the assistance of the TSC register ?

You get access to the nanosecond resolution timestamps in userland with the clock_gettime(3) API.

>I understand
>this does not work on a 486, but that's not the platform people
>ask for such a resolution. With every Celeron today exceeding 300
>MHz, 3 nanoseconds is a reality. Of course, there will be syscall
>overhead, understood.

I guess I need to explain it much more carefully:

We can, and usually do, do MUCH better than gettimeofday(); it all depends on what hardware is available.

FreeBSD uses an abstraction called a "timecounter". A timecounter is a binary counter running at a fixed frequency.

Currently the following pieces of hardware can be timecounters on the i386 platform:

    The CPU's TSC
        Frequency as fast as your CPU, ie, up to above a GHz these days. Resolution of timestamps will be 1/CPU clock, ie down to below 1 nanosecond.

    The i8254 on the motherboard.
        Runs at 1193182 Hz. Resolution of timestamps will be 838 nanoseconds.

    The PIIX on newer chipsets
        Runs at 3579545 Hz. Resolution will be 279 nsec.

    The loran timecounter.
        5 MHz, 200nsec.

    The xrpu timecounter.
        100 MHz, 10nsec. This one can timestamp external events without interrupt jitter, ie: +/- 10nsec phasenoise on the timestamp.
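As a quick cross-check of the figures in the list above: the resolution of a timecounter is just the reciprocal of its frequency, and the 232.8E-21 second figure is one nanosecond divided by 2^32. The following is a minimal sketch in Python, purely for illustration; the dictionary and function names are made up here and are not FreeBSD kernel identifiers.

```python
# Illustrative only: these names are hypothetical, not FreeBSD identifiers.
# The frequencies are the ones quoted in the list above.
TIMECOUNTER_HZ = {
    "i8254": 1_193_182,
    "PIIX": 3_579_545,
    "loran": 5_000_000,
    "xrpu": 100_000_000,
}


def resolution_ns(freq_hz):
    """Return the length of one counter tick in nanoseconds."""
    return 1e9 / freq_hz


def fraction_step_seconds(bits=32):
    """Smallest step, in seconds, of a nanosecond value with `bits` fractional bits."""
    return 1e-9 / (2 ** bits)


if __name__ == "__main__":
    for name, hz in TIMECOUNTER_HZ.items():
        print(f"{name}: {resolution_ns(hz):.1f} ns per tick")
    print(f"1/2^32 nanosecond = {fraction_step_seconds():.4g} s")
```

The TSC is left out of the table because its frequency depends on the CPU; for a given clock rate, the same resolution_ns() call applies.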
Internal to the kernel, timestamps are operated on in a resolution of nanoseconds with 32bit fractions, ie 232E-21 (= a 4 billionth of a nanosecond), but current APIs only report in either nanoseconds or microseconds.

The code in the kernel models the frequency of the hardware counter to parts in 232E-21 (~= 1 second over the lifetime of the universe) so it can more than adequately model today's atomic standards.

I have proven that claim in my lab by using a FreeBSD system (with the xrpu timecounter) to measure the phase and frequency offset of a Cesium standard relative to GPS. An HP5370B (20ps one-shot resolution) was used as control instrumentation and the results agreed well inside the theoretical window of uncertainty.

In other words, no matter what hardware you will throw at FreeBSD, it will be able to fully exploit the precision and resolution of that hardware.

But the two *really* interesting things about the FreeBSD code are:

    You can change your timecounter on the fly. This allows the machine to boot using maybe the TSC, then load the bitcode on the xrpu board, initialize the hardware on the xrpu and start to use that as the timecounter.

    If the hardware can be read atomically, no interrupt locking is used. This means on a multi-CPU system you will not have to block interrupts to figure out what time it is; in fact all CPUs can find out what time it is *at the same time*, without interfering with each other.

I believe those two features are unique to FreeBSD at this time.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

From owner-freebsd-arch Fri Nov 3 12:58:56 2000
Delivered-To: freebsd-arch@freebsd.org
Received: from rover.village.org (unknown [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id E3CAC37B4C5 for ; Fri, 3 Nov 2000 12:58:46 -0800 (PST)
Received: from harmony.village.org (harmony.village.org [10.0.0.6]) by rover.village.org (8.11.0/8.11.0) with ESMTP id eA3Kwfn47181; Fri, 3 Nov 2000 13:58:45 -0700 (MST) (envelope-from imp@harmony.village.org)
Received: from harmony.village.org (localhost.village.org [127.0.0.1]) by harmony.village.org (8.9.3/8.8.3) with ESMTP id NAA20184; Fri, 3 Nov 2000 13:58:35 -0700 (MST)
Message-Id: <200011032058.NAA20184@harmony.village.org>
To: Dag-Erling Smorgrav
Subject: Re: Like to commit my diskprep
Cc: arch@FreeBSD.org
In-reply-to: Your message of "03 Nov 2000 15:49:45 +0100."
References: <200011031440.eA3Eebp39614@cwsys.cwsent.com>
Date: Fri, 03 Nov 2000 13:58:35 -0700
From: Warner Losh
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

In message Dag-Erling Smorgrav writes:
: Cy Schubert - ITSD Open Systems Group writes:
: > Wouldn't that tend to generally reduce day-to-day performance as well?
: > I suspect that Kirk and co. at CSRG had a good reason for choosing the
: > defaults they did.
:
: Certainly, but I believe these defaults were chosen nearly ten years
: ago (if not more) on hardware which we today charitably describe as
: "antiquated" :)

IIRC, and I've not checked the historical Unix cdrom that a buddy has to be sure, these defaults haven't changed since the 4.2BSD release, which was 1983. I'm almost positive they were in place for the 4.3BSD release in 1986 which our university cs department upgraded to in early 1987.

ufs's 8k block size is very deeply rooted in history.
I know that SunOS 4.0 had it and I'm almost certain that 3.5 had it as default as well (but I only did a couple 3.2 and 3.5 installs before upgrading to 4.0.1). SunOS 3.x and 4.x came from BSD 4.2 plus a bunch of hacking.

So we're pushing closer to 20 years ago rather than 10 years ago. BSD 4.0 was released in October 1980, per bsd-family-tree. I think that these defaults were there as well, but again, I've not looked at the cd to make sure.

Warner

From owner-freebsd-arch Fri Nov 3 13: 4:59 2000
Delivered-To: freebsd-arch@freebsd.org
Received: from rover.village.org (unknown [204.144.255.49]) by hub.freebsd.org (Postfix) with ESMTP id EFE3537B4D7 for ; Fri, 3 Nov 2000 13:04:53 -0800 (PST)
Received: from harmony.village.org (harmony.village.org [10.0.0.6]) by rover.village.org (8.11.0/8.11.0) with ESMTP id eA3L4on47221; Fri, 3 Nov 2000 14:04:51 -0700 (MST) (envelope-from imp@harmony.village.org)
Received: from harmony.village.org (localhost.village.org [127.0.0.1]) by harmony.village.org (8.9.3/8.8.3) with ESMTP id OAA20257; Fri, 3 Nov 2000 14:04:50 -0700 (MST)
Message-Id: <200011032104.OAA20257@harmony.village.org>
To: Randell Jesup
Subject: Re: Like to commit my diskprep
Cc: Barry Pederson , arch@FreeBSD.ORG
In-reply-to: Your message of "03 Nov 2000 13:27:26 EST."
References: <200011021632.eA2GWZ138286@earth.backplane.com> <3A01C6A3.23C4CD61@medicine.nodak.edu>
Date: Fri, 03 Nov 2000 14:04:50 -0700
From: Warner Losh
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

In message Randell Jesup writes:
: a: 819200 0 4.2BSD 4096 16384 75 # (Cyl. 0 - 812*)

We boot off of 512/4096 file systems all the time for our embedded systems that have 64M or 32M CF cards in them. This is on 3.4 and 4.x based systems.
These use bog standard boot blocks and just a reduced set of FreeBSD utilities as the only compression (well, and a termcap that is really tiny).

Warner

From owner-freebsd-arch Fri Nov 3 19:16:47 2000
Delivered-To: freebsd-arch@freebsd.org
Received: from homer.softweyr.com (bsdconspiracy.net [208.187.122.220]) by hub.freebsd.org (Postfix) with ESMTP id 13F6437B4CF for ; Fri, 3 Nov 2000 19:16:44 -0800 (PST)
Received: from [127.0.0.1] (helo=softweyr.com ident=Fools trust ident!) by homer.softweyr.com with esmtp (Exim 3.16 #1) id 13rtp3-0000XA-00; Fri, 03 Nov 2000 20:16:49 -0700
Message-ID: <3A037FA1.75B1C328@softweyr.com>
Date: Fri, 03 Nov 2000 20:16:49 -0700
From: Wes Peters
Organization: Softweyr LLC
X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.2.12 i386)
X-Accept-Language: en
MIME-Version: 1.0
To: Marius Bendiksen
Cc: Randell Jesup , arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
References:
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Marius Bendiksen wrote:
>
> Not to bring out the paint early, but I have a suggestion, should the
> concept of hog partitions be introduced (regardless of whether you stick
> them in disklabel, diskpart, or yadisklabel3): make it possible to define
> multiple variable-sized partitions, with percentile ratio to use from the
> hog-space, ie.
>
> / 64m
> /var 128m
> /usr 50%
> /home 50%
>
> That would yield more flexibility, at a (hopefully) low additional cost in
> code.

With a 'disk hog' partition editor, this is simple to do. As you add each partition, it shows you the amount of space now allocated to the disk hog. If you have /home assigned as the disk hog, when you finish adding /, /var, and swap space, simply choose about half of the space shown for /home as your space for /usr.
The same holds true for when you want, say 25% for /usr, 25% for /var, and the rest for /home. Make /home the hog, add / and swap, then divide the space in /home by 4. Add that much for /usr and /var and save.

All of these were well-known tricks on SunOS.

--
"Where am I, and what am I doing in this handbasket?"

Wes Peters Softweyr LLC
wes@softweyr.com http://softweyr.com/

From owner-freebsd-arch Fri Nov 3 21:34:16 2000
Delivered-To: freebsd-arch@freebsd.org
Received: from point.osg.gov.bc.ca (point.osg.gov.bc.ca [142.32.102.44]) by hub.freebsd.org (Postfix) with ESMTP id 52AF437B4CF for ; Fri, 3 Nov 2000 21:34:13 -0800 (PST)
Received: (from daemon@localhost) by point.osg.gov.bc.ca (8.8.7/8.8.8) id VAA09240; Fri, 3 Nov 2000 21:33:27 -0800
Received: from passer.osg.gov.bc.ca(142.32.110.29) via SMTP by point.osg.gov.bc.ca, id smtpda09236; Fri Nov 3 21:33:14 2000
Received: (from uucp@localhost) by passer.osg.gov.bc.ca (8.11.0/8.9.1) id eA45XBZ20359; Fri, 3 Nov 2000 21:33:11 -0800 (PST)
Received: from cwsys9.cwsent.com(10.2.2.1), claiming to be "cwsys.cwsent.com" via SMTP by passer9.cwsent.com, id smtpdk20355; Fri Nov 3 21:32:15 2000
Received: (from uucp@localhost) by cwsys.cwsent.com (8.11.1/8.9.1) id eA45WEo65619; Fri, 3 Nov 2000 21:32:14 -0800 (PST)
Message-Id: <200011040532.eA45WEo65619@cwsys.cwsent.com>
Received: from localhost.cwsent.com(127.0.0.1), claiming to be "cwsys" via SMTP by localhost.cwsent.com, id smtpdY65613; Fri Nov 3 21:31:37 2000
X-Mailer: exmh version 2.2 06/23/2000 with nmh-1.0.4
Reply-To: Cy Schubert - ITSD Open Systems Group
From: Cy Schubert - ITSD Open Systems Group
X-OS: FreeBSD 4.1.1-RELEASE
X-Sender: cy
To: Randell Jesup
Cc: Marius Bendiksen , Alfred Perlstein , Matt Dillon , arch@FreeBSD.ORG
Subject: Re: Like to commit my diskprep
In-reply-to: Your message of "03 Nov 2000 11:57:58 EST."
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 03 Nov 2000 21:31:36 -0800
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

In message , Randell Jesup writes:
> >> Yes, patches would be nice. :)
> >
> >Patches cannot be formed until a general consensus exists on how the
> >patches should do things if and when an enterprising soul made them.
> >Otherwise, they stand a good chance at being rejected based on some,
> >possibly relevant, objection to how they work.
> >
> >Also, such patches are likely best formed by the same people that are
> >currently suggesting doing a variety of other things for disklabel and
> >friends.
>
> I'm willing to help on this, though my time may be limited. I have
> _extensive_ FS experience from my Amiga days, and also was the primary
> disk-driver person and SCSI expert, and also did "archive" filesystems for
> Scala. I've never hacked the internals of ufs, however, but I do know the
> issues.

Would this become a new filesystem, e.g. extfs vs. ext2fs -- ufs vs. ufs2? If not, would there be some kind of conversion procedure or would existing filesystems have to be backed up, reinitialised and restored? Or instead, would the filesystem convert to the new format on the fly? In other words, how would this affect our "customers"?

Regards,                       Phone:  (250)387-8437
Cy Schubert                      Fax:  (250)387-5766
Team Leader, Sun/DEC Team   Internet:  Cy.Schubert@osg.gov.bc.ca
Open Systems Group, ITSD, ISTA
Province of BC