From: Matthew Dillon <dillon@apollo.backplane.com>
Date: Thu, 23 Mar 2006 16:29:39 -0800 (PST)
To: Gary Palmer
Cc: stable@freebsd.org
Subject: Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
Message-Id: <200603240029.k2O0Tdsq069230@apollo.backplane.com>
References: <200603211607.30372.mi+mx@aldan.algebra.com> <200603231403.36136.mi+mx@aldan.algebra.com> <200603232048.k2NKm4QL067644@apollo.backplane.com> <200603231626.19102.mi+mx@aldan.algebra.com> <200603232316.k2NNGBka068754@apollo.backplane.com> <20060323233204.GA14996@in-addr.com>
List-Id: Production branch of FreeBSD source code

:I thought one serious advantage to this situation for sequential read
:mmap() is to madvise(MADV_DONTNEED) so that the pages don't have to
:wait for the clock hands to reap them.
:On a large Solaris box I used to have the non-pleasure of running,
:the VM page scan rate was high, and I suggested to the app vendor
:that proper use of mmap might reduce that overhead. Admittedly the
:files in question were much smaller than the available memory, but
:they were also not likely to be referenced again before the memory
:had to be reclaimed forcibly by the VM system.
:
:Is that not the case? Is it better to let the VM system reclaim
:pages as needed?
:
:Thanks,
:
:Gary

    madvise() should theoretically have that effect, but it isn't quite
so simple a solution. Let's say you have, oh, your workstation, with 1GB
of RAM, and you run a program which makes several passes over a 900MB
data set. Your X session, xterms, gnome, kde, etc etc etc all take
around 300MB of working memory. Now that data set could fit into memory
if portions of your UI were pushed out of it. The question is not only
how much of that data set the kernel should fit into memory, but which
portions of that data set it should fit into memory, and whether the
kernel should bump out other data (pieces of your UI) to make it fit.

    Scenario #1: If the kernel fits the whole 900MB data set into
memory, the entire rest of the system has to compete for the remaining
100MB of memory. Your UI would suck rocks.

    Scenario #2: If the kernel fits 700MB of the data set into memory,
the rest of the system (your UI, etc.) is only using 300MB, and the
kernel is applying MADV_DONTNEED to pages it has already scanned, your
UI works fine, but your data-processing program is continuously
accessing the disk for all 900MB of data, on every pass, because the
kernel only ever keeps the most recently accessed 700MB of the 900MB
data set in memory.

    Scenario #3: Now let's say the kernel decides to keep just the
first 700MB of the data set in memory, and not try to cache the last
200MB of the data set.
Now your UI works fine, and your processing program runs FOUR TIMES
FASTER, because it only has to access the disk for the last 200MB of
the 900MB data set (200MB of disk I/O per pass instead of 900MB).

    --

    Now, which of these scenarios does madvise() cover? Does it cover
scenario #1? Well, no. The madvise() call that the program makes has no
clue whether you intend to play around with your UI every few minutes,
or whether you intend to leave the room for 40 minutes. If the kernel
guesses wrong, we wind up with one unhappy user.

    What about scenario #2? There the program decided to call
madvise(), and the system dutifully reuses the pages, and you come back
an hour later to find your data-processing program has done only 10 of
the 50 passes it needs to make over the data, and you are PISSED.

    Ok. What about scenario #3? Oops. The program has no way of knowing
how much memory you need for your UI to be 'happy'. No madvise() call
of any sort will make you happy. Not only that, but the KERNEL has no
way of knowing that your data-processing program intends to make
multiple passes over the data set, whether the working set is
represented by one file or several files, and even the data-processing
program itself might not know (you might be running a script which runs
a separate program for each pass over the same data set). So much for
madvise().

    So, no matter what, there will ALWAYS be an unhappy user somewhere.

    Let's take Mikhail's grep test as an example. If he runs it over
and over again, should the kernel be 'optimized' to realize that the
same data set is being scanned sequentially, over and over again,
ignore the localized sequential nature of the data accesses, and just
keep a dedicated portion of that data set in memory to reduce long-term
disk access? Should it keep the first 1.5GB, or the last 1.5GB, or
perhaps it should slice the data set up and keep every other 256MB
block? How does it figure out what to cache, and when? What if the
program suddenly starts accessing the data in a cacheable way?
Maybe it should randomly throw some of the data away, slowly, in the
hopes of 'adapting' to the access pattern, which would also require
that it throw away most of the 'recently read' data far more quickly
to make up for the data it isn't throwing away. Believe it or not,
that actually works for certain types of problems, except then you get
hung up in a situation where two subsystems are competing with each
other for memory resources (like a mail server versus a web server),
and the system is unable to cope as the relative load factors of the
competing subsystems change. The problem becomes really complex,
really fast.

    This sort of problem is easy to consider in human terms, but
virtually impossible to program into a computer with a heuristic, or
even with specific madvise() calls. The computer simply does not know
what the human operator expects from one moment to the next. The
problem Mikhail is facing is one where his human assumptions do not
match the assumptions the kernel is making about data retention,
assumed system load, and the many other factors the kernel uses to
decide what to keep, what to throw away, and when.

    --

    Now, aside from the potential read-ahead issue, which could be a
real issue for FreeBSD (but not one really worthy of insulting someone
over), there is literally no way for a kernel programmer to engineer
the 'perfect' set of optimizations for a system. There are a huge
number of pits you can fall into if you try to over-optimize a system.
Each optimization adds that much more complexity to an already complex
system, and has that much greater a chance of introducing yet another
hard-to-find bug.

    Nearly all operating systems that I know of tend to presume a
certain degree of locality of reference for mmap()'d pages. It just so
happens that Mikhail's test has no locality of reference. But 99.9% of
the programs ever run on a BSD system WILL, so which should the kernel
programmer spend all his time coding optimizations for?
The 99.9% of the time, or the 0.1% of the time?

					-Matt