From owner-freebsd-hackers@FreeBSD.ORG Fri Jul 16 23:33:45 2010
From: Alan Cox <alc@cs.rice.edu>
Date: Fri, 16 Jul 2010 18:33:06 -0500
To: Peter Jeremy
Cc: alc@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: disk I/O, VFS hirunningspace
Message-ID: <4C40EC32.3030700@cs.rice.edu>
In-Reply-To: <20100716093041.GB26367@server.vk2pj.dyndns.org>
References: <20100714090454.1177b96b@ernst.jennejohn.org> <20100716093041.GB26367@server.vk2pj.dyndns.org>

Peter Jeremy wrote:
> Regarding vfs.lorunningspace and vfs.hirunningspace...
>
> On 2010-Jul-15 13:52:43 -0500, Alan Cox wrote:
>> Keep in mind that we still run on some fairly small systems with limited I/O
>> capabilities, e.g., a typical arm platform. More generally, with the range
>> of systems that FreeBSD runs on today, any particular choice of constants is
>> going to perform poorly for someone. If nothing else, making these sysctls
>> a function of the buffer cache size is probably better than any particular
>> constants.
>
> That sounds reasonable but brings up a related issue - the buffer
> cache. Given the unified VM system no longer needs a traditional Unix
> buffer cache, what is the buffer cache still used for?

Today, it is essentially a mapping cache. So, what does that mean?

After you've set aside a modest amount of physical memory for the kernel to hold its own internal data structures, all of the remaining physical memory can potentially be used to cache file data. However, on many architectures this is far more memory than the kernel can instantaneously access. Consider i386. You might have 4+ GB of physical memory, but the kernel address space is (by default) only 1 GB. So, at any instant in time, only a fraction of the physical memory is instantaneously accessible to the kernel. In general, to access an arbitrary physical page, the kernel is going to have to replace an existing virtual-to-physical mapping in its address space with one for the desired page. (Generally speaking, on most architectures, even the kernel can't directly access physical memory that isn't mapped by a virtual address.)

The buffer cache is essentially a region of the kernel address space that is dedicated to mappings to physical pages containing cached file data. As applications access files, the kernel dynamically maps (and unmaps) physical pages containing cached file data into this region. Once the desired pages are mapped, read(2) and write(2) can essentially "bcopy" between the buffer cache mapping and the application's buffer. (Understand that this buffer cache mapping is a prerequisite for that copy to occur.)

So, why did I call it a mapping cache? There is generally locality in the access to file data. So, rather than map and unmap the desired physical pages on every read and write, the mappings to file data are allowed to persist and are managed much like many other kinds of caches. When the kernel needs to map a new set of file pages, it finds an older, not-so-recently used mapping and destroys it, allowing those kernel virtual addresses to be remapped to the new pages.
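To make that concrete, here is a toy userspace sketch of a mapping cache. To be clear, this is purely illustrative; it is not the kernel's buffer cache code, and the names and sizes in it are invented. A handful of "slots" stand in for the limited kernel virtual addresses, a larger set of "pages" stands in for physical memory, and the least recently used mapping is recycled on a miss (the real buffer cache is only approximately LRU).

/*
 * Toy model of a mapping cache (illustration only, not FreeBSD code).
 * NSLOTS "kernel virtual address slots" hold mappings to a much larger
 * set of NPAGES "physical pages".  Touching a page that isn't mapped
 * recycles the least recently used slot, much as the buffer cache
 * reuses an old buffer's KVA for newly accessed file pages.
 */
#include <stdio.h>

#define NSLOTS	4	/* stand-in for limited buffer cache KVA */
#define NPAGES	16	/* stand-in for (much larger) physical memory */

static int slot_page[NSLOTS];	/* page mapped by each slot, -1 if none */
static int slot_age[NSLOTS];	/* last-use time stamp, for LRU */
static int now;

/* "Map" a page: return its slot, recycling the LRU slot on a miss. */
static int
map_page(int page)
{
	int i, victim;

	for (i = 0; i < NSLOTS; i++)
		if (slot_page[i] == page) {	/* mapping cache hit */
			slot_age[i] = ++now;
			return (i);
		}
	victim = 0;
	for (i = 1; i < NSLOTS; i++)		/* miss: find the LRU victim */
		if (slot_age[i] < slot_age[victim])
			victim = i;
	printf("remap slot %d: page %d -> page %d\n",
	    victim, slot_page[victim], page);
	slot_page[victim] = page;	/* destroy the old mapping, create a new one */
	slot_age[victim] = ++now;
	return (victim);
}

int
main(void)
{
	int accesses[] = { 1, 2, 1, 3, 4, 1, 5, 2 };
	int i;

	for (i = 0; i < NSLOTS; i++)
		slot_page[i] = -1;
	for (i = 0; i < (int)(sizeof(accesses) / sizeof(accesses[0])); i++)
		map_page(accesses[i]);
	return (0);
}

Running it shows the remapping traffic: the pages with locality tend to stay mapped, while the rest keep forcing remaps. That, in a nutshell, is the work the buffer cache is trying to avoid repeating.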
So far, I've used i386 as a motivating example. What of other architectures? Most 64-bit machines take advantage of their large address space by implementing some form of "direct map" that provides instantaneous access to all of physical memory. (Again, I use "instantaneous" to mean that the kernel doesn't have to dynamically create a virtual-to-physical mapping before being able to access the data.) On these machines, you could, in principle, use the direct map to implement the "bcopy" to the application's buffer. So, what is the point of the buffer cache on these machines?

A trivial benefit is that the file pages are mapped contiguously in the buffer cache. Even though the underlying physical pages may be scattered throughout the physical address space, they are mapped contiguously. So, the "bcopy" doesn't need to worry about every page boundary, only buffer boundaries.

The buffer cache also plays a role in the page replacement mechanism. Once mapped into the buffer cache, a page is "wired", that is, it is removed from the paging lists, where the page daemon could reclaim it. However, a page in the buffer cache should really be thought of as being "active". In fact, when a page is unmapped from the buffer cache, it is placed at the tail of the virtual memory system's "inactive" list, the same place where the virtual memory system would place a physical page that it is transitioning from "active" to "inactive". If an application later performs a read(2) from or write(2) to the same page, that page will be removed from the "inactive" list and mapped back into the buffer cache. So, the mapping and unmapping process contributes to creating an LRU-ordered "inactive" queue.

Finally, the buffer cache limits the amount of dirty file system data that is cached in memory.

> ... Is the current
> tuning formula still reasonable (for virtually all current systems
> it's basically 10MB + 10% RAM)?

It's probably still good enough. However, this is not a statement for which I have supporting data. So, I reserve the right to change my opinion. :-)

Consider what the buffer cache now does. It's just a mapping cache. Increasing the buffer cache size doesn't affect (much) the amount of physical memory available for caching file data. So, unlike ancient times, increasing the size of the buffer cache isn't going to have nearly the same effect on the amount of actual I/O that your machine does.

For some workloads, increasing the buffer cache size may have a greater impact on CPU overhead than on I/O overhead. For example, all of your file data might fit into physical memory, but you're doing random read accesses to it. That would cause the buffer cache to thrash, even though you wouldn't do any actual I/O. Unfortunately, mapping pages into the buffer cache isn't trivial. For example, it requires every processor to be interrupted to invalidate some entries from its TLB. (This is a so-called "TLB shootdown".)
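Incidentally, if you want to see what these limits work out to on a particular machine, the values are exported as sysctls; "sysctl vfs | grep buf" will show them. Here is a minimal sketch that reads a few of them with sysctlbyname(3). The particular list of names and the output format are only for illustration, and I'm not committing to the exact integer type of each sysctl, so the sketch asks the kernel for the size rather than assuming one.

/*
 * Sketch: print some buffer cache related sysctls (FreeBSD only).
 * The list of names and the output format are illustrative; the
 * value width is taken from the kernel rather than assumed.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>
#include <string.h>

static void
show(const char *name)
{
	unsigned char buf[16];
	size_t len = sizeof(buf);
	long long val;

	if (sysctlbyname(name, buf, &len, NULL, 0) == -1) {
		printf("%-22s (not available)\n", name);
		return;
	}
	if (len == sizeof(int)) {
		int v;

		memcpy(&v, buf, sizeof(v));
		val = v;
	} else if (len == sizeof(long long)) {
		long long v;

		memcpy(&v, buf, sizeof(v));
		val = v;
	} else {
		printf("%-22s (unexpected size %zu)\n", name, len);
		return;
	}
	printf("%-22s %12lld bytes\n", name, val);
}

int
main(void)
{
	static const char *names[] = {
		"vfs.maxbufspace", "vfs.hibufspace", "vfs.lobufspace",
		"vfs.bufspace", "vfs.hirunningspace", "vfs.lorunningspace",
	};
	size_t i;

	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
		show(names[i]);
	return (0);
}

Comparing vfs.bufspace with vfs.maxbufspace shows how much of the buffer map is currently in use; as I said above, that is a statement about mappings, not about how much file data is actually cached.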
> ... How can I measure the effectiveness
> of the buffer cache?

I'm not sure that I can give you a short answer to this question.

> The buffer cache size is also very tightly constrained (vfs.hibufspace
> and vfs.lobufspace differ by 64KB) and at least one of the underlying
> tuning parameters have comments at variance with current reality:
> In:
>
> * MAXBSIZE - Filesystems are made out of blocks of at most MAXBSIZE bytes
> *            per block. MAXBSIZE may be made larger without effecting
> ...
> *
> * BKVASIZE - Nominal buffer space per buffer, in bytes. BKVASIZE is the
> ...
> *            The default is 16384, roughly 2x the block size used by a
> *            normal UFS filesystem.
> */
> #define MAXBSIZE  65536  /* must be power of 2 */
> #define BKVASIZE  16384  /* must be power of 2 */
>
> There's no mention of the 64KiB limit in newfs(8) and I recall seeing
> occasional comments from people who have either tried or suggested
> trying larger blocksizes.

I believe that larger than 64KB would fail an assertion.

> Likewise, the default UFS blocksize has
> been 16KiB for quite a while. Are the comments still valid and, if so,
> should BKVASIZE be doubled to 32768 and a suitable note added to newfs(8)
> regarding the maximum block size?

If I recall correctly, increasing BKVASIZE would only reduce the number of buffer headers. In other words, it might avoid wasting some memory on buffer headers that won't be used.

Alan