From owner-svn-src-head@FreeBSD.ORG Fri Jan 16 18:17:10 2015
Message-Id: <201501161817.t0GIHA9U005783@svn.freebsd.org>
From: Alan Cox <alc@FreeBSD.org>
Date: Fri, 16 Jan 2015 18:17:10 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r277255 - head/sys/vm
List-Id: SVN commit messages for the src tree for head/-current

Author: alc
Date: Fri Jan 16 18:17:09 2015
New Revision: 277255
URL: https://svnweb.freebsd.org/changeset/base/277255

Log:
  Revamp the default page clustering strategy that is used by the page fault
  handler.
  For roughly twenty years, the page fault handler has used the same basic
  strategy: Fetch a fixed number of non-resident pages both ahead and behind
  the virtual page that was faulted on.  Over the years, alternative
  strategies have been implemented for optimizing the handling of random and
  sequential access patterns, but the only change to the default strategy has
  been to increase the number of pages read ahead to 7 and behind to 8.

  The problem with the default page clustering strategy becomes apparent
  when you look at how it behaves on the code section of an executable or
  shared library.  (To simplify the following explanation, I'm going to
  ignore the read that is performed to obtain the header and assume that no
  pages are resident at the start of execution.)  Suppose that we have a
  code section consisting of 32 pages.  Further, suppose that we access
  pages 4, 28, and 16 in that order.  Under the default page clustering
  strategy, we page fault three times and perform three I/O operations,
  because the first and second page faults only read a truncated cluster of
  12 pages.  In contrast, if we access pages 8, 24, and 16 in that order, we
  only fault twice and perform two I/O operations, because the first and
  second page faults read a full cluster of 16 pages.  In general, truncated
  clusters are more common than full clusters.

  To address this problem, this revision changes the default page clustering
  strategy to align the start of the cluster to a page offset within the vm
  object that is a multiple of the cluster size.  This results in many fewer
  truncated clusters.  Returning to our example, if we now access pages 4,
  28, and 16 in that order, the cluster that is read to satisfy the page
  fault on page 28 will now include page 16.  So, the access to page 16 will
  no longer page fault and perform an I/O operation.

  Since the revised default page clustering strategy is typically reading
  more pages at a time, we are likely to read a few more pages that are
  never accessed.
  However, for the various programs that we looked at, including clang,
  emacs, firefox, and openjdk, the reduction in the number of page faults
  and I/O operations far outweighed the increase in the number of pages that
  are never accessed.  Moreover, the extra resident pages allowed for many
  more superpage mappings.  For example, if we look at the execution of
  clang during a buildworld, the number of (hard) page faults on the code
  section drops by 26%, the number of superpage mappings increases by about
  29,000, but the number of never accessed pages only increases from 30.38%
  to 33.66%.

  Finally, this leads to a small but measurable reduction in execution time.

  In collaboration with:	Emily Pettigrew
  Differential Revision:	https://reviews.freebsd.org/D1500
  Reviewed by:	jhb, kib
  MFC after:	6 weeks

Modified:
  head/sys/vm/vm_fault.c

Modified: head/sys/vm/vm_fault.c
==============================================================================
--- head/sys/vm/vm_fault.c	Fri Jan 16 17:41:21 2015	(r277254)
+++ head/sys/vm/vm_fault.c	Fri Jan 16 18:17:09 2015	(r277255)
@@ -108,6 +108,7 @@ __FBSDID("$FreeBSD$");
 static int vm_fault_additional_pages(vm_page_t, int, int, vm_page_t *, int *);
 
 #define	VM_FAULT_READ_BEHIND	8
+#define	VM_FAULT_READ_DEFAULT	(1 + VM_FAULT_READ_AHEAD_INIT)
 #define	VM_FAULT_READ_MAX	(1 + VM_FAULT_READ_AHEAD_MAX)
 #define	VM_FAULT_NINCR		(VM_FAULT_READ_MAX / VM_FAULT_READ_BEHIND)
 #define	VM_FAULT_SUM		(VM_FAULT_NINCR * (VM_FAULT_NINCR + 1) / 2)
@@ -292,7 +293,6 @@ vm_fault_hold(vm_map_t map, vm_offset_t
 	int fault_flags, vm_page_t *m_hold)
 {
 	vm_prot_t prot;
-	long ahead, behind;
 	int alloc_req, era, faultcount, nera, reqpage, result;
 	boolean_t growstack, is_first_object_locked, wired;
 	int map_generation;
@@ -302,7 +302,7 @@ vm_fault_hold(vm_map_t map, vm_offset_t
 	struct faultstate fs;
 	struct vnode *vp;
 	vm_page_t m;
-	int locked, error;
+	int ahead, behind, cluster_offset, error, locked;
 
 	hardfault = 0;
 	growstack = TRUE;
@@ -555,45 +555,59 @@ readrest:
 			int rv;
 			u_char behavior = vm_map_entry_behavior(fs.entry);
 
+			era = fs.entry->read_ahead;
 			if (behavior == MAP_ENTRY_BEHAV_RANDOM ||
 			    P_KILLED(curproc)) {
 				behind = 0;
+				nera = 0;
 				ahead = 0;
 			} else if (behavior == MAP_ENTRY_BEHAV_SEQUENTIAL) {
 				behind = 0;
-				ahead = atop(fs.entry->end - vaddr) - 1;
-				if (ahead > VM_FAULT_READ_AHEAD_MAX)
-					ahead = VM_FAULT_READ_AHEAD_MAX;
+				nera = VM_FAULT_READ_AHEAD_MAX;
+				ahead = nera;
 				if (fs.pindex == fs.entry->next_read)
 					vm_fault_cache_behind(&fs,
 					    VM_FAULT_READ_MAX);
-			} else {
+			} else if (fs.pindex == fs.entry->next_read) {
 				/*
-				 * If this is a sequential page fault, then
-				 * arithmetically increase the number of pages
-				 * in the read-ahead window.  Otherwise, reset
-				 * the read-ahead window to its smallest size.
+				 * This is a sequential fault.  Arithmetically
+				 * increase the requested number of pages in
+				 * the read-ahead window.  The requested
+				 * number of pages is "# of sequential faults
+				 * x (read ahead min + 1) + read ahead min"
 				 */
-				behind = atop(vaddr - fs.entry->start);
-				if (behind > VM_FAULT_READ_BEHIND)
-					behind = VM_FAULT_READ_BEHIND;
-				ahead = atop(fs.entry->end - vaddr) - 1;
-				era = fs.entry->read_ahead;
-				if (fs.pindex == fs.entry->next_read) {
-					nera = era + behind;
+				behind = 0;
+				nera = VM_FAULT_READ_AHEAD_MIN;
+				if (era > 0) {
+					nera += era + 1;
 					if (nera > VM_FAULT_READ_AHEAD_MAX)
 						nera = VM_FAULT_READ_AHEAD_MAX;
-					behind = 0;
-					if (ahead > nera)
-						ahead = nera;
-					if (era == VM_FAULT_READ_AHEAD_MAX)
-						vm_fault_cache_behind(&fs,
-						    VM_FAULT_CACHE_BEHIND);
-				} else if (ahead > VM_FAULT_READ_AHEAD_MIN)
-					ahead = VM_FAULT_READ_AHEAD_MIN;
-				if (era != ahead)
-					fs.entry->read_ahead = ahead;
+				}
+				ahead = nera;
+				if (era == VM_FAULT_READ_AHEAD_MAX)
+					vm_fault_cache_behind(&fs,
+					    VM_FAULT_CACHE_BEHIND);
+			} else {
+				/*
+				 * This is a non-sequential fault.  Request a
+				 * cluster of pages that is aligned to a
+				 * VM_FAULT_READ_DEFAULT page offset boundary
+				 * within the object.  Alignment to a page
+				 * offset boundary is more likely to coincide
+				 * with the underlying file system block than
+				 * alignment to a virtual address boundary.
+				 */
+				cluster_offset = fs.pindex %
+				    VM_FAULT_READ_DEFAULT;
+				behind = ulmin(cluster_offset,
+				    atop(vaddr - fs.entry->start));
+				nera = 0;
+				ahead = VM_FAULT_READ_DEFAULT - 1 -
+				    cluster_offset;
 			}
+			ahead = ulmin(ahead, atop(fs.entry->end - vaddr) - 1);
+			if (era != nera)
+				fs.entry->read_ahead = nera;
 
 			/*
 			 * Call the pager to retrieve the data, if any, after