From owner-svn-src-all@FreeBSD.ORG  Thu May 24 03:38:48 2012
Return-Path: <owner-svn-src-all@FreeBSD.ORG>
Delivered-To: svn-src-all@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id EA4661065676;
	Thu, 24 May 2012 03:38:47 +0000 (UTC) (envelope-from alc@FreeBSD.org)
Received: from svn.freebsd.org (svn.freebsd.org [IPv6:2001:4f8:fff6::2c])
	by mx1.freebsd.org (Postfix) with ESMTP id CA6138FC12;
	Thu, 24 May 2012 03:38:47 +0000 (UTC)
Received: from svn.freebsd.org (localhost [127.0.0.1])
	by svn.freebsd.org (8.14.4/8.14.4) with ESMTP id q4O3clCE012992;
	Thu, 24 May 2012 03:38:47 GMT (envelope-from alc@svn.freebsd.org)
Received: (from alc@localhost)
	by svn.freebsd.org (8.14.4/8.14.4/Submit) id q4O3cl6x012988;
	Thu, 24 May 2012 03:38:47 GMT (envelope-from alc@svn.freebsd.org)
Message-Id: <201205240338.q4O3cl6x012988@svn.freebsd.org>
From: Alan Cox <alc@FreeBSD.org>
Date: Thu, 24 May 2012 03:38:47 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org,
	svn-src-stable@freebsd.org, svn-src-stable-9@freebsd.org
X-SVN-Group: stable-9
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Cc: 
Subject: svn commit: r235876 - stable/9/sys/vm
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for
	\"user\" and \"projects\"\)" <svn-src-all.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
	<mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
	<mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 24 May 2012 03:38:48 -0000

Author: alc
Date: Thu May 24 03:38:47 2012
New Revision: 235876
URL: http://svn.freebsd.org/changeset/base/235876

Log:
  MFC r235230
    Give vm_fault()'s sequential access optimization a makeover.
  
    There are two aspects to the sequential access optimization: (1) reading
    ahead pages that are expected to be accessed in the near future and
    (2) unmapping and caching pages behind the fault that are not expected to
    be accessed again.  This revision changes both aspects.
  
    The read ahead optimization is now more effective.  It starts with the same
    initial read window as before, but arithmetically grows the window on
    sequential page faults.  This can yield increased read bandwidth.  For
    example, on one of my machines, a program using mmap() to read a file that
    is several times larger than the machine's physical memory takes about 17%
    less time to complete.
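  
    As a rough illustration of the arithmetic growth described above, here is
    a minimal user-space sketch, not the kernel code.  It assumes the fault is
    far enough from both ends of the map entry that the increment is the full
    VM_FAULT_READ_BEHIND (8 pages) and the window is not clipped, and it
    assumes MAXPHYS is 128 KB with 4 KB pages, so that
    VM_FAULT_READ_AHEAD_MAX works out to 31.  next_window() is a hypothetical
    helper name.

```c
#include <assert.h>
#include <stdbool.h>

/* Values taken from the diff; READ_AHEAD_MAX is an assumption (see above). */
#define READ_BEHIND     8
#define READ_AHEAD_MIN  7
#define READ_AHEAD_INIT 15
#define READ_AHEAD_MAX  31	/* assumed: atop(128 KB) - 1 with 4 KB pages */

/*
 * Hypothetical helper: given the current read-ahead window "era" and whether
 * this fault continues the previous read, return the new window size.
 */
static int
next_window(int era, bool sequential)
{
	int nera;

	assert(era >= 0 && era <= READ_AHEAD_MAX);
	if (!sequential)
		return (READ_AHEAD_MIN);	/* reset to the smallest window */
	nera = era + READ_BEHIND;		/* arithmetic growth */
	if (nera > READ_AHEAD_MAX)
		nera = READ_AHEAD_MAX;
	return (nera);
}
```

    Under these assumptions, starting from READ_AHEAD_INIT, successive
    sequential faults give windows of 23, 31, 31 pages, and a single
    non-sequential fault drops the window back to 7.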
  
    The unmap and cache behind optimization is now more selectively applied.
    The read ahead window must grow to its maximum size before unmap and cache
    behind is performed.  This significantly reduces the number of times that
    pages are unmapped and cached only to be reactivated a short time later.
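  
    The gating condition above can be sketched as follows for the default
    (neither random nor sequential) map entry behavior.  This is a simplified
    user-space model, not the kernel code: cache_behind_distance() is a
    hypothetical name, and VM_FAULT_READ_AHEAD_MAX is again assumed to be 31
    (MAXPHYS of 128 KB, 4 KB pages).  The macros mirror the ones that the
    diff adds to vm_fault.c.

```c
#include <assert.h>
#include <stdbool.h>

#define READ_BEHIND     8
#define READ_AHEAD_MAX  31				/* assumed, see above */
#define READ_MAX        (1 + READ_AHEAD_MAX)		/* 32 pages */
#define NINCR           (READ_MAX / READ_BEHIND)	/* 4 increments */
#define SUM             (NINCR * (NINCR + 1) / 2)	/* 1+2+3+4 = 10 */
#define CACHE_BEHIND    (READ_BEHIND * SUM)		/* 80 pages */

/*
 * Hypothetical helper: return how many pages behind the fault would be
 * considered for unmapping and caching, or 0 if the optimization is
 * skipped.  "era" is the read-ahead window before this fault.
 */
static int
cache_behind_distance(int era, bool sequential)
{

	assert(era >= 0 && era <= READ_AHEAD_MAX);
	if (sequential && era == READ_AHEAD_MAX)
		return (CACHE_BEHIND);
	return (0);
}
```

    With these values, a sequential fault with a window of 15 or 23 pages
    leaves the preceding pages alone; only once the window has already
    reached 31 pages are up to 80 preceding pages considered.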
  
    The unmap and cache behind optimization now clears each page's referenced
    flag.  Previously, in the case of dirty pages, if the containing file was
    still mapped at the time that the page daemon examined the dirty pages,
    they would be reactivated.
  
    From a stylistic standpoint, this revision also cleanly separates the
    implementation of the read ahead and unmap/cache behind optimizations.

Modified:
  stable/9/sys/vm/vm_fault.c
  stable/9/sys/vm/vm_map.c
  stable/9/sys/vm/vm_map.h
Directory Properties:
  stable/9/sys/   (props changed)

Modified: stable/9/sys/vm/vm_fault.c
==============================================================================
--- stable/9/sys/vm/vm_fault.c	Thu May 24 02:46:35 2012	(r235875)
+++ stable/9/sys/vm/vm_fault.c	Thu May 24 03:38:47 2012	(r235876)
@@ -114,9 +114,11 @@ static int prefault_pageorder[] = {
 static int vm_fault_additional_pages(vm_page_t, int, int, vm_page_t *, int *);
 static void vm_fault_prefault(pmap_t, vm_offset_t, vm_map_entry_t);
 
-#define VM_FAULT_READ_AHEAD 8
-#define VM_FAULT_READ_BEHIND 7
-#define VM_FAULT_READ (VM_FAULT_READ_AHEAD+VM_FAULT_READ_BEHIND+1)
+#define	VM_FAULT_READ_BEHIND	8
+#define	VM_FAULT_READ_MAX	(1 + VM_FAULT_READ_AHEAD_MAX)
+#define	VM_FAULT_NINCR		(VM_FAULT_READ_MAX / VM_FAULT_READ_BEHIND)
+#define	VM_FAULT_SUM		(VM_FAULT_NINCR * (VM_FAULT_NINCR + 1) / 2)
+#define	VM_FAULT_CACHE_BEHIND	(VM_FAULT_READ_BEHIND * VM_FAULT_SUM)
 
 struct faultstate {
 	vm_page_t m;
@@ -132,6 +134,8 @@ struct faultstate {
 	int vfslocked;
 };
 
+static void vm_fault_cache_behind(const struct faultstate *fs, int distance);
+
 static inline void
 release_page(struct faultstate *fs)
 {
@@ -219,13 +223,13 @@ vm_fault_hold(vm_map_t map, vm_offset_t 
     int fault_flags, vm_page_t *m_hold)
 {
 	vm_prot_t prot;
-	int is_first_object_locked, result;
-	boolean_t growstack, wired;
+	long ahead, behind;
+	int alloc_req, era, faultcount, nera, reqpage, result;
+	boolean_t growstack, is_first_object_locked, wired;
 	int map_generation;
 	vm_object_t next_object;
-	vm_page_t marray[VM_FAULT_READ], mt, mt_prev;
+	vm_page_t marray[VM_FAULT_READ_MAX];
 	int hardfault;
-	int faultcount, ahead, behind, alloc_req;
 	struct faultstate fs;
 	struct vnode *vp;
 	int locked, error;
@@ -235,7 +239,7 @@ vm_fault_hold(vm_map_t map, vm_offset_t 
 	PCPU_INC(cnt.v_vm_faults);
 	fs.vp = NULL;
 	fs.vfslocked = 0;
-	faultcount = behind = 0;
+	faultcount = reqpage = 0;
 
 RetryFault:;
 
@@ -443,75 +447,47 @@ readrest:
 		 */
 		if (TRYPAGER) {
 			int rv;
-			int reqpage = 0;
 			u_char behavior = vm_map_entry_behavior(fs.entry);
 
 			if (behavior == MAP_ENTRY_BEHAV_RANDOM ||
 			    P_KILLED(curproc)) {
+				behind = 0;
 				ahead = 0;
+			} else if (behavior == MAP_ENTRY_BEHAV_SEQUENTIAL) {
 				behind = 0;
+				ahead = atop(fs.entry->end - vaddr) - 1;
+				if (ahead > VM_FAULT_READ_AHEAD_MAX)
+					ahead = VM_FAULT_READ_AHEAD_MAX;
+				if (fs.pindex == fs.entry->next_read)
+					vm_fault_cache_behind(&fs,
+					    VM_FAULT_READ_MAX);
 			} else {
-				behind = (vaddr - fs.entry->start) >> PAGE_SHIFT;
-				if (behind > VM_FAULT_READ_BEHIND)
-					behind = VM_FAULT_READ_BEHIND;
-
-				ahead = ((fs.entry->end - vaddr) >> PAGE_SHIFT) - 1;
-				if (ahead > VM_FAULT_READ_AHEAD)
-					ahead = VM_FAULT_READ_AHEAD;
-			}
-			is_first_object_locked = FALSE;
-			if ((behavior == MAP_ENTRY_BEHAV_SEQUENTIAL ||
-			     (behavior != MAP_ENTRY_BEHAV_RANDOM &&
-			      fs.pindex >= fs.entry->lastr &&
-			      fs.pindex < fs.entry->lastr + VM_FAULT_READ)) &&
-			    (fs.first_object == fs.object ||
-			     (is_first_object_locked = VM_OBJECT_TRYLOCK(fs.first_object))) &&
-			    fs.first_object->type != OBJT_DEVICE &&
-			    fs.first_object->type != OBJT_PHYS &&
-			    fs.first_object->type != OBJT_SG) {
-				vm_pindex_t firstpindex;
-
-				if (fs.first_pindex < 2 * VM_FAULT_READ)
-					firstpindex = 0;
-				else
-					firstpindex = fs.first_pindex - 2 * VM_FAULT_READ;
-				mt = fs.first_object != fs.object ?
-				    fs.first_m : fs.m;
-				KASSERT(mt != NULL, ("vm_fault: missing mt"));
-				KASSERT((mt->oflags & VPO_BUSY) != 0,
-				    ("vm_fault: mt %p not busy", mt));
-				mt_prev = vm_page_prev(mt);
-
 				/*
-				 * note: partially valid pages cannot be 
-				 * included in the lookahead - NFS piecemeal
-				 * writes will barf on it badly.
+				 * If this is a sequential page fault, then
+				 * arithmetically increase the number of pages
+				 * in the read-ahead window.  Otherwise, reset
+				 * the read-ahead window to its smallest size.
 				 */
-				while ((mt = mt_prev) != NULL &&
-				    mt->pindex >= firstpindex &&
-				    mt->valid == VM_PAGE_BITS_ALL) {
-					mt_prev = vm_page_prev(mt);
-					if (mt->busy ||
-					    (mt->oflags & VPO_BUSY))
-						continue;
-					vm_page_lock(mt);
-					if (mt->hold_count ||
-					    mt->wire_count) {
-						vm_page_unlock(mt);
-						continue;
-					}
-					pmap_remove_all(mt);
-					if (mt->dirty != 0)
-						vm_page_deactivate(mt);
-					else
-						vm_page_cache(mt);
-					vm_page_unlock(mt);
-				}
-				ahead += behind;
-				behind = 0;
+				behind = atop(vaddr - fs.entry->start);
+				if (behind > VM_FAULT_READ_BEHIND)
+					behind = VM_FAULT_READ_BEHIND;
+				ahead = atop(fs.entry->end - vaddr) - 1;
+				era = fs.entry->read_ahead;
+				if (fs.pindex == fs.entry->next_read) {
+					nera = era + behind;
+					if (nera > VM_FAULT_READ_AHEAD_MAX)
+						nera = VM_FAULT_READ_AHEAD_MAX;
+					behind = 0;
+					if (ahead > nera)
+						ahead = nera;
+					if (era == VM_FAULT_READ_AHEAD_MAX)
+						vm_fault_cache_behind(&fs,
+						    VM_FAULT_CACHE_BEHIND);
+				} else if (ahead > VM_FAULT_READ_AHEAD_MIN)
+					ahead = VM_FAULT_READ_AHEAD_MIN;
+				if (era != ahead)
+					fs.entry->read_ahead = ahead;
 			}
-			if (is_first_object_locked)
-				VM_OBJECT_UNLOCK(fs.first_object);
 
 			/*
 			 * Call the pager to retrieve the data, if any, after
@@ -882,7 +858,7 @@ vnode_locked:
 	 * without holding a write lock on it.
 	 */
 	if (hardfault)
-		fs.entry->lastr = fs.pindex + faultcount - behind;
+		fs.entry->next_read = fs.pindex + faultcount - reqpage;
 
 	if ((prot & VM_PROT_WRITE) != 0 ||
 	    (fault_flags & VM_FAULT_DIRTY) != 0) {
@@ -975,6 +951,60 @@ vnode_locked:
 }
 
 /*
+ * Speed up the reclamation of up to "distance" pages that precede the
+ * faulting pindex within the first object of the shadow chain.
+ */
+static void
+vm_fault_cache_behind(const struct faultstate *fs, int distance)
+{
+	vm_object_t first_object, object;
+	vm_page_t m, m_prev;
+	vm_pindex_t pindex;
+
+	object = fs->object;
+	VM_OBJECT_LOCK_ASSERT(object, MA_OWNED);
+	first_object = fs->first_object;
+	if (first_object != object) {
+		if (!VM_OBJECT_TRYLOCK(first_object)) {
+			VM_OBJECT_UNLOCK(object);
+			VM_OBJECT_LOCK(first_object);
+			VM_OBJECT_LOCK(object);
+		}
+	}
+	if (first_object->type != OBJT_DEVICE &&
+	    first_object->type != OBJT_PHYS && first_object->type != OBJT_SG) {
+		if (fs->first_pindex < distance)
+			pindex = 0;
+		else
+			pindex = fs->first_pindex - distance;
+		if (pindex < OFF_TO_IDX(fs->entry->offset))
+			pindex = OFF_TO_IDX(fs->entry->offset);
+		m = first_object != object ? fs->first_m : fs->m;
+		KASSERT((m->oflags & VPO_BUSY) != 0,
+		    ("vm_fault_cache_behind: page %p is not busy", m));
+		m_prev = vm_page_prev(m);
+		while ((m = m_prev) != NULL && m->pindex >= pindex &&
+		    m->valid == VM_PAGE_BITS_ALL) {
+			m_prev = vm_page_prev(m);
+			if (m->busy != 0 || (m->oflags & VPO_BUSY) != 0)
+				continue;
+			vm_page_lock(m);
+			if (m->hold_count == 0 && m->wire_count == 0) {
+				pmap_remove_all(m);
+				vm_page_aflag_clear(m, PGA_REFERENCED);
+				if (m->dirty != 0)
+					vm_page_deactivate(m);
+				else
+					vm_page_cache(m);
+			}
+			vm_page_unlock(m);
+		}
+	}
+	if (first_object != object)
+		VM_OBJECT_UNLOCK(first_object);
+}
+
+/*
  * vm_fault_prefault provides a quick way of clustering
  * pagefaults into a processes address space.  It is a "cousin"
  * of vm_map_pmap_enter, except it runs at page fault time instead

Modified: stable/9/sys/vm/vm_map.c
==============================================================================
--- stable/9/sys/vm/vm_map.c	Thu May 24 02:46:35 2012	(r235875)
+++ stable/9/sys/vm/vm_map.c	Thu May 24 03:38:47 2012	(r235876)
@@ -1300,6 +1300,8 @@ charged:
 	new_entry->protection = prot;
 	new_entry->max_protection = max;
 	new_entry->wired_count = 0;
+	new_entry->read_ahead = VM_FAULT_READ_AHEAD_INIT;
+	new_entry->next_read = OFF_TO_IDX(offset);
 
 	KASSERT(cred == NULL || !ENTRY_CHARGED(new_entry),
 	    ("OVERCOMMIT: vm_map_insert leaks vm_map %p", new_entry));

Modified: stable/9/sys/vm/vm_map.h
==============================================================================
--- stable/9/sys/vm/vm_map.h	Thu May 24 02:46:35 2012	(r235875)
+++ stable/9/sys/vm/vm_map.h	Thu May 24 03:38:47 2012	(r235876)
@@ -112,8 +112,9 @@ struct vm_map_entry {
 	vm_prot_t protection;		/* protection code */
 	vm_prot_t max_protection;	/* maximum protection */
 	vm_inherit_t inheritance;	/* inheritance */
+	uint8_t read_ahead;		/* pages in the read-ahead window */
 	int wired_count;		/* can be paged if = 0 */
-	vm_pindex_t lastr;		/* last read */
+	vm_pindex_t next_read;		/* index of the next sequential read */
 	struct ucred *cred;		/* tmp storage for creator ref */
 };
 
@@ -330,6 +331,14 @@ long vmspace_wired_count(struct vmspace 
 #define	VM_FAULT_DIRTY 2		/* Dirty the page; use w/VM_PROT_COPY */
 
 /*
+ * Initially, mappings are slightly sequential.  The maximum window size must
+ * account for the map entry's "read_ahead" field being defined as a uint8_t.
+ */
+#define	VM_FAULT_READ_AHEAD_MIN		7
+#define	VM_FAULT_READ_AHEAD_INIT	15
+#define	VM_FAULT_READ_AHEAD_MAX		min(atop(MAXPHYS) - 1, UINT8_MAX)
+
+/*
  * The following "find_space" options are supported by vm_map_find()
  */
 #define	VMFS_NO_SPACE		0	/* don't find; use the given range */