From: Alan Cox <alc@FreeBSD.org>
Date: Thu, 21 Jul 2016 20:11:44 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org,
	svn-src-stable@freebsd.org, svn-src-stable-11@freebsd.org
Subject: svn commit: r303161 - stable/11/sys/vm
Message-Id: <201607212011.u6LKBinI019892@repo.freebsd.org>

Author: alc
Date: Thu Jul 21 20:11:43 2016
New Revision: 303161
URL: https://svnweb.freebsd.org/changeset/base/303161

Log:
  MFC r302980
    Break up vm_fault()'s implementation of the read-ahead and delete-behind
    optimizations into two distinct pieces.  The first piece consists of the
    code that should only be performed once per page fault and requires the
    map to be locked.  The second piece consists of the code that should be
    performed each time a pager is called on an object in the shadow chain.
    (This second piece expects the map to be unlocked.)

    Previously, the entire implementation could be executed multiple times.
    Moreover, the second and subsequent executions would occur with the map
    unlocked.  Usually, the ensuing unsynchronized accesses to the map were
    harmless because the map was not changing.  Nonetheless, it was possible
    for a use-after-free error to occur, where vm_fault() wrote to a freed
    map entry.  This change corrects that problem.

  Approved by:	re (gjb)
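The fix is, at bottom, a snapshot-under-lock pattern.  A minimal standalone
sketch of that shape follows (illustration only, not the committed code;
every toy_* name is invented for this example): piece one runs at most once
per fault with the map locked and may dereference the map entry; piece two
runs once per object in the shadow chain with the map unlocked and consults
only the saved snapshot, so a concurrently freed map entry can no longer be
touched.

#include <assert.h>
#include <stdio.h>

/* Toy stand-ins for the map entry and fault state of vm_fault_hold(). */
struct toy_entry {
	unsigned long	start, end;	/* entry bounds, in bytes */
	unsigned char	read_ahead;	/* persisted read-ahead count */
};

struct toy_fault {
	int		nera;		/* -1 means piece one has not run */
	unsigned long	e_start, e_end;	/* bounds saved under the map lock */
};

/* Piece one: at most once per fault, map locked; may touch the entry. */
static void
toy_plan(struct toy_fault *fs, const struct toy_entry *entry)
{
	assert(fs->nera == -1);
	fs->nera = entry->read_ahead;	/* stand-in for the real policy */
	fs->e_start = entry->start;	/* snapshot, safe to use unlocked */
	fs->e_end = entry->end;
}

/*
 * Piece two: once per shadow-chain object, map unlocked; uses only the
 * snapshot, never the (possibly freed) entry itself.
 */
static unsigned long
toy_ahead(const struct toy_fault *fs, unsigned long vaddr, unsigned long pgsz)
{
	unsigned long ahead, left;

	assert(fs->nera >= 0);
	assert(vaddr >= fs->e_start && vaddr < fs->e_end);
	ahead = (unsigned long)fs->nera;
	left = (fs->e_end - vaddr) / pgsz - 1;	/* pages left in the entry */
	return (ahead < left ? ahead : left);
}

int
main(void)
{
	struct toy_entry entry = { 0x10000, 0x30000, 15 };
	struct toy_fault fs = { -1, 0, 0 };

	toy_plan(&fs, &entry);		/* map locked, once */
	/* Map now unlocked; repeatable for each object in the chain. */
	printf("read ahead at 0x20000: %lu pages\n",
	    toy_ahead(&fs, 0x20000, 0x1000));	/* 15 */
	printf("read ahead at 0x2f000: %lu pages\n",
	    toy_ahead(&fs, 0x2f000, 0x1000));	/* 0: last page of entry */
	return (0);
}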
Modified:
  stable/11/sys/vm/vm_fault.c
Directory Properties:
  stable/11/   (props changed)

Modified: stable/11/sys/vm/vm_fault.c
==============================================================================
--- stable/11/sys/vm/vm_fault.c	Thu Jul 21 19:27:04 2016	(r303160)
+++ stable/11/sys/vm/vm_fault.c	Thu Jul 21 20:11:43 2016	(r303161)
@@ -291,14 +291,17 @@ vm_fault_hold(vm_map_t map, vm_offset_t
 	int hardfault;
 	struct faultstate fs;
 	struct vnode *vp;
+	vm_offset_t e_end, e_start;
 	vm_page_t m;
-	int ahead, behind, cluster_offset, error, locked;
+	int ahead, behind, cluster_offset, error, locked, rv;
+	u_char behavior;
 
 	hardfault = 0;
 	growstack = TRUE;
 	PCPU_INC(cnt.v_vm_faults);
 	fs.vp = NULL;
 	faultcount = 0;
+	nera = -1;
 
 RetryFault:;
 
@@ -549,29 +552,25 @@ fast_failed:
 
 readrest:
 	/*
-	 * We have either allocated a new page or found an existing
-	 * page that is only partially valid.
-	 *
-	 * Attempt to fault-in the page if there is a chance that the
-	 * pager has it, and potentially fault in additional pages
-	 * at the same time.
+	 * If the pager for the current object might have the page,
+	 * then determine the number of additional pages to read and
+	 * potentially reprioritize previously read pages for earlier
+	 * reclamation.  These operations should only be performed
+	 * once per page fault.  Even if the current pager doesn't
+	 * have the page, the number of additional pages to read will
+	 * apply to subsequent objects in the shadow chain.
 	 */
-	if (fs.object->type != OBJT_DEFAULT) {
-		int rv;
-		u_char behavior = vm_map_entry_behavior(fs.entry);
-
+	if (fs.object->type != OBJT_DEFAULT && nera == -1 &&
+	    !P_KILLED(curproc)) {
+		KASSERT(fs.lookup_still_valid, ("map unlocked"));
 		era = fs.entry->read_ahead;
-		if (behavior == MAP_ENTRY_BEHAV_RANDOM ||
-		    P_KILLED(curproc)) {
-			behind = 0;
+		behavior = vm_map_entry_behavior(fs.entry);
+		if (behavior == MAP_ENTRY_BEHAV_RANDOM) {
 			nera = 0;
-			ahead = 0;
 		} else if (behavior == MAP_ENTRY_BEHAV_SEQUENTIAL) {
-			behind = 0;
 			nera = VM_FAULT_READ_AHEAD_MAX;
-			ahead = nera;
 			if (vaddr == fs.entry->next_read)
-				vm_fault_dontneed(&fs, vaddr, ahead);
+				vm_fault_dontneed(&fs, vaddr, nera);
 		} else if (vaddr == fs.entry->next_read) {
 			/*
 			 * This is a sequential fault.  Arithmetically
@@ -580,42 +579,51 @@ readrest:
 			 * number of pages is "# of sequential faults
 			 * x (read ahead min + 1) + read ahead min"
 			 */
-			behind = 0;
 			nera = VM_FAULT_READ_AHEAD_MIN;
 			if (era > 0) {
 				nera += era + 1;
 				if (nera > VM_FAULT_READ_AHEAD_MAX)
 					nera = VM_FAULT_READ_AHEAD_MAX;
 			}
-			ahead = nera;
 			if (era == VM_FAULT_READ_AHEAD_MAX)
-				vm_fault_dontneed(&fs, vaddr, ahead);
+				vm_fault_dontneed(&fs, vaddr, nera);
 		} else {
 			/*
-			 * This is a non-sequential fault.  Request a
-			 * cluster of pages that is aligned to a
-			 * VM_FAULT_READ_DEFAULT page offset boundary
-			 * within the object.  Alignment to a page
-			 * offset boundary is more likely to coincide
-			 * with the underlying file system block than
-			 * alignment to a virtual address boundary.
+			 * This is a non-sequential fault.
 			 */
-			cluster_offset = fs.pindex %
-			    VM_FAULT_READ_DEFAULT;
-			behind = ulmin(cluster_offset,
-			    atop(vaddr - fs.entry->start));
 			nera = 0;
-			ahead = VM_FAULT_READ_DEFAULT - 1 -
-			    cluster_offset;
 		}
-		ahead = ulmin(ahead, atop(fs.entry->end - vaddr) - 1);
-		if (era != nera)
+		if (era != nera) {
+			/*
+			 * A read lock on the map suffices to update
+			 * the read ahead count safely.
+			 */
 			fs.entry->read_ahead = nera;
+		}
 
 		/*
-		 * Call the pager to retrieve the data, if any, after
-		 * releasing the lock on the map.  We hold a ref on
-		 * fs.object and the pages are exclusive busied.
+		 * Prepare for unlocking the map.  Save the map
+		 * entry's start and end addresses, which are used to
+		 * optimize the size of the pager operation below.
+		 * Even if the map entry's addresses change after
+		 * unlocking the map, using the saved addresses is
+		 * safe.
+		 */
+		e_start = fs.entry->start;
+		e_end = fs.entry->end;
+	}
+
+	/*
+	 * Call the pager to retrieve the page if there is a chance
+	 * that the pager has it, and potentially retrieve additional
+	 * pages at the same time.
+	 */
+	if (fs.object->type != OBJT_DEFAULT) {
+		/*
+		 * We have either allocated a new page or found an
+		 * existing page that is only partially valid.  We
+		 * hold a reference on fs.object and the page is
+		 * exclusive busied.
 		 */
 		unlock_map(&fs);
 
@@ -656,6 +664,35 @@ vnode_locked:
 		 * Page in the requested page and hint the pager,
 		 * that it may bring up surrounding pages.
 		 */
+		if (nera == -1 || behavior == MAP_ENTRY_BEHAV_RANDOM ||
+		    P_KILLED(curproc)) {
+			behind = 0;
+			ahead = 0;
+		} else {
+			/* Is this a sequential fault? */
+			if (nera > 0) {
+				behind = 0;
+				ahead = nera;
+			} else {
+				/*
+				 * Request a cluster of pages that is
+				 * aligned to a VM_FAULT_READ_DEFAULT
+				 * page offset boundary within the
+				 * object.  Alignment to a page offset
+				 * boundary is more likely to coincide
+				 * with the underlying file system
+				 * block than alignment to a virtual
+				 * address boundary.
+				 */
+				cluster_offset = fs.pindex %
+				    VM_FAULT_READ_DEFAULT;
+				behind = ulmin(cluster_offset,
+				    atop(vaddr - e_start));
+				ahead = VM_FAULT_READ_DEFAULT - 1 -
+				    cluster_offset;
+			}
+			ahead = ulmin(ahead, atop(e_end - vaddr) - 1);
+		}
 		rv = vm_pager_get_pages(fs.object, &fs.m, 1,
 		    &behind, &ahead);
 		if (rv == VM_PAGER_OK) {
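The sequential-fault arithmetic in the hunks above grows the read-ahead
window toward the maximum with each consecutive fault.  A standalone sketch
of just that arithmetic, assuming the constants from sys/vm/vm_fault.c of
this era (VM_FAULT_READ_AHEAD_MIN is 7; VM_FAULT_READ_AHEAD_MAX is
min(atop(MAXPHYS) - 1, UINT8_MAX), i.e. 31 with the default 128 KB MAXPHYS);
verify them in your own tree before relying on the numbers:

#include <stdio.h>

/* Illustrative copies of the vm_fault.c constants described above. */
#define	TOY_READ_AHEAD_MIN	7
#define	TOY_READ_AHEAD_MAX	31

int
main(void)
{
	int era = 0, nera;

	for (int fault = 0; fault < 6; fault++) {
		/* The same steps as the "sequential fault" branch. */
		nera = TOY_READ_AHEAD_MIN;
		if (era > 0) {
			nera += era + 1;
			if (nera > TOY_READ_AHEAD_MAX)
				nera = TOY_READ_AHEAD_MAX;
		}
		/* Prints 7, 15, 23, 31, 31, 31: fault x 8 + 7, capped. */
		printf("sequential fault %d: %d pages of read ahead\n",
		    fault, nera);
		era = nera;	/* becomes fs.entry->read_ahead */
	}
	return (0);
}

This matches the rule quoted in the source comment, "# of sequential faults
x (read ahead min + 1) + read ahead min", until the cap is hit; once era
reaches the maximum, vm_fault_dontneed() also starts reprioritizing the
pages left behind for earlier reclamation.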
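For the non-sequential case, the final hunk aligns the cluster to a
VM_FAULT_READ_DEFAULT-page boundary in object space rather than centering it
on the virtual address.  A toy recomputation of that split, assuming
VM_FAULT_READ_DEFAULT is 16 (1 + VM_FAULT_READ_AHEAD_INIT in this era of the
file) and ignoring the ulmin() clipping against the saved e_start/e_end
bounds that the real code applies:

#include <stdio.h>

/* Illustrative copy of the constant; see sys/vm/vm_fault.c. */
#define	TOY_READ_DEFAULT	16

int
main(void)
{
	unsigned long pindex = 37;	/* arbitrary faulting page index */
	int ahead, behind, cluster_offset;

	cluster_offset = (int)(pindex % TOY_READ_DEFAULT);	/* 5 */
	behind = cluster_offset;				/* 5 back */
	ahead = TOY_READ_DEFAULT - 1 - cluster_offset;		/* 10 forward */

	/*
	 * Pages [32, 47]: exactly the aligned 16-page cluster holding
	 * page 37, so the request tends to line up with underlying
	 * file system block boundaries.
	 */
	printf("page %lu: %d behind, %d ahead -> pages [%lu, %lu]\n",
	    pindex, behind, ahead, pindex - behind, pindex + ahead);
	return (0);
}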