From owner-svn-src-all@FreeBSD.ORG  Thu Mar 22 04:52:51 2012
Return-Path: <owner-svn-src-all@FreeBSD.ORG>
Delivered-To: svn-src-all@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id C9DE51065670;
	Thu, 22 Mar 2012 04:52:51 +0000 (UTC) (envelope-from alc@FreeBSD.org)
Received: from svn.freebsd.org (svn.freebsd.org [IPv6:2001:4f8:fff6::2c])
	by mx1.freebsd.org (Postfix) with ESMTP id B24938FC0A;
	Thu, 22 Mar 2012 04:52:51 +0000 (UTC)
Received: from svn.freebsd.org (localhost [127.0.0.1])
	by svn.freebsd.org (8.14.4/8.14.4) with ESMTP id q2M4qpQO008187;
	Thu, 22 Mar 2012 04:52:51 GMT (envelope-from alc@svn.freebsd.org)
Received: (from alc@localhost)
	by svn.freebsd.org (8.14.4/8.14.4/Submit) id q2M4qp2H008178;
	Thu, 22 Mar 2012 04:52:51 GMT (envelope-from alc@svn.freebsd.org)
Message-Id: <201203220452.q2M4qp2H008178@svn.freebsd.org>
From: Alan Cox <alc@FreeBSD.org>
Date: Thu, 22 Mar 2012 04:52:51 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org,
	svn-src-head@freebsd.org
X-SVN-Group: head
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Cc: 
Subject: svn commit: r233291 - in head/sys: amd64/amd64 amd64/include
	i386/i386 i386/include kern sys vm
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for &quot;
	user&quot; and &quot; projects&quot; \)" <svn-src-all.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
	<mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
	<mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 22 Mar 2012 04:52:51 -0000

Author: alc
Date: Thu Mar 22 04:52:51 2012
New Revision: 233291
URL: http://svn.freebsd.org/changeset/base/233291

Log:
  Handle spurious page faults that may occur in no-fault sections of the
  kernel.
  
  When access restrictions are added to a page table entry, we flush the
  corresponding virtual address mapping from the TLB.  In contrast, when
  access restrictions are removed from a page table entry, we do not
  flush the virtual address mapping from the TLB.  This is exactly as
  recommended in AMD's documentation.  In effect, when access
  restrictions are removed from a page table entry, AMD's MMUs will
  transparently refresh a stale TLB entry.  In short, this saves us from
  having to perform potentially costly TLB flushes.  In contrast,
  Intel's MMUs are allowed to generate a spurious page fault based upon
  the stale TLB entry.  Usually, such spurious page faults are handled
  by vm_fault() without incident.  However, when we are executing
  no-fault sections of the kernel, we are not allowed to execute
  vm_fault().  This change introduces special-case handling for spurious
  page faults that occur in no-fault sections of the kernel.
  
  In collaboration with:	kib
  Tested by:		gibbs (an earlier version)
  
  I would also like to acknowledge Hiroki Sato's assistance in
  diagnosing this problem.
  
  MFC after:	1 week

Modified:
  head/sys/amd64/amd64/trap.c
  head/sys/amd64/include/proc.h
  head/sys/i386/i386/trap.c
  head/sys/i386/include/proc.h
  head/sys/kern/kern_sysctl.c
  head/sys/kern/subr_uio.c
  head/sys/sys/proc.h
  head/sys/vm/vm_fault.c

Modified: head/sys/amd64/amd64/trap.c
==============================================================================
--- head/sys/amd64/amd64/trap.c	Thu Mar 22 04:40:22 2012	(r233290)
+++ head/sys/amd64/amd64/trap.c	Thu Mar 22 04:52:51 2012	(r233291)
@@ -301,26 +301,6 @@ trap(struct trapframe *frame)
 	}
 
 	code = frame->tf_err;
-	if (type == T_PAGEFLT) {
-		/*
-		 * If we get a page fault while in a critical section, then
-		 * it is most likely a fatal kernel page fault.  The kernel
-		 * is already going to panic trying to get a sleep lock to
-		 * do the VM lookup, so just consider it a fatal trap so the
-		 * kernel can print out a useful trap message and even get
-		 * to the debugger.
-		 *
-		 * If we get a page fault while holding a non-sleepable
-		 * lock, then it is most likely a fatal kernel page fault.
-		 * If WITNESS is enabled, then it's going to whine about
-		 * bogus LORs with various VM locks, so just skip to the
-		 * fatal trap handling directly.
-		 */
-		if (td->td_critnest != 0 ||
-		    WITNESS_CHECK(WARN_SLEEPOK | WARN_GIANTOK, NULL,
-		    "Kernel page fault") != 0)
-			trap_fatal(frame, frame->tf_addr);
-	}
 
         if (ISPL(frame->tf_cs) == SEL_UPL) {
 		/* user trap */
@@ -653,6 +633,50 @@ trap_pfault(frame, usermode)
 	struct proc *p = td->td_proc;
 	vm_offset_t eva = frame->tf_addr;
 
+	if (__predict_false((td->td_pflags & TDP_NOFAULTING) != 0)) {
+		/*
+		 * Due to both processor errata and lazy TLB invalidation when
+		 * access restrictions are removed from virtual pages, memory
+		 * accesses that are allowed by the physical mapping layer may
+		 * nonetheless cause one spurious page fault per virtual page. 
+		 * When the thread is executing a "no faulting" section that
+		 * is bracketed by vm_fault_{disable,enable}_pagefaults(),
+		 * every page fault is treated as a spurious page fault,
+		 * unless it accesses the same virtual address as the most
+		 * recent page fault within the same "no faulting" section.
+		 */
+		if (td->td_md.md_spurflt_addr != eva ||
+		    (td->td_pflags & TDP_RESETSPUR) != 0) {
+			/*
+			 * Do nothing to the TLB.  A stale TLB entry is
+			 * flushed automatically by a page fault.
+			 */
+			td->td_md.md_spurflt_addr = eva;
+			td->td_pflags &= ~TDP_RESETSPUR;
+			return (0);
+		}
+	} else {
+		/*
+		 * If we get a page fault while in a critical section, then
+		 * it is most likely a fatal kernel page fault.  The kernel
+		 * is already going to panic trying to get a sleep lock to
+		 * do the VM lookup, so just consider it a fatal trap so the
+		 * kernel can print out a useful trap message and even get
+		 * to the debugger.
+		 *
+		 * If we get a page fault while holding a non-sleepable
+		 * lock, then it is most likely a fatal kernel page fault.
+		 * If WITNESS is enabled, then it's going to whine about
+		 * bogus LORs with various VM locks, so just skip to the
+		 * fatal trap handling directly.
+		 */
+		if (td->td_critnest != 0 ||
+		    WITNESS_CHECK(WARN_SLEEPOK | WARN_GIANTOK, NULL,
+		    "Kernel page fault") != 0) {
+			trap_fatal(frame, eva);
+			return (-1);
+		}
+	}
 	va = trunc_page(eva);
 	if (va >= VM_MIN_KERNEL_ADDRESS) {
 		/*

Modified: head/sys/amd64/include/proc.h
==============================================================================
--- head/sys/amd64/include/proc.h	Thu Mar 22 04:40:22 2012	(r233290)
+++ head/sys/amd64/include/proc.h	Thu Mar 22 04:52:51 2012	(r233291)
@@ -46,6 +46,7 @@ struct proc_ldt {
 struct mdthread {
 	int	md_spinlock_count;	/* (k) */
 	register_t md_saved_flags;	/* (k) */
+	register_t md_spurflt_addr;	/* (k) Spurious page fault address. */
 };
 
 struct mdproc {

Modified: head/sys/i386/i386/trap.c
==============================================================================
--- head/sys/i386/i386/trap.c	Thu Mar 22 04:40:22 2012	(r233290)
+++ head/sys/i386/i386/trap.c	Thu Mar 22 04:52:51 2012	(r233291)
@@ -330,28 +330,13 @@ trap(struct trapframe *frame)
 		 * For some Cyrix CPUs, %cr2 is clobbered by
 		 * interrupts.  This problem is worked around by using
 		 * an interrupt gate for the pagefault handler.  We
-		 * are finally ready to read %cr2 and then must
-		 * reenable interrupts.
-		 *
-		 * If we get a page fault while in a critical section, then
-		 * it is most likely a fatal kernel page fault.  The kernel
-		 * is already going to panic trying to get a sleep lock to
-		 * do the VM lookup, so just consider it a fatal trap so the
-		 * kernel can print out a useful trap message and even get
-		 * to the debugger.
-		 *
-		 * If we get a page fault while holding a non-sleepable
-		 * lock, then it is most likely a fatal kernel page fault.
-		 * If WITNESS is enabled, then it's going to whine about
-		 * bogus LORs with various VM locks, so just skip to the
-		 * fatal trap handling directly.
+		 * are finally ready to read %cr2 and conditionally
+		 * reenable interrupts.  If we hold a spin lock, then
+		 * we must not reenable interrupts.  This might be a
+		 * spurious page fault.
 		 */
 		eva = rcr2();
-		if (td->td_critnest != 0 ||
-		    WITNESS_CHECK(WARN_SLEEPOK | WARN_GIANTOK, NULL,
-		    "Kernel page fault") != 0)
-			trap_fatal(frame, eva);
-		else
+		if (td->td_md.md_spinlock_count == 0)
 			enable_intr();
 	}
 
@@ -804,6 +789,50 @@ trap_pfault(frame, usermode, eva)
 	struct thread *td = curthread;
 	struct proc *p = td->td_proc;
 
+	if (__predict_false((td->td_pflags & TDP_NOFAULTING) != 0)) {
+		/*
+		 * Due to both processor errata and lazy TLB invalidation when
+		 * access restrictions are removed from virtual pages, memory
+		 * accesses that are allowed by the physical mapping layer may
+		 * nonetheless cause one spurious page fault per virtual page. 
+		 * When the thread is executing a "no faulting" section that
+		 * is bracketed by vm_fault_{disable,enable}_pagefaults(),
+		 * every page fault is treated as a spurious page fault,
+		 * unless it accesses the same virtual address as the most
+		 * recent page fault within the same "no faulting" section.
+		 */
+		if (td->td_md.md_spurflt_addr != eva ||
+		    (td->td_pflags & TDP_RESETSPUR) != 0) {
+			/*
+			 * Do nothing to the TLB.  A stale TLB entry is
+			 * flushed automatically by a page fault.
+			 */
+			td->td_md.md_spurflt_addr = eva;
+			td->td_pflags &= ~TDP_RESETSPUR;
+			return (0);
+		}
+	} else {
+		/*
+		 * If we get a page fault while in a critical section, then
+		 * it is most likely a fatal kernel page fault.  The kernel
+		 * is already going to panic trying to get a sleep lock to
+		 * do the VM lookup, so just consider it a fatal trap so the
+		 * kernel can print out a useful trap message and even get
+		 * to the debugger.
+		 *
+		 * If we get a page fault while holding a non-sleepable
+		 * lock, then it is most likely a fatal kernel page fault.
+		 * If WITNESS is enabled, then it's going to whine about
+		 * bogus LORs with various VM locks, so just skip to the
+		 * fatal trap handling directly.
+		 */
+		if (td->td_critnest != 0 ||
+		    WITNESS_CHECK(WARN_SLEEPOK | WARN_GIANTOK, NULL,
+		    "Kernel page fault") != 0) {
+			trap_fatal(frame, eva);
+			return (-1);
+		}
+	}
 	va = trunc_page(eva);
 	if (va >= KERNBASE) {
 		/*

Modified: head/sys/i386/include/proc.h
==============================================================================
--- head/sys/i386/include/proc.h	Thu Mar 22 04:40:22 2012	(r233290)
+++ head/sys/i386/include/proc.h	Thu Mar 22 04:52:51 2012	(r233291)
@@ -51,6 +51,7 @@ struct proc_ldt {
 struct mdthread {
 	int	md_spinlock_count;	/* (k) */
 	register_t md_saved_flags;	/* (k) */
+	register_t md_spurflt_addr;	/* (k) Spurious page fault address. */
 };
 
 struct mdproc {

Modified: head/sys/kern/kern_sysctl.c
==============================================================================
--- head/sys/kern/kern_sysctl.c	Thu Mar 22 04:40:22 2012	(r233290)
+++ head/sys/kern/kern_sysctl.c	Thu Mar 22 04:52:51 2012	(r233291)
@@ -1294,8 +1294,8 @@ kernel_sysctlbyname(struct thread *td, c
 static int
 sysctl_old_user(struct sysctl_req *req, const void *p, size_t l)
 {
-	int error = 0;
 	size_t i, len, origidx;
+	int error;
 
 	origidx = req->oldidx;
 	req->oldidx += l;
@@ -1316,10 +1316,14 @@ sysctl_old_user(struct sysctl_req *req, 
 	else {
 		if (i > len - origidx)
 			i = len - origidx;
-		error = copyout(p, (char *)req->oldptr + origidx, i);
+		if (req->lock == REQ_WIRED) {
+			error = copyout_nofault(p, (char *)req->oldptr +
+			    origidx, i);
+		} else
+			error = copyout(p, (char *)req->oldptr + origidx, i);
+		if (error != 0)
+			return (error);
 	}
-	if (error)
-		return (error);
 	if (i < l)
 		return (ENOMEM);
 	return (0);

Modified: head/sys/kern/subr_uio.c
==============================================================================
--- head/sys/kern/subr_uio.c	Thu Mar 22 04:40:22 2012	(r233290)
+++ head/sys/kern/subr_uio.c	Thu Mar 22 04:52:51 2012	(r233291)
@@ -187,8 +187,12 @@ uiomove_faultflag(void *cp, int n, struc
 
 	/* XXX does it make a sense to set TDP_DEADLKTREAT for UIO_SYSSPACE ? */
 	newflags = TDP_DEADLKTREAT;
-	if (uio->uio_segflg == UIO_USERSPACE && nofault)
-		newflags |= TDP_NOFAULTING;
+	if (uio->uio_segflg == UIO_USERSPACE && nofault) {
+		/*
+		 * Fail if a non-spurious page fault occurs.
+		 */
+		newflags |= TDP_NOFAULTING | TDP_RESETSPUR;
+	}
 	save = curthread_pflags_set(newflags);
 
 	while (n > 0 && uio->uio_resid) {

Modified: head/sys/sys/proc.h
==============================================================================
--- head/sys/sys/proc.h	Thu Mar 22 04:40:22 2012	(r233290)
+++ head/sys/sys/proc.h	Thu Mar 22 04:52:51 2012	(r233291)
@@ -417,6 +417,7 @@ do {									\
 #define	TDP_IGNSUSP	0x00800000 /* Permission to ignore the MNTK_SUSPEND* */
 #define	TDP_AUDITREC	0x01000000 /* Audit record pending on thread */
 #define	TDP_RFPPWAIT	0x02000000 /* Handle RFPPWAIT on syscall exit */
+#define	TDP_RESETSPUR	0x04000000 /* Reset spurious page fault history. */
 
 /*
  * Reasons that the current thread can not be run yet.

Modified: head/sys/vm/vm_fault.c
==============================================================================
--- head/sys/vm/vm_fault.c	Thu Mar 22 04:40:22 2012	(r233290)
+++ head/sys/vm/vm_fault.c	Thu Mar 22 04:52:51 2012	(r233291)
@@ -1468,11 +1468,17 @@ vm_fault_additional_pages(m, rbehind, ra
 	return i;
 }
 
+/*
+ * Block entry into the machine-independent layer's page fault handler by
+ * the calling thread.  Subsequent calls to vm_fault() by that thread will
+ * return KERN_PROTECTION_FAILURE.  Enable machine-dependent handling of
+ * spurious page faults. 
+ */
 int
 vm_fault_disable_pagefaults(void)
 {
 
-	return (curthread_pflags_set(TDP_NOFAULTING));
+	return (curthread_pflags_set(TDP_NOFAULTING | TDP_RESETSPUR));
 }
 
 void