From owner-freebsd-arch@freebsd.org  Tue Aug  4 18:14:50 2015
Return-Path: <owner-freebsd-arch@freebsd.org>
Delivered-To: freebsd-arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 24D4A9B2993
 for <freebsd-arch@mailman.ysv.freebsd.org>;
 Tue,  4 Aug 2015 18:14:50 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id F40C91EB9
 for <freebsd-arch@freebsd.org>; Tue,  4 Aug 2015 18:14:49 +0000 (UTC)
 (envelope-from jhb@freebsd.org)
Received: from ralph.baldwin.cx (75-48-78-19.lightspeed.cncrca.sbcglobal.net
 [75.48.78.19])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id 145B4B93A
 for <freebsd-arch@freebsd.org>; Tue,  4 Aug 2015 14:14:48 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: 'freebsd-arch' <freebsd-arch@freebsd.org>
Subject: Supporting cross-debugging vmcores in libkvm
Date: Tue, 04 Aug 2015 10:56:09 -0700
Message-ID: <3121152.ujdxFEovO3@ralph.baldwin.cx>
User-Agent: KMail/4.14.3 (FreeBSD/10.2-PRERELEASE; KDE/4.14.3; amd64; ; )
MIME-Version: 1.0
Content-Transfer-Encoding: 7Bit
Content-Type: text/plain; charset="us-ascii"
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Tue, 04 Aug 2015 14:14:48 -0400 (EDT)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 04 Aug 2015 18:14:50 -0000

Many debuggers (recent gdb and lldb) support cross-architecture debugging 
just fine.  My current WIP port of kgdb to gdb7 supports cross-debugging for
remote targets already, but I wanted it to also support cross-debugging for
vmcores.

The existing libkvm/kgdb code in the tree has some limited support for
cross-debugging.  It requires building a custom libkvm (e.g. libkvm-i386.a)
and custom kgdb for each target platform.  However, gdb (and lldb) both
support multiple targets in a single binary, so I'd like to have a single
kgdb binary that can cross-debug anything.

I started hacking on libkvm last weekend and have a prototype that I've used
(along with some patches to my kgdb port) to debug an amd64 vmcore on an
i386 machine and vice versa.

To do this I've made some additions to the libkvm API:

1) A new 'kvaddr_t' type represents a kernel virtual address.  This is
   similar to the psaddr_t type used for MI process addresses in userland
   debugging.  I almost reused psaddr_t directly, but that would have made
   <libkvm.h> depend on <sys/procfs.h>.  Instead, I opted for a separate
   type.  It is currently a uint64_t.

2) A new 'struct kvm_nlist'.  This is a stripped-down version of
   'struct nlist' that uses kvadd_t for n_value instead of an unsigned
   long.

3) kvm_native() returns true if an open kvm descriptor is for a native
   kernel and memory image.

4) kvm_nlist2() is like kvm_nlist() but it uses 'struct kvm_nlist'
   instead of 'struct nlist'.  Internally symbol names are always
   resolved to kvaddr_t addresses rather than u_long addresses.
   Native kernels still use _fdnlist() from libc to resolve symbols.
   Cross kernels use a caller supplied function to resolve symbols
   (the older cross code for libkvm required the caller to provide
   a global ps_pglobal_lookup symbol typically provided for
   <proc_service.h>).

5) kvm_open2() is like kvm_openfiles() except that it drops the unused
   'swapfile' argument and adds a new function pointer argument to a
   symbol resolving function.  The function pointer can be NULL in
   which case only native kernels can be opened.  Kernels used with
   /dev/mem or /dev/kmem must be native.

6) kvm_read2() is like kvm_read() except that it uses kvaddr_t
   instead of unsigned long for the kernel virtual address.

Adding new symbols (specifically kvm_nlist2 and kvm_read2) preserves
ABI and API compatibility.  Note that most libkvm functions such as
kvm_getprocs(), etc. only work with native kernels.  I have not yet
done a full sweep to force them to fail for non-native kernels.

Also, the vnet and dpcpu stuff only works for native kernels currently
though that can be fixed at some point in the future.

For the MD backends, I've added a new kvm_arch switch:

struct kvm_arch {
	int	(*ka_probe)(kvm_t *);
	int	(*ka_initvtop)(kvm_t *);
	void	(*ka_freevtop)(kvm_t *);
	int	(*ka_kvatop)(kvm_t *, kvaddr_t, off_t *);
	int	(*ka_uvatop)(kvm_t *, const struct proc *, kvaddr_t, off_t *);
	int	ka_native;
};

Each backend implements the necessary callbacks (uvatop is optional)
and is added to a global linker set that kvm_open2() walks to find the
appropriate kvm_arch for a given kernel + vmcore.  On x86 I've used
separate kvm_arch structures for "plain" vs minidumps.

The backends now have to avoid using native headers.  For ELF handling
this means using libelf instead of <machine/elf.h> and raw mmap().  For
the x86 backends it meant defining some duplicate constants for certain
page table fields since <machine/pmap.h> can't be relied on (e.g.
I386_PG_V instead of PG_V).  I added static assertions in the "native"
case (e.g. building kvm_i386.c on i386) to ensure the duplicate constants
match the originals.

You can see the current WIP patches here:

https://github.com/freebsd/freebsd/compare/master...bsdjhb:kgdb_enhancements

What I'm mostly after is comments on the API, etc.  Once that is settled I
will move forward on converting and/or stubbing the other backends (the
stub route would be to only support other backends on native systems for
now).

Oh, and I do hope to have a 'KGDB' option for the devel/gdb port in the
near future.

-- 
John Baldwin

From owner-freebsd-arch@freebsd.org  Tue Aug  4 19:01:00 2015
Return-Path: <owner-freebsd-arch@freebsd.org>
Delivered-To: freebsd-arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 46D5D9957BF
 for <freebsd-arch@mailman.ysv.freebsd.org>;
 Tue,  4 Aug 2015 19:01:00 +0000 (UTC)
 (envelope-from jmg@gold.funkthat.com)
Received: from gold.funkthat.com (gate2.funkthat.com [208.87.223.18])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client CN "gold.funkthat.com", Issuer "gold.funkthat.com" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 2C3681DEF;
 Tue,  4 Aug 2015 19:00:59 +0000 (UTC)
 (envelope-from jmg@gold.funkthat.com)
Received: from gold.funkthat.com (localhost [127.0.0.1])
 by gold.funkthat.com (8.14.5/8.14.5) with ESMTP id t74J0xa0031776
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Tue, 4 Aug 2015 12:00:59 -0700 (PDT)
 (envelope-from jmg@gold.funkthat.com)
Received: (from jmg@localhost)
 by gold.funkthat.com (8.14.5/8.14.5/Submit) id t74J0xmq031775;
 Tue, 4 Aug 2015 12:00:59 -0700 (PDT) (envelope-from jmg)
Date: Tue, 4 Aug 2015 12:00:59 -0700
From: John-Mark Gurney <jmg@funkthat.com>
To: John Baldwin <jhb@freebsd.org>
Cc: "'freebsd-arch'" <freebsd-arch@freebsd.org>
Subject: Re: Supporting cross-debugging vmcores in libkvm
Message-ID: <20150804190058.GM78154@funkthat.com>
References: <3121152.ujdxFEovO3@ralph.baldwin.cx>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <3121152.ujdxFEovO3@ralph.baldwin.cx>
X-Operating-System: FreeBSD 9.1-PRERELEASE amd64
X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88  9322 9CB1 8F74 6D3F A396
X-Files: The truth is out there
X-URL: http://resnet.uoregon.edu/~gurney_j/
X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html
X-TipJar: bitcoin:13Qmb6AeTgQecazTWph4XasEsP7nGRbAPE
X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger?
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.2.7
 (gold.funkthat.com [127.0.0.1]); Tue, 04 Aug 2015 12:00:59 -0700 (PDT)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 04 Aug 2015 19:01:00 -0000

John Baldwin wrote this message on Tue, Aug 04, 2015 at 10:56 -0700:
> Many debuggers (recent gdb and lldb) support cross-architecture debugging 
> just fine.  My current WIP port of kgdb to gdb7 supports cross-debugging for
> remote targets already, but I wanted it to also support cross-debugging for
> vmcores.

Have you looked at the work the my GSoC student, Daniel Lovasko, is
doing:
https://wiki.freebsd.org/SummerOfCode2015/TypeAwareKernelVirtualMemoryAccess

This uses libctf to completely abstract out the accessing of data
in libkvm so that it can be used w/ any arch as long as you have ctf
data...  This means you could use netstat on amd64 on an armeb vmcore
w/o issues...

It does look like some of this is still useful, but want to make sure
that we aren't reproducing tons of work...

For example, he's working on procstat right now:
https://github.com/lovasko/taprocstat

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."

From owner-freebsd-arch@freebsd.org  Wed Aug  5 00:09:43 2015
Return-Path: <owner-freebsd-arch@freebsd.org>
Delivered-To: freebsd-arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 47A3A9B3950
 for <freebsd-arch@mailman.ysv.freebsd.org>;
 Wed,  5 Aug 2015 00:09:43 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 26432198B
 for <freebsd-arch@freebsd.org>; Wed,  5 Aug 2015 00:09:43 +0000 (UTC)
 (envelope-from jhb@freebsd.org)
Received: from ralph.baldwin.cx (75-48-78-19.lightspeed.cncrca.sbcglobal.net
 [75.48.78.19])
 by bigwig.baldwin.cx (Postfix) with ESMTPSA id 3A42FB926;
 Tue,  4 Aug 2015 20:09:42 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: John-Mark Gurney <jmg@funkthat.com>
Cc: 'freebsd-arch' <freebsd-arch@freebsd.org>
Subject: Re: Supporting cross-debugging vmcores in libkvm
Date: Tue, 04 Aug 2015 17:09:39 -0700
Message-ID: <2016463.jQeq4BdiyV@ralph.baldwin.cx>
User-Agent: KMail/4.14.3 (FreeBSD/10.2-PRERELEASE; KDE/4.14.3; amd64; ; )
In-Reply-To: <20150804190058.GM78154@funkthat.com>
References: <3121152.ujdxFEovO3@ralph.baldwin.cx>
 <20150804190058.GM78154@funkthat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 7Bit
Content-Type: text/plain; charset="us-ascii"
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
 (bigwig.baldwin.cx); Tue, 04 Aug 2015 20:09:42 -0400 (EDT)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 05 Aug 2015 00:09:43 -0000

On Tuesday, August 04, 2015 12:00:59 PM John-Mark Gurney wrote:
> John Baldwin wrote this message on Tue, Aug 04, 2015 at 10:56 -0700:
> > Many debuggers (recent gdb and lldb) support cross-architecture debugging 
> > just fine.  My current WIP port of kgdb to gdb7 supports cross-debugging for
> > remote targets already, but I wanted it to also support cross-debugging for
> > vmcores.
> 
> Have you looked at the work the my GSoC student, Daniel Lovasko, is
> doing:
> https://wiki.freebsd.org/SummerOfCode2015/TypeAwareKernelVirtualMemoryAccess
> 
> This uses libctf to completely abstract out the accessing of data
> in libkvm so that it can be used w/ any arch as long as you have ctf
> data...  This means you could use netstat on amd64 on an armeb vmcore
> w/o issues...
> 
> It does look like some of this is still useful, but want to make sure
> that we aren't reproducing tons of work...
> 
> For example, he's working on procstat right now:
> https://github.com/lovasko/taprocstat

That doesn't seem to address the need of how you parse the actual vmcore
file itself to resolve a virtual address to a location in the vmcore file.
(e.g. on "plain" dumps on i386 this means walking the page tables whereas
for minidumps it means parsing a special set of PTEs and bitmap at the
start of the file).

To be clear, all that my work enables is doing a kvm_read() of a foreign
vmcore.  All the logic to decide how many bytes to read and at what
address (and then decoding those appropriately) happens in the debugger
(gdb/lldb, etc.).  The project here seems to be using CTF instead of
dwarf to do the sort of things the debugger does when you 'p *foo', but
you still need a way to find the 'foo' in the vmcore file.

-- 
John Baldwin

From owner-freebsd-arch@freebsd.org  Wed Aug  5 13:14:52 2015
Return-Path: <owner-freebsd-arch@freebsd.org>
Delivered-To: freebsd-arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9E3389B4667
 for <freebsd-arch@mailman.ysv.freebsd.org>;
 Wed,  5 Aug 2015 13:14:52 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org
 [IPv6:2001:1900:2254:206a::50:5])
 by mx1.freebsd.org (Postfix) with ESMTP id 827A61639
 for <freebsd-arch@freebsd.org>; Wed,  5 Aug 2015 13:14:52 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: by mailman.ysv.freebsd.org (Postfix)
 id 7FAB09B4666; Wed,  5 Aug 2015 13:14:52 +0000 (UTC)
Delivered-To: arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 7F2A69B4665
 for <arch@mailman.ysv.freebsd.org>; Wed,  5 Aug 2015 13:14:52 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id D37F81638
 for <arch@freebsd.org>; Wed,  5 Aug 2015 13:14:51 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from tom.home (kostik@localhost [127.0.0.1])
 by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id t75DEiph045782
 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO)
 for <arch@freebsd.org>; Wed, 5 Aug 2015 16:14:44 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.9.2 kib.kiev.ua t75DEiph045782
Received: (from kostik@localhost)
 by tom.home (8.15.2/8.15.2/Submit) id t75DEit6045781
 for arch@freebsd.org; Wed, 5 Aug 2015 16:14:44 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Wed, 5 Aug 2015 16:14:44 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: arch@freebsd.org
Subject: The kern.kstack_pages tunable for some architectures
Message-ID: <20150805131444.GY2072@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 autolearn_force=no version=3.4.1
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 05 Aug 2015 13:14:52 -0000

Patch at the end of the message adds kern.kstack_pages tunable for the
amd64, arm, i386, and powerpc architectures. I tested it on amd64 and
i386. From the visual inspection, it should work on arm and powerpc, on
all listed arches (except i386) the thread0 kstack is initialized after
the init_param1() is done. For amd64 this is ensured in the patch, for
i386 the TD0_KSTACK_PAGES param.h define defines the thread0 stack size,
it is impossible to use environment from the locore.

What makes me wonder is the USPACE_SVC_STACK_TOP define for arm and
the USPACE define for powerpc. They use the global value (KSTACK_PAGES
before the patch, kstack_pages after) to calculate the address of pcb,
which is wrong for non-default stack size.

I gave up on arm64 and sparc64, because they size statically defined
objects from the KSTACK_PAGES. I do not understand the arches bootstrap
to touch the code. For mips, there is even more wondering use of
KSTACK_PAGES to size the store of the kstack ptes in the thread md
struct, which should make the same problems as the USPACE_SVC_STACK_TOP
and USPACE.

Does anybody have opinion on the change ?  Could somebody test at least
some arm boards and powerpc ?

diff --git a/sys/amd64/amd64/genassym.c b/sys/amd64/amd64/genassym.c
index 5b1e089..d087fdc 100644
--- a/sys/amd64/amd64/genassym.c
+++ b/sys/amd64/amd64/genassym.c
@@ -93,7 +93,6 @@ ASSYM(TDP_KTHREAD, TDP_KTHREAD);
 ASSYM(V_TRAP, offsetof(struct vmmeter, v_trap));
 ASSYM(V_SYSCALL, offsetof(struct vmmeter, v_syscall));
 ASSYM(V_INTR, offsetof(struct vmmeter, v_intr));
-ASSYM(KSTACK_PAGES, KSTACK_PAGES);
 ASSYM(PAGE_SIZE, PAGE_SIZE);
 ASSYM(NPTEPG, NPTEPG);
 ASSYM(NPDEPG, NPDEPG);
diff --git a/sys/amd64/amd64/machdep.c b/sys/amd64/amd64/machdep.c
index a571390..f579f98 100644
--- a/sys/amd64/amd64/machdep.c
+++ b/sys/amd64/amd64/machdep.c
@@ -1516,12 +1516,6 @@ hammer_time(u_int64_t modulep, u_int64_t physfree)
 	char *env;
 	size_t kstack0_sz;
 
-	thread0.td_kstack = physfree + KERNBASE;
-	thread0.td_kstack_pages = KSTACK_PAGES;
-	kstack0_sz = thread0.td_kstack_pages * PAGE_SIZE;
-	bzero((void *)thread0.td_kstack, kstack0_sz);
-	physfree += kstack0_sz;
-
 	/*
  	 * This may be done better later if it gets more high level
  	 * components in it. If so just link td->td_proc here.
@@ -1533,6 +1527,12 @@ hammer_time(u_int64_t modulep, u_int64_t physfree)
 	/* Init basic tunables, hz etc */
 	init_param1();
 
+	thread0.td_kstack = physfree + KERNBASE;
+	thread0.td_kstack_pages = kstack_pages;
+	kstack0_sz = thread0.td_kstack_pages * PAGE_SIZE;
+	bzero((void *)thread0.td_kstack, kstack0_sz);
+	physfree += kstack0_sz;
+
 	/*
 	 * make gdt memory segments
 	 */
diff --git a/sys/amd64/amd64/mp_machdep.c b/sys/amd64/amd64/mp_machdep.c
index a2ca9e2..0562ca4 100644
--- a/sys/amd64/amd64/mp_machdep.c
+++ b/sys/amd64/amd64/mp_machdep.c
@@ -348,7 +348,7 @@ native_start_all_aps(void)
 
 		/* allocate and set up an idle stack data page */
 		bootstacks[cpu] = (void *)kmem_malloc(kernel_arena,
-		    KSTACK_PAGES * PAGE_SIZE, M_WAITOK | M_ZERO);
+		    kstack_pages * PAGE_SIZE, M_WAITOK | M_ZERO);
 		doublefault_stack = (char *)kmem_malloc(kernel_arena,
 		    PAGE_SIZE, M_WAITOK | M_ZERO);
 		nmi_stack = (char *)kmem_malloc(kernel_arena, PAGE_SIZE,
@@ -356,7 +356,7 @@ native_start_all_aps(void)
 		dpcpu = (void *)kmem_malloc(kernel_arena, DPCPU_SIZE,
 		    M_WAITOK | M_ZERO);
 
-		bootSTK = (char *)bootstacks[cpu] + KSTACK_PAGES * PAGE_SIZE - 8;
+		bootSTK = (char *)bootstacks[cpu] + kstack_pages * PAGE_SIZE - 8;
 		bootAP = cpu;
 
 		/* attempt to start the Application Processor */
diff --git a/sys/arm/arm/machdep.c b/sys/arm/arm/machdep.c
index 67e081d..a664ac4 100644
--- a/sys/arm/arm/machdep.c
+++ b/sys/arm/arm/machdep.c
@@ -1066,7 +1066,7 @@ init_proc0(vm_offset_t kstack)
 	proc_linkup0(&proc0, &thread0);
 	thread0.td_kstack = kstack;
 	thread0.td_pcb = (struct pcb *)
-		(thread0.td_kstack + KSTACK_PAGES * PAGE_SIZE) - 1;
+		(thread0.td_kstack + kstack_pages * PAGE_SIZE) - 1;
 	thread0.td_pcb->pcb_flags = 0;
 	thread0.td_pcb->pcb_vfpcpu = -1;
 	thread0.td_pcb->pcb_vfpstate.fpscr = VFPSCR_DN | VFPSCR_FZ;
@@ -1360,7 +1360,7 @@ initarm(struct arm_boot_params *abp)
 	valloc_pages(irqstack, IRQ_STACK_SIZE * MAXCPU);
 	valloc_pages(abtstack, ABT_STACK_SIZE * MAXCPU);
 	valloc_pages(undstack, UND_STACK_SIZE * MAXCPU);
-	valloc_pages(kernelstack, KSTACK_PAGES * MAXCPU);
+	valloc_pages(kernelstack, kstack_pages * MAXCPU);
 	valloc_pages(msgbufpv, round_page(msgbufsize) / PAGE_SIZE);
 
 	/*
@@ -1614,7 +1614,7 @@ initarm(struct arm_boot_params *abp)
 	irqstack    = pmap_preboot_get_vpages(IRQ_STACK_SIZE * MAXCPU);
 	abtstack    = pmap_preboot_get_vpages(ABT_STACK_SIZE * MAXCPU);
 	undstack    = pmap_preboot_get_vpages(UND_STACK_SIZE * MAXCPU );
-	kernelstack = pmap_preboot_get_vpages(KSTACK_PAGES * MAXCPU);
+	kernelstack = pmap_preboot_get_vpages(kstack_pages * MAXCPU);
 
 	/* Allocate message buffer. */
 	msgbufp = (void *)pmap_preboot_get_vpages(
diff --git a/sys/arm/at91/at91_machdep.c b/sys/arm/at91/at91_machdep.c
index 62edfa6..2d5dda2 100644
--- a/sys/arm/at91/at91_machdep.c
+++ b/sys/arm/at91/at91_machdep.c
@@ -512,7 +512,7 @@ initarm(struct arm_boot_params *abp)
 	valloc_pages(irqstack, IRQ_STACK_SIZE * MAXCPU);
 	valloc_pages(abtstack, ABT_STACK_SIZE * MAXCPU);
 	valloc_pages(undstack, UND_STACK_SIZE * MAXCPU);
-	valloc_pages(kernelstack, KSTACK_PAGES * MAXCPU);
+	valloc_pages(kernelstack, kstack_pages * MAXCPU);
 	valloc_pages(msgbufpv, round_page(msgbufsize) / PAGE_SIZE);
 
 	/*
@@ -553,7 +553,7 @@ initarm(struct arm_boot_params *abp)
 	pmap_map_chunk(l1pagetable, undstack.pv_va, undstack.pv_pa,
 	    UND_STACK_SIZE * PAGE_SIZE, VM_PROT_READ|VM_PROT_WRITE, PTE_CACHE);
 	pmap_map_chunk(l1pagetable, kernelstack.pv_va, kernelstack.pv_pa,
-	    KSTACK_PAGES * PAGE_SIZE, VM_PROT_READ|VM_PROT_WRITE, PTE_CACHE);
+	    kstack_pages * PAGE_SIZE, VM_PROT_READ|VM_PROT_WRITE, PTE_CACHE);
 
 	pmap_map_chunk(l1pagetable, kernel_l1pt.pv_va, kernel_l1pt.pv_pa,
 	    L1_TABLE_SIZE, VM_PROT_READ|VM_PROT_WRITE, PTE_PAGETABLE);
diff --git a/sys/arm/cavium/cns11xx/econa_machdep.c b/sys/arm/cavium/cns11xx/econa_machdep.c
index 1532cec..1591053 100644
--- a/sys/arm/cavium/cns11xx/econa_machdep.c
+++ b/sys/arm/cavium/cns11xx/econa_machdep.c
@@ -222,7 +222,7 @@ initarm(struct arm_boot_params *abp)
 	valloc_pages(irqstack, IRQ_STACK_SIZE);
 	valloc_pages(abtstack, ABT_STACK_SIZE);
 	valloc_pages(undstack, UND_STACK_SIZE);
-	valloc_pages(kernelstack, KSTACK_PAGES);
+	valloc_pages(kernelstack, kstack_pages);
 	valloc_pages(msgbufpv, round_page(msgbufsize) / PAGE_SIZE);
 
 	/*
@@ -260,7 +260,7 @@ initarm(struct arm_boot_params *abp)
 	pmap_map_chunk(l1pagetable, undstack.pv_va, undstack.pv_pa,
 	    UND_STACK_SIZE * PAGE_SIZE, VM_PROT_READ|VM_PROT_WRITE, PTE_CACHE);
 	pmap_map_chunk(l1pagetable, kernelstack.pv_va, kernelstack.pv_pa,
-	    KSTACK_PAGES * PAGE_SIZE, VM_PROT_READ|VM_PROT_WRITE, PTE_CACHE);
+	    kstack_pages * PAGE_SIZE, VM_PROT_READ|VM_PROT_WRITE, PTE_CACHE);
 
 	pmap_map_chunk(l1pagetable, kernel_l1pt.pv_va, kernel_l1pt.pv_pa,
 	    L1_TABLE_SIZE, VM_PROT_READ|VM_PROT_WRITE, PTE_PAGETABLE);
diff --git a/sys/arm/include/param.h b/sys/arm/include/param.h
index 6267154..d3aa01b 100644
--- a/sys/arm/include/param.h
+++ b/sys/arm/include/param.h
@@ -131,7 +131,7 @@
 #define KSTACK_GUARD_PAGES	1
 #endif /* !KSTACK_GUARD_PAGES */
 
-#define USPACE_SVC_STACK_TOP		(KSTACK_PAGES * PAGE_SIZE)
+#define USPACE_SVC_STACK_TOP		(kstack_pages * PAGE_SIZE)
 
 /*
  * Mach derived conversion macros
diff --git a/sys/arm/samsung/s3c2xx0/s3c24x0_machdep.c b/sys/arm/samsung/s3c2xx0/s3c24x0_machdep.c
index bdd6cc6..bd3c230 100644
--- a/sys/arm/samsung/s3c2xx0/s3c24x0_machdep.c
+++ b/sys/arm/samsung/s3c2xx0/s3c24x0_machdep.c
@@ -271,7 +271,7 @@ initarm(struct arm_boot_params *abp)
 	valloc_pages(irqstack, IRQ_STACK_SIZE);
 	valloc_pages(abtstack, ABT_STACK_SIZE);
 	valloc_pages(undstack, UND_STACK_SIZE);
-	valloc_pages(kernelstack, KSTACK_PAGES);
+	valloc_pages(kernelstack, kstack_pages);
 	valloc_pages(msgbufpv, round_page(msgbufsize) / PAGE_SIZE);
 	/*
 	 * Now we start construction of the L1 page table
@@ -307,7 +307,7 @@ initarm(struct arm_boot_params *abp)
 	pmap_map_chunk(l1pagetable, undstack.pv_va, undstack.pv_pa,
 	    UND_STACK_SIZE * PAGE_SIZE, VM_PROT_READ|VM_PROT_WRITE, PTE_CACHE);
 	pmap_map_chunk(l1pagetable, kernelstack.pv_va, kernelstack.pv_pa,
-	    KSTACK_PAGES * PAGE_SIZE, VM_PROT_READ|VM_PROT_WRITE, PTE_CACHE);
+	    kstack_pages * PAGE_SIZE, VM_PROT_READ|VM_PROT_WRITE, PTE_CACHE);
 
 	pmap_map_chunk(l1pagetable, kernel_l1pt.pv_va, kernel_l1pt.pv_pa,
 	    L1_TABLE_SIZE, VM_PROT_READ|VM_PROT_WRITE, PTE_PAGETABLE);
diff --git a/sys/arm/xscale/i80321/ep80219_machdep.c b/sys/arm/xscale/i80321/ep80219_machdep.c
index 9881371..d93ed74 100644
--- a/sys/arm/xscale/i80321/ep80219_machdep.c
+++ b/sys/arm/xscale/i80321/ep80219_machdep.c
@@ -225,7 +225,7 @@ initarm(struct arm_boot_params *abp)
 	valloc_pages(irqstack, IRQ_STACK_SIZE);
 	valloc_pages(abtstack, ABT_STACK_SIZE);
 	valloc_pages(undstack, UND_STACK_SIZE);
-	valloc_pages(kernelstack, KSTACK_PAGES);
+	valloc_pages(kernelstack, kstack_pages);
 	alloc_pages(minidataclean.pv_pa, 1);
 	valloc_pages(msgbufpv, round_page(msgbufsize) / PAGE_SIZE);
 	/*
diff --git a/sys/arm/xscale/i80321/iq31244_machdep.c b/sys/arm/xscale/i80321/iq31244_machdep.c
index 0df3609..52d94af 100644
--- a/sys/arm/xscale/i80321/iq31244_machdep.c
+++ b/sys/arm/xscale/i80321/iq31244_machdep.c
@@ -226,7 +226,7 @@ initarm(struct arm_boot_params *abp)
 	valloc_pages(irqstack, IRQ_STACK_SIZE);
 	valloc_pages(abtstack, ABT_STACK_SIZE);
 	valloc_pages(undstack, UND_STACK_SIZE);
-	valloc_pages(kernelstack, KSTACK_PAGES);
+	valloc_pages(kernelstack, kstack_pages);
 	alloc_pages(minidataclean.pv_pa, 1);
 	valloc_pages(msgbufpv, round_page(msgbufsize) / PAGE_SIZE);
 	/*
diff --git a/sys/arm/xscale/i8134x/crb_machdep.c b/sys/arm/xscale/i8134x/crb_machdep.c
index 568be9f..138ed09 100644
--- a/sys/arm/xscale/i8134x/crb_machdep.c
+++ b/sys/arm/xscale/i8134x/crb_machdep.c
@@ -225,7 +225,7 @@ initarm(struct arm_boot_params *abp)
 	valloc_pages(irqstack, IRQ_STACK_SIZE);
 	valloc_pages(abtstack, ABT_STACK_SIZE);
 	valloc_pages(undstack, UND_STACK_SIZE);
-	valloc_pages(kernelstack, KSTACK_PAGES);
+	valloc_pages(kernelstack, kstack_pages);
 	valloc_pages(msgbufpv, round_page(msgbufsize) / PAGE_SIZE);
 	/*
 	 * Now we start construction of the L1 page table
diff --git a/sys/arm/xscale/ixp425/avila_machdep.c b/sys/arm/xscale/ixp425/avila_machdep.c
index f37aa29..0d5d9bb 100644
--- a/sys/arm/xscale/ixp425/avila_machdep.c
+++ b/sys/arm/xscale/ixp425/avila_machdep.c
@@ -295,7 +295,7 @@ initarm(struct arm_boot_params *abp)
 	valloc_pages(irqstack, IRQ_STACK_SIZE);
 	valloc_pages(abtstack, ABT_STACK_SIZE);
 	valloc_pages(undstack, UND_STACK_SIZE);
-	valloc_pages(kernelstack, KSTACK_PAGES);
+	valloc_pages(kernelstack, kstack_pages);
 	alloc_pages(minidataclean.pv_pa, 1);
 	valloc_pages(msgbufpv, round_page(msgbufsize) / PAGE_SIZE);
 
diff --git a/sys/arm/xscale/pxa/pxa_machdep.c b/sys/arm/xscale/pxa/pxa_machdep.c
index 4480c95..41e49c3 100644
--- a/sys/arm/xscale/pxa/pxa_machdep.c
+++ b/sys/arm/xscale/pxa/pxa_machdep.c
@@ -206,7 +206,7 @@ initarm(struct arm_boot_params *abp)
 	valloc_pages(irqstack, IRQ_STACK_SIZE);
 	valloc_pages(abtstack, ABT_STACK_SIZE);
 	valloc_pages(undstack, UND_STACK_SIZE);
-	valloc_pages(kernelstack, KSTACK_PAGES);
+	valloc_pages(kernelstack, kstack_pages);
 	alloc_pages(minidataclean.pv_pa, 1);
 	valloc_pages(msgbufpv, round_page(msgbufsize) / PAGE_SIZE);
 	/*
diff --git a/sys/ddb/db_ps.c b/sys/ddb/db_ps.c
index 553c22e..f38c89f 100644
--- a/sys/ddb/db_ps.c
+++ b/sys/ddb/db_ps.c
@@ -462,7 +462,7 @@ db_findstack_cmd(db_expr_t addr, bool have_addr, db_expr_t dummy3 __unused,
 	for (ks_ce = kstack_cache; ks_ce != NULL;
 	     ks_ce = ks_ce->next_ks_entry) {
 		if ((vm_offset_t)ks_ce <= saddr && saddr < (vm_offset_t)ks_ce +
-		    PAGE_SIZE * KSTACK_PAGES) {
+		    PAGE_SIZE * kstack_pages) {
 			db_printf("Cached stack %p\n", ks_ce);
 			return;
 		}
diff --git a/sys/i386/i386/genassym.c b/sys/i386/i386/genassym.c
index 6a00d23..3087834 100644
--- a/sys/i386/i386/genassym.c
+++ b/sys/i386/i386/genassym.c
@@ -101,8 +101,6 @@ ASSYM(TDF_NEEDRESCHED, TDF_NEEDRESCHED);
 ASSYM(V_TRAP, offsetof(struct vmmeter, v_trap));
 ASSYM(V_SYSCALL, offsetof(struct vmmeter, v_syscall));
 ASSYM(V_INTR, offsetof(struct vmmeter, v_intr));
-/* ASSYM(UPAGES, UPAGES);*/
-ASSYM(KSTACK_PAGES, KSTACK_PAGES);
 ASSYM(TD0_KSTACK_PAGES, TD0_KSTACK_PAGES);
 ASSYM(PAGE_SIZE, PAGE_SIZE);
 ASSYM(NPTEPG, NPTEPG);
diff --git a/sys/i386/i386/mp_machdep.c b/sys/i386/i386/mp_machdep.c
index 0942523..4812cb0 100644
--- a/sys/i386/i386/mp_machdep.c
+++ b/sys/i386/i386/mp_machdep.c
@@ -348,7 +348,7 @@ start_all_aps(void)
 
 		/* allocate and set up a boot stack data page */
 		bootstacks[cpu] =
-		    (char *)kmem_malloc(kernel_arena, KSTACK_PAGES * PAGE_SIZE,
+		    (char *)kmem_malloc(kernel_arena, kstack_pages * PAGE_SIZE,
 		    M_WAITOK | M_ZERO);
 		dpcpu = (void *)kmem_malloc(kernel_arena, DPCPU_SIZE,
 		    M_WAITOK | M_ZERO);
@@ -360,7 +360,8 @@ start_all_aps(void)
 		outb(CMOS_DATA, BIOS_WARM);	/* 'warm-start' */
 #endif
 
-		bootSTK = (char *)bootstacks[cpu] + KSTACK_PAGES * PAGE_SIZE - 4;
+		bootSTK = (char *)bootstacks[cpu] + kstack_pages *
+		    PAGE_SIZE - 4;
 		bootAP = cpu;
 
 		/* attempt to start the Application Processor */
diff --git a/sys/i386/i386/sys_machdep.c b/sys/i386/i386/sys_machdep.c
index 0928b72..dc367a6 100644
--- a/sys/i386/i386/sys_machdep.c
+++ b/sys/i386/i386/sys_machdep.c
@@ -275,7 +275,7 @@ i386_extend_pcb(struct thread *td)
 	ext = (struct pcb_ext *)kmem_malloc(kernel_arena, ctob(IOPAGES+1),
 	    M_WAITOK | M_ZERO);
 	/* -16 is so we can convert a trapframe into vm86trapframe inplace */
-	ext->ext_tss.tss_esp0 = td->td_kstack + ctob(KSTACK_PAGES) -
+	ext->ext_tss.tss_esp0 = td->td_kstack + ctob(td->td_kstack_pages) -
 	    sizeof(struct pcb) - 16;
 	ext->ext_tss.tss_ss0 = GSEL(GDATA_SEL, SEL_KPL);
 	/*
diff --git a/sys/i386/include/privatespace.h b/sys/i386/include/privatespace.h
deleted file mode 100644
index 5eb54c2..0000000
--- a/sys/i386/include/privatespace.h
+++ /dev/null
@@ -1,49 +0,0 @@
-/*-
- * Copyright (c) Peter Wemm
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in the
- *    documentation and/or other materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- *
- * $FreeBSD$
- */
-
-#ifndef _MACHINE_PRIVATESPACE_H_
-#define _MACHINE_PRIVATESPACE_H_
-
-/*
- * This is the upper (0xff800000) address space layout that is per-cpu.
- * It is setup in locore.s and pmap.c for the BSP and in mp_machdep.c for
- * each AP.  This is only applicable to the x86 SMP kernel.
- */
-struct privatespace {
-	/* page 0 - data page */
-	struct	pcpu pcpu;
-	char	__filler0[PAGE_SIZE - sizeof(struct pcpu)];
-
-	/* page 1 - idle stack (KSTACK_PAGES pages) */
-	char	idlekstack[KSTACK_PAGES * PAGE_SIZE];
-	/* page 1+KSTACK_PAGES... */
-};
-
-extern struct privatespace SMP_prvspace[];
-
-#endif	/* ! _MACHINE_PRIVATESPACE_H_ */
diff --git a/sys/kern/kern_fork.c b/sys/kern/kern_fork.c
index 85dbd94..4aa5314 100644
--- a/sys/kern/kern_fork.c
+++ b/sys/kern/kern_fork.c
@@ -832,7 +832,7 @@ fork1(struct thread *td, int flags, int pages, struct proc **procp,
 	mem_charged = 0;
 	vm2 = NULL;
 	if (pages == 0)
-		pages = KSTACK_PAGES;
+		pages = kstack_pages;
 	/* Allocate new proc. */
 	newproc = uma_zalloc(proc_zone, M_WAITOK);
 	td2 = FIRST_THREAD_IN_PROC(newproc);
diff --git a/sys/kern/subr_param.c b/sys/kern/subr_param.c
index 5043a57..36608d1 100644
--- a/sys/kern/subr_param.c
+++ b/sys/kern/subr_param.c
@@ -159,6 +159,9 @@ void
 init_param1(void)
 {
 
+#if !defined(__mips__) && !defined(__arm64__) && !defined(__sparc64__)
+	TUNABLE_INT_FETCH("kern.kstack_pages", &kstack_pages);
+#endif
 	hz = -1;
 	TUNABLE_INT_FETCH("kern.hz", &hz);
 	if (hz == -1)
diff --git a/sys/pc98/include/privatespace.h b/sys/pc98/include/privatespace.h
deleted file mode 100644
index 5db57c3..0000000
--- a/sys/pc98/include/privatespace.h
+++ /dev/null
@@ -1,6 +0,0 @@
-/*-
- * This file is in the public domain.
- */
-/* $FreeBSD$ */
-
-#include <i386/privatespace.h>
diff --git a/sys/powerpc/aim/mmu_oea.c b/sys/powerpc/aim/mmu_oea.c
index 4734738..d45b34e 100644
--- a/sys/powerpc/aim/mmu_oea.c
+++ b/sys/powerpc/aim/mmu_oea.c
@@ -932,13 +932,13 @@ moea_bootstrap(mmu_t mmup, vm_offset_t kernelstart, vm_offset_t kernelend)
 	 * Allocate a kernel stack with a guard page for thread0 and map it
 	 * into the kernel page map.
 	 */
-	pa = moea_bootstrap_alloc(KSTACK_PAGES * PAGE_SIZE, PAGE_SIZE);
+	pa = moea_bootstrap_alloc(kstack_pages * PAGE_SIZE, PAGE_SIZE);
 	va = virtual_avail + KSTACK_GUARD_PAGES * PAGE_SIZE;
-	virtual_avail = va + KSTACK_PAGES * PAGE_SIZE;
+	virtual_avail = va + kstack_pages * PAGE_SIZE;
 	CTR2(KTR_PMAP, "moea_bootstrap: kstack0 at %#x (%#x)", pa, va);
 	thread0.td_kstack = va;
-	thread0.td_kstack_pages = KSTACK_PAGES;
-	for (i = 0; i < KSTACK_PAGES; i++) {
+	thread0.td_kstack_pages = kstack_pages;
+	for (i = 0; i < kstack_pages; i++) {
 		moea_kenter(mmup, va, pa);
 		pa += PAGE_SIZE;
 		va += PAGE_SIZE;
diff --git a/sys/powerpc/aim/mmu_oea64.c b/sys/powerpc/aim/mmu_oea64.c
index 44caec6..3766d86 100644
--- a/sys/powerpc/aim/mmu_oea64.c
+++ b/sys/powerpc/aim/mmu_oea64.c
@@ -917,13 +917,13 @@ moea64_late_bootstrap(mmu_t mmup, vm_offset_t kernelstart, vm_offset_t kernelend
 	 * Allocate a kernel stack with a guard page for thread0 and map it
 	 * into the kernel page map.
 	 */
-	pa = moea64_bootstrap_alloc(KSTACK_PAGES * PAGE_SIZE, PAGE_SIZE);
+	pa = moea64_bootstrap_alloc(kstack_pages * PAGE_SIZE, PAGE_SIZE);
 	va = virtual_avail + KSTACK_GUARD_PAGES * PAGE_SIZE;
-	virtual_avail = va + KSTACK_PAGES * PAGE_SIZE;
+	virtual_avail = va + kstack_pages * PAGE_SIZE;
 	CTR2(KTR_PMAP, "moea64_bootstrap: kstack0 at %#x (%#x)", pa, va);
 	thread0.td_kstack = va;
-	thread0.td_kstack_pages = KSTACK_PAGES;
-	for (i = 0; i < KSTACK_PAGES; i++) {
+	thread0.td_kstack_pages = kstack_pages;
+	for (i = 0; i < kstack_pages; i++) {
 		moea64_kenter(mmup, va, pa);
 		pa += PAGE_SIZE;
 		va += PAGE_SIZE;
diff --git a/sys/powerpc/booke/pmap.c b/sys/powerpc/booke/pmap.c
index 275ae8d..223500c 100644
--- a/sys/powerpc/booke/pmap.c
+++ b/sys/powerpc/booke/pmap.c
@@ -1207,7 +1207,7 @@ mmu_booke_bootstrap(mmu_t mmu, vm_offset_t start, vm_offset_t kernelend)
 	/* Steal physical memory for kernel stack from the end */
 	/* of the first avail region                           */
 	/*******************************************************/
-	kstack0_sz = KSTACK_PAGES * PAGE_SIZE;
+	kstack0_sz = kstack_pages * PAGE_SIZE;
 	kstack0_phys = availmem_regions[0].mr_start +
 	    availmem_regions[0].mr_size;
 	kstack0_phys -= kstack0_sz;
@@ -1312,7 +1312,7 @@ mmu_booke_bootstrap(mmu_t mmu, vm_offset_t start, vm_offset_t kernelend)
 	/* Enter kstack0 into kernel map, provide guard page */
 	kstack0 = virtual_avail + KSTACK_GUARD_PAGES * PAGE_SIZE;
 	thread0.td_kstack = kstack0;
-	thread0.td_kstack_pages = KSTACK_PAGES;
+	thread0.td_kstack_pages = kstack_pages;
 
 	debugf("kstack_sz = 0x%08x\n", kstack0_sz);
 	debugf("kstack0_phys at 0x%08x - 0x%08x\n",
@@ -1320,7 +1320,7 @@ mmu_booke_bootstrap(mmu_t mmu, vm_offset_t start, vm_offset_t kernelend)
 	debugf("kstack0 at 0x%08x - 0x%08x\n", kstack0, kstack0 + kstack0_sz);
 	
 	virtual_avail += KSTACK_GUARD_PAGES * PAGE_SIZE + kstack0_sz;
-	for (i = 0; i < KSTACK_PAGES; i++) {
+	for (i = 0; i < kstack_pages; i++) {
 		mmu_booke_kenter(mmu, kstack0, kstack0_phys);
 		kstack0 += PAGE_SIZE;
 		kstack0_phys += PAGE_SIZE;
diff --git a/sys/powerpc/include/param.h b/sys/powerpc/include/param.h
index 5c25e8a..4780a68 100644
--- a/sys/powerpc/include/param.h
+++ b/sys/powerpc/include/param.h
@@ -111,7 +111,7 @@
 #endif
 #endif
 #define	KSTACK_GUARD_PAGES	1	/* pages of kstack guard; 0 disables */
-#define	USPACE		(KSTACK_PAGES * PAGE_SIZE)	/* total size of pcb */
+#define	USPACE		(kstack_pages * PAGE_SIZE)	/* total size of pcb */
 
 /*
  * Mach derived conversion macros
diff --git a/sys/vm/vm_glue.c b/sys/vm/vm_glue.c
index 1ff17c2..92ee794 100644
--- a/sys/vm/vm_glue.c
+++ b/sys/vm/vm_glue.c
@@ -327,11 +327,11 @@ vm_thread_new(struct thread *td, int pages)
 
 	/* Bounds check */
 	if (pages <= 1)
-		pages = KSTACK_PAGES;
+		pages = kstack_pages;
 	else if (pages > KSTACK_MAX_PAGES)
 		pages = KSTACK_MAX_PAGES;
 
-	if (pages == KSTACK_PAGES) {
+	if (pages == kstack_pages) {
 		mtx_lock(&kstack_cache_mtx);
 		if (kstack_cache != NULL) {
 			ks_ce = kstack_cache;
@@ -340,7 +340,7 @@ vm_thread_new(struct thread *td, int pages)
 
 			td->td_kstack_obj = ks_ce->ksobj;
 			td->td_kstack = (vm_offset_t)ks_ce;
-			td->td_kstack_pages = KSTACK_PAGES;
+			td->td_kstack_pages = kstack_pages;
 			return (1);
 		}
 		mtx_unlock(&kstack_cache_mtx);
@@ -444,7 +444,7 @@ vm_thread_dispose(struct thread *td)
 	ks = td->td_kstack;
 	td->td_kstack = 0;
 	td->td_kstack_pages = 0;
-	if (pages == KSTACK_PAGES && kstacks <= kstack_cache_size) {
+	if (pages == kstack_pages && kstacks <= kstack_cache_size) {
 		ks_ce = (struct kstack_cache_entry *)ks;
 		ks_ce->ksobj = ksobj;
 		mtx_lock(&kstack_cache_mtx);
@@ -471,7 +471,7 @@ vm_thread_stack_lowmem(void *nulll)
 		ks_ce = ks_ce->next_ks_entry;
 
 		vm_thread_stack_dispose(ks_ce1->ksobj, (vm_offset_t)ks_ce1,
-		    KSTACK_PAGES);
+		    kstack_pages);
 	}
 }
 
diff --git a/sys/x86/xen/pv.c b/sys/x86/xen/pv.c
index 6b913fb..50d2e76 100644
--- a/sys/x86/xen/pv.c
+++ b/sys/x86/xen/pv.c
@@ -215,7 +215,7 @@ start_xen_ap(int cpu)
 {
 	struct vcpu_guest_context *ctxt;
 	int ms, cpus = mp_naps;
-	const size_t stacksize = KSTACK_PAGES * PAGE_SIZE;
+	const size_t stacksize = kstack_pages * PAGE_SIZE;
 
 	/* allocate and set up an idle stack data page */
 	bootstacks[cpu] =
@@ -227,7 +227,7 @@ start_xen_ap(int cpu)
 	dpcpu =
 	    (void *)kmem_malloc(kernel_arena, DPCPU_SIZE, M_WAITOK | M_ZERO);
 
-	bootSTK = (char *)bootstacks[cpu] + KSTACK_PAGES * PAGE_SIZE - 8;
+	bootSTK = (char *)bootstacks[cpu] + kstack_pages * PAGE_SIZE - 8;
 	bootAP = cpu;
 
 	ctxt = malloc(sizeof(*ctxt), M_TEMP, M_WAITOK | M_ZERO);

From owner-freebsd-arch@freebsd.org  Fri Aug  7 13:38:48 2015
Return-Path: <owner-freebsd-arch@freebsd.org>
Delivered-To: freebsd-arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9957A9B298D
 for <freebsd-arch@mailman.ysv.freebsd.org>;
 Fri,  7 Aug 2015 13:38:48 +0000 (UTC)
 (envelope-from glebius@FreeBSD.org)
Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org
 [IPv6:2001:1900:2254:206a::50:5])
 by mx1.freebsd.org (Postfix) with ESMTP id 7D9AFD6D
 for <freebsd-arch@freebsd.org>; Fri,  7 Aug 2015 13:38:48 +0000 (UTC)
 (envelope-from glebius@FreeBSD.org)
Received: by mailman.ysv.freebsd.org (Postfix)
 id 7CDFC9B298C; Fri,  7 Aug 2015 13:38:48 +0000 (UTC)
Delivered-To: arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 7C47D9B298B
 for <arch@mailman.ysv.freebsd.org>; Fri,  7 Aug 2015 13:38:48 +0000 (UTC)
 (envelope-from glebius@FreeBSD.org)
Received: from cell.glebius.int.ru (glebius.int.ru [81.19.69.10])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "cell.glebius.int.ru", Issuer "cell.glebius.int.ru" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id B2F45D6C
 for <arch@FreeBSD.org>; Fri,  7 Aug 2015 13:38:47 +0000 (UTC)
 (envelope-from glebius@FreeBSD.org)
Received: from cell.glebius.int.ru (localhost [127.0.0.1])
 by cell.glebius.int.ru (8.15.2/8.15.2) with ESMTPS id t77DciJ6055050
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO)
 for <arch@FreeBSD.org>; Fri, 7 Aug 2015 16:38:44 +0300 (MSK)
 (envelope-from glebius@FreeBSD.org)
Received: (from glebius@localhost)
 by cell.glebius.int.ru (8.15.2/8.15.2/Submit) id t77DciOe055049
 for arch@FreeBSD.org; Fri, 7 Aug 2015 16:38:44 +0300 (MSK)
 (envelope-from glebius@FreeBSD.org)
X-Authentication-Warning: cell.glebius.int.ru: glebius set sender to
 glebius@FreeBSD.org using -f
Date: Fri, 7 Aug 2015 16:38:44 +0300
From: Gleb Smirnoff <glebius@FreeBSD.org>
To: arch@FreeBSD.org
Subject: Re: more strict KPI for vm_pager_get_pages()
Message-ID: <20150807133844.GS889@FreeBSD.org>
References: <20150430142408.GS546@nginx.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="sLx0z+5FKKtIVDwd"
Content-Disposition: inline
In-Reply-To: <20150430142408.GS546@nginx.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 07 Aug 2015 13:38:48 -0000


--sLx0z+5FKKtIVDwd
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

  Hi!

  This is followup on older email:

https://lists.freebsd.org/pipermail/freebsd-arch/2015-April/017154.html

The preparatory commits were already checked in, and I'm going
to push next week the rest, since the whole story is already
several months due.

Planned changes:

o vm_pager_get_pages() accepts array of pages, and treats all pages
  equally. Notion of reqpage goes away.
o The array span validity must be checked before with vm_pager_has_page().
o All pages must be xbusied on enter.
o All pages will be left xbusied on exit. This closes possible races,
  allows to pass in wired pages (for any pager). And it leaves
  the caller to decide what to do with pages: vm_page_active,
  vm_page_deactivate, vm_page_flash or just vm_page_free them.

The patch has been tested by me and pho@ with his stress2 test.

I know, that there are two comments from kib@ on the patch.

1) There could be a user of KPI who would be fine with partial success.

My answer: right now there is none, and if one emerges, the code can
be easily adopted to return VM_PAGER_ERROR, but still mark validated
pages as valid. The user of KPI then can scan the array and take valid
pages. So, the patch doesn't put any obstacles on appearance of such
user.

2) Filesystems can do short reads by design, and thus fail to validate
   the entire array.

My answer: yes, that's true. By design NFS, SMBFS and FUSE should be
able to return short reads. However, the VOP_GETPAGES methods for all
three FSes right now do not have any code that would support that. So,
it looks like there is an open issue with these filesystems, not related
to my patch. When this issue is addressed in any of aforementioned FSes,
the VOP_GETPAGES should be fixed to do several I/Os in case of short
reads.

-- 
Totus tuus, Glebius.

--sLx0z+5FKKtIVDwd
Content-Type: text/x-diff; charset=us-ascii
Content-Disposition: attachment; filename="vm_pager_get_pages.diff"

Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c	(revision 286413)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c	(working copy)
@@ -5712,12 +5712,12 @@ ioflags(int ioflags)
 }
 
 static int
-zfs_getpages(struct vnode *vp, vm_page_t *m, int count, int reqpage)
+zfs_getpages(struct vnode *vp, vm_page_t *m, int count)
 {
 	znode_t *zp = VTOZ(vp);
 	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
 	objset_t *os = zp->z_zfsvfs->z_os;
-	vm_page_t mfirst, mlast, mreq;
+	vm_page_t mlast;
 	vm_object_t object;
 	caddr_t va;
 	struct sf_buf *sf;
@@ -5730,78 +5730,37 @@ static int
 	ZFS_VERIFY_ZP(zp);
 
 	pcount = OFF_TO_IDX(round_page(count));
-	mreq = m[reqpage];
-	object = mreq->object;
-	error = 0;
 
-	if (pcount > 1 && zp->z_blksz > PAGESIZE) {
-		startoff = rounddown(IDX_TO_OFF(mreq->pindex), zp->z_blksz);
-		reqstart = OFF_TO_IDX(round_page(startoff));
-		if (reqstart < m[0]->pindex)
-			reqstart = 0;
-		else
-			reqstart = reqstart - m[0]->pindex;
-		endoff = roundup(IDX_TO_OFF(mreq->pindex) + PAGE_SIZE,
-		    zp->z_blksz);
-		reqend = OFF_TO_IDX(trunc_page(endoff)) - 1;
-		if (reqend > m[pcount - 1]->pindex)
-			reqend = m[pcount - 1]->pindex;
-		reqsize = reqend - m[reqstart]->pindex + 1;
-		KASSERT(reqstart <= reqpage && reqpage < reqstart + reqsize,
-		    ("reqpage beyond [reqstart, reqstart + reqsize[ bounds"));
-	} else {
-		reqstart = reqpage;
-		reqsize = 1;
-	}
-	mfirst = m[reqstart];
-	mlast = m[reqstart + reqsize - 1];
-
 	zfs_vmobject_wlock(object);
-
-	for (i = 0; i < reqstart; i++) {
-		vm_page_lock(m[i]);
-		vm_page_free(m[i]);
-		vm_page_unlock(m[i]);
-	}
-	for (i = reqstart + reqsize; i < pcount; i++) {
-		vm_page_lock(m[i]);
-		vm_page_free(m[i]);
-		vm_page_unlock(m[i]);
-	}
-
-	if (mreq->valid && reqsize == 1) {
-		if (mreq->valid != VM_PAGE_BITS_ALL)
-			vm_page_zero_invalid(mreq, TRUE);
+	if (m[pcount - 1]->valid != 0 && --pcount == 0) {
 		zfs_vmobject_wunlock(object);
 		ZFS_EXIT(zfsvfs);
 		return (zfs_vm_pagerret_ok);
 	}
 
-	PCPU_INC(cnt.v_vnodein);
-	PCPU_ADD(cnt.v_vnodepgsin, reqsize);
+	object = m[0]->object;
+	mlast = m[pcount - 1];
 
-	if (IDX_TO_OFF(mreq->pindex) >= object->un_pager.vnp.vnp_size) {
-		for (i = reqstart; i < reqstart + reqsize; i++) {
-			if (i != reqpage) {
-				vm_page_lock(m[i]);
-				vm_page_free(m[i]);
-				vm_page_unlock(m[i]);
-			}
-		}
+	if (IDX_TO_OFF(mlast->pindex) >=
+	    object->un_pager.vnp.vnp_size) {
 		zfs_vmobject_wunlock(object);
 		ZFS_EXIT(zfsvfs);
 		return (zfs_vm_pagerret_bad);
 	}
 
+	PCPU_INC(cnt.v_vnodein);
+	PCPU_ADD(cnt.v_vnodepgsin, reqsize);
+
 	lsize = PAGE_SIZE;
 	if (IDX_TO_OFF(mlast->pindex) + lsize > object->un_pager.vnp.vnp_size)
-		lsize = object->un_pager.vnp.vnp_size - IDX_TO_OFF(mlast->pindex);
-
+		lsize = object->un_pager.vnp.vnp_size -
+		    IDX_TO_OFF(mlast->pindex);
 	zfs_vmobject_wunlock(object);
 
-	for (i = reqstart; i < reqstart + reqsize; i++) {
+	error = 0;
+	for (i = 0; i < pcount; i++) {
 		size = PAGE_SIZE;
-		if (i == (reqstart + reqsize - 1))
+		if (i == pcount - 1)
 			size = lsize;
 		va = zfs_map_page(m[i], &sf);
 		error = dmu_read(os, zp->z_id, IDX_TO_OFF(m[i]->pindex),
@@ -5810,21 +5769,15 @@ static int
 			bzero(va + size, PAGE_SIZE - size);
 		zfs_unmap_page(sf);
 		if (error != 0)
-			break;
+			goto out;
 	}
 
 	zfs_vmobject_wlock(object);
-
-	for (i = reqstart; i < reqstart + reqsize; i++) {
-		if (!error)
-			m[i]->valid = VM_PAGE_BITS_ALL;
-		KASSERT(m[i]->dirty == 0, ("zfs_getpages: page %p is dirty", m[i]));
-		if (i != reqpage)
-			vm_page_readahead_finish(m[i]);
-	}
-
+	for (i = 0; i < pcount; i++)
+		m[i]->valid = VM_PAGE_BITS_ALL;
 	zfs_vmobject_wunlock(object);
 
+out:
 	ZFS_ACCESSTIME_STAMP(zfsvfs, zp);
 	ZFS_EXIT(zfsvfs);
 	return (error ? zfs_vm_pagerret_error : zfs_vm_pagerret_ok);
@@ -5840,7 +5793,7 @@ zfs_freebsd_getpages(ap)
 	} */ *ap;
 {
 
-	return (zfs_getpages(ap->a_vp, ap->a_m, ap->a_count, ap->a_reqpage));
+	return (zfs_getpages(ap->a_vp, ap->a_m, ap->a_count));
 }
 
 static int
Index: sys/dev/drm2/i915/i915_gem.c
===================================================================
--- sys/dev/drm2/i915/i915_gem.c	(revision 286413)
+++ sys/dev/drm2/i915/i915_gem.c	(working copy)
@@ -4326,7 +4326,7 @@ i915_gem_wire_page(vm_object_t object, vm_pindex_t
 	page = vm_page_grab(object, pindex, VM_ALLOC_NORMAL);
 	if (page->valid != VM_PAGE_BITS_ALL) {
 		if (vm_pager_has_page(object, pindex, NULL, NULL)) {
-			rv = vm_pager_get_pages(object, &page, 1, 0);
+			rv = vm_pager_get_pages(object, &page, 1);
 			if (rv != VM_PAGER_OK) {
 				vm_page_lock(page);
 				vm_page_free(page);
Index: sys/dev/drm2/ttm/ttm_tt.c
===================================================================
--- sys/dev/drm2/ttm/ttm_tt.c	(revision 286413)
+++ sys/dev/drm2/ttm/ttm_tt.c	(working copy)
@@ -291,7 +291,7 @@ int ttm_tt_swapin(struct ttm_tt *ttm)
 		from_page = vm_page_grab(obj, i, VM_ALLOC_NORMAL);
 		if (from_page->valid != VM_PAGE_BITS_ALL) {
 			if (vm_pager_has_page(obj, i, NULL, NULL)) {
-				rv = vm_pager_get_pages(obj, &from_page, 1, 0);
+				rv = vm_pager_get_pages(obj, &from_page, 1);
 				if (rv != VM_PAGER_OK) {
 					vm_page_lock(from_page);
 					vm_page_free(from_page);
Index: sys/dev/md/md.c
===================================================================
--- sys/dev/md/md.c	(revision 286413)
+++ sys/dev/md/md.c	(working copy)
@@ -835,7 +835,7 @@ mdstart_swap(struct md_s *sc, struct bio *bp)
 			if (m->valid == VM_PAGE_BITS_ALL)
 				rv = VM_PAGER_OK;
 			else
-				rv = vm_pager_get_pages(sc->object, &m, 1, 0);
+				rv = vm_pager_get_pages(sc->object, &m, 1);
 			if (rv == VM_PAGER_ERROR) {
 				vm_page_xunbusy(m);
 				break;
@@ -858,7 +858,7 @@ mdstart_swap(struct md_s *sc, struct bio *bp)
 			}
 		} else if (bp->bio_cmd == BIO_WRITE) {
 			if (len != PAGE_SIZE && m->valid != VM_PAGE_BITS_ALL)
-				rv = vm_pager_get_pages(sc->object, &m, 1, 0);
+				rv = vm_pager_get_pages(sc->object, &m, 1);
 			else
 				rv = VM_PAGER_OK;
 			if (rv == VM_PAGER_ERROR) {
@@ -874,7 +874,7 @@ mdstart_swap(struct md_s *sc, struct bio *bp)
 			m->valid = VM_PAGE_BITS_ALL;
 		} else if (bp->bio_cmd == BIO_DELETE) {
 			if (len != PAGE_SIZE && m->valid != VM_PAGE_BITS_ALL)
-				rv = vm_pager_get_pages(sc->object, &m, 1, 0);
+				rv = vm_pager_get_pages(sc->object, &m, 1);
 			else
 				rv = VM_PAGER_OK;
 			if (rv == VM_PAGER_ERROR) {
Index: sys/fs/fuse/fuse_vnops.c
===================================================================
--- sys/fs/fuse/fuse_vnops.c	(revision 286413)
+++ sys/fs/fuse/fuse_vnops.c	(working copy)
@@ -1761,26 +1761,21 @@ fuse_vnop_getpages(struct vop_getpages_args *ap)
 	npages = btoc(count);
 
 	/*
-	 * If the requested page is partially valid, just return it and
-	 * allow the pager to zero-out the blanks.  Partially valid pages
-	 * can only occur at the file EOF.
+	 * If the last page is partially valid, just return it and allow
+	 * the pager to zero-out the blanks.  Partially valid pages can
+	 * only occur at the file EOF.
+	 *
+	 * XXXGL: is that true for FUSE, which is a local filesystem,
+	 * but still somewhat disconnected from the kernel?
 	 */
-
 	VM_OBJECT_WLOCK(vp->v_object);
-	fuse_vm_page_lock_queues();
-	if (pages[ap->a_reqpage]->valid != 0) {
-		for (i = 0; i < npages; ++i) {
-			if (i != ap->a_reqpage) {
-				fuse_vm_page_lock(pages[i]);
-				vm_page_free(pages[i]);
-				fuse_vm_page_unlock(pages[i]);
-			}
+	if (pages[npages - 1]->valid != 0) {
+		if (--npages == 0) {
+			VM_OBJECT_WUNLOCK(vp->v_object);
+			return (VM_PAGER_OK);
 		}
-		fuse_vm_page_unlock_queues();
-		VM_OBJECT_WUNLOCK(vp->v_object);
-		return 0;
-	}
-	fuse_vm_page_unlock_queues();
+		count = npages << PAGE_SHIFT;
+        }
 	VM_OBJECT_WUNLOCK(vp->v_object);
 
 	/*
@@ -1811,17 +1806,6 @@ fuse_vnop_getpages(struct vop_getpages_args *ap)
 
 	if (error && (uio.uio_resid == count)) {
 		FS_DEBUG("error %d\n", error);
-		VM_OBJECT_WLOCK(vp->v_object);
-		fuse_vm_page_lock_queues();
-		for (i = 0; i < npages; ++i) {
-			if (i != ap->a_reqpage) {
-				fuse_vm_page_lock(pages[i]);
-				vm_page_free(pages[i]);
-				fuse_vm_page_unlock(pages[i]);
-			}
-		}
-		fuse_vm_page_unlock_queues();
-		VM_OBJECT_WUNLOCK(vp->v_object);
 		return VM_PAGER_ERROR;
 	}
 	/*
@@ -1862,8 +1846,6 @@ fuse_vnop_getpages(struct vop_getpages_args *ap)
 			 */
 			;
 		}
-		if (i != ap->a_reqpage)
-			vm_page_readahead_finish(m);
 	}
 	fuse_vm_page_unlock_queues();
 	VM_OBJECT_WUNLOCK(vp->v_object);
Index: sys/fs/nfsclient/nfs_clbio.c
===================================================================
--- sys/fs/nfsclient/nfs_clbio.c	(revision 286413)
+++ sys/fs/nfsclient/nfs_clbio.c	(working copy)
@@ -132,12 +132,18 @@ ncl_getpages(struct vop_getpages_args *ap)
 	 * If the requested page is partially valid, just return it and
 	 * allow the pager to zero-out the blanks.  Partially valid pages
 	 * can only occur at the file EOF.
+	 *
+	 * XXXGL: is that true for NFS, where short read can occur???
 	 */
-	if (pages[ap->a_reqpage]->valid != 0) {
-		vm_pager_free_nonreq(object, pages, ap->a_reqpage, npages,
-		    FALSE);
-		return (VM_PAGER_OK);
+	VM_OBJECT_WLOCK(object);
+	if (pages[npages - 1]->valid != 0) {
+		if (--npages == 0) {
+			VM_OBJECT_WUNLOCK(object);
+			return (VM_PAGER_OK);
+		}
+		count = npages << PAGE_SHIFT;
 	}
+	VM_OBJECT_WUNLOCK(object);
 
 	/*
 	 * We use only the kva address for the buffer, but this is extremely
@@ -167,8 +173,6 @@ ncl_getpages(struct vop_getpages_args *ap)
 
 	if (error && (uio.uio_resid == count)) {
 		ncl_printf("nfs_getpages: error %d\n", error);
-		vm_pager_free_nonreq(object, pages, ap->a_reqpage, npages,
-		    FALSE);
 		return (VM_PAGER_ERROR);
 	}
 
@@ -212,8 +216,6 @@ ncl_getpages(struct vop_getpages_args *ap)
 			 */
 			;
 		}
-		if (i != ap->a_reqpage)
-			vm_page_readahead_finish(m);
 	}
 	VM_OBJECT_WUNLOCK(object);
 	return (0);
Index: sys/fs/smbfs/smbfs_io.c
===================================================================
--- sys/fs/smbfs/smbfs_io.c	(revision 286413)
+++ sys/fs/smbfs/smbfs_io.c	(working copy)
@@ -424,7 +424,7 @@ smbfs_getpages(ap)
 #ifdef SMBFS_RWGENERIC
 	return vop_stdgetpages(ap);
 #else
-	int i, error, nextoff, size, toff, npages, count, reqpage;
+	int i, error, nextoff, size, toff, npages, count;
 	struct uio uio;
 	struct iovec iov;
 	vm_offset_t kva;
@@ -436,7 +436,7 @@ smbfs_getpages(ap)
 	struct smbnode *np;
 	struct smb_cred *scred;
 	vm_object_t object;
-	vm_page_t *pages, m;
+	vm_page_t *pages;
 
 	vp = ap->a_vp;
 	if ((object = vp->v_object) == NULL) {
@@ -451,26 +451,21 @@ smbfs_getpages(ap)
 	pages = ap->a_m;
 	count = ap->a_count;
 	npages = btoc(count);
-	reqpage = ap->a_reqpage;
 
 	/*
 	 * If the requested page is partially valid, just return it and
 	 * allow the pager to zero-out the blanks.  Partially valid pages
 	 * can only occur at the file EOF.
+	 *
+	 * XXXGL: is that true for SMB filesystem?
 	 */
-	m = pages[reqpage];
-
 	VM_OBJECT_WLOCK(object);
-	if (m->valid != 0) {
-		for (i = 0; i < npages; ++i) {
-			if (i != reqpage) {
-				vm_page_lock(pages[i]);
-				vm_page_free(pages[i]);
-				vm_page_unlock(pages[i]);
-			}
+	if (pages[npages - 1]->valid != 0) {
+		if (--npages == 0) {
+			VM_OBJECT_WUNLOCK(object);
+			return (VM_PAGER_OK);
 		}
-		VM_OBJECT_WUNLOCK(object);
-		return 0;
+		count = npages << PAGE_SHIFT;
 	}
 	VM_OBJECT_WUNLOCK(object);
 
@@ -500,22 +495,14 @@ smbfs_getpages(ap)
 
 	relpbuf(bp, &smbfs_pbuf_freecnt);
 
-	VM_OBJECT_WLOCK(object);
 	if (error && (uio.uio_resid == count)) {
 		printf("smbfs_getpages: error %d\n",error);
-		for (i = 0; i < npages; i++) {
-			if (reqpage != i) {
-				vm_page_lock(pages[i]);
-				vm_page_free(pages[i]);
-				vm_page_unlock(pages[i]);
-			}
-		}
-		VM_OBJECT_WUNLOCK(object);
 		return VM_PAGER_ERROR;
 	}
 
 	size = count - uio.uio_resid;
 
+	VM_OBJECT_WLOCK(object);
 	for (i = 0, toff = 0; i < npages; i++, toff = nextoff) {
 		vm_page_t m;
 		nextoff = toff + PAGE_SIZE;
@@ -544,9 +531,6 @@ smbfs_getpages(ap)
 			 */
 			;
 		}
-
-		if (i != reqpage)
-			vm_page_readahead_finish(m);
 	}
 	VM_OBJECT_WUNLOCK(object);
 	return 0;
Index: sys/fs/tmpfs/tmpfs_subr.c
===================================================================
--- sys/fs/tmpfs/tmpfs_subr.c	(revision 286413)
+++ sys/fs/tmpfs/tmpfs_subr.c	(working copy)
@@ -1370,7 +1370,7 @@ retry:
 					VM_OBJECT_WLOCK(uobj);
 					goto retry;
 				} else if (m->valid != VM_PAGE_BITS_ALL)
-					rv = vm_pager_get_pages(uobj, &m, 1, 0);
+					rv = vm_pager_get_pages(uobj, &m, 1);
 				else
 					/* A cached page was reactivated. */
 					rv = VM_PAGER_OK;
Index: sys/kern/kern_exec.c
===================================================================
--- sys/kern/kern_exec.c	(revision 286413)
+++ sys/kern/kern_exec.c	(working copy)
@@ -940,8 +940,7 @@ int
 exec_map_first_page(imgp)
 	struct image_params *imgp;
 {
-	int rv, i;
-	int initial_pagein;
+	int rv, i, after, initial_pagein;
 	vm_page_t ma[VM_INITIAL_PAGEIN];
 	vm_object_t object;
 
@@ -957,9 +956,18 @@ exec_map_first_page(imgp)
 #endif
 	ma[0] = vm_page_grab(object, 0, VM_ALLOC_NORMAL);
 	if (ma[0]->valid != VM_PAGE_BITS_ALL) {
-		initial_pagein = VM_INITIAL_PAGEIN;
-		if (initial_pagein > object->size)
-			initial_pagein = object->size;
+		if (!vm_pager_has_page(object, 0, NULL, &after)) {
+			vm_page_xunbusy(ma[0]);
+			vm_page_lock(ma[0]);
+			vm_page_free(ma[0]);
+			vm_page_unlock(ma[0]);
+			VM_OBJECT_WUNLOCK(object);
+			return (EIO);
+		}
+		initial_pagein = min(after, VM_INITIAL_PAGEIN);
+		KASSERT(initial_pagein <= object->size,
+		    ("%s: initial_pagein %d object->size %ju",
+		    __func__, initial_pagein, (uintmax_t )object->size));
 		for (i = 1; i < initial_pagein; i++) {
 			if ((ma[i] = vm_page_next(ma[i - 1])) != NULL) {
 				if (ma[i]->valid)
@@ -974,16 +982,21 @@ exec_map_first_page(imgp)
 			}
 		}
 		initial_pagein = i;
-		rv = vm_pager_get_pages(object, ma, initial_pagein, 0);
+		rv = vm_pager_get_pages(object, ma, initial_pagein);
 		if (rv != VM_PAGER_OK) {
-			vm_page_lock(ma[0]);
-			vm_page_free(ma[0]);
-			vm_page_unlock(ma[0]);
+			for (i = 0; i < initial_pagein; i++) {
+				vm_page_xunbusy(ma[i]);
+				vm_page_lock(ma[i]);
+				vm_page_free(ma[i]);
+				vm_page_unlock(ma[i]);
+			}
 			VM_OBJECT_WUNLOCK(object);
 			return (EIO);
 		}
-	}
-	vm_page_xunbusy(ma[0]);
+	} else
+		initial_pagein = 1;
+	for (i = 0; i < initial_pagein; i++)
+		vm_page_xunbusy(ma[i]);
 	vm_page_lock(ma[0]);
 	vm_page_hold(ma[0]);
 	vm_page_activate(ma[0]);
Index: sys/kern/uipc_shm.c
===================================================================
--- sys/kern/uipc_shm.c	(revision 286413)
+++ sys/kern/uipc_shm.c	(working copy)
@@ -189,7 +189,7 @@ uiomove_object_page(vm_object_t obj, size_t len, s
 	m = vm_page_grab(obj, idx, VM_ALLOC_NORMAL);
 	if (m->valid != VM_PAGE_BITS_ALL) {
 		if (vm_pager_has_page(obj, idx, NULL, NULL)) {
-			rv = vm_pager_get_pages(obj, &m, 1, 0);
+			rv = vm_pager_get_pages(obj, &m, 1);
 			if (rv != VM_PAGER_OK) {
 				printf(
 	    "uiomove_object: vm_obj %p idx %jd valid %x pager error %d\n",
@@ -459,8 +459,7 @@ retry:
 					VM_OBJECT_WLOCK(object);
 					goto retry;
 				} else if (m->valid != VM_PAGE_BITS_ALL)
-					rv = vm_pager_get_pages(object, &m, 1,
-					    0);
+					rv = vm_pager_get_pages(object, &m, 1);
 				else
 					/* A cached page was reactivated. */
 					rv = VM_PAGER_OK;
Index: sys/kern/uipc_syscalls.c
===================================================================
--- sys/kern/uipc_syscalls.c	(revision 286413)
+++ sys/kern/uipc_syscalls.c	(working copy)
@@ -2033,7 +2033,7 @@ sendfile_readpage(vm_object_t obj, struct vnode *v
 		VM_OBJECT_WLOCK(obj);
 	} else {
 		if (vm_pager_has_page(obj, pindex, NULL, NULL)) {
-			rv = vm_pager_get_pages(obj, &m, 1, 0);
+			rv = vm_pager_get_pages(obj, &m, 1);
 			SFSTAT_INC(sf_iocnt);
 			if (rv != VM_PAGER_OK) {
 				vm_page_lock(m);
Index: sys/kern/vfs_default.c
===================================================================
--- sys/kern/vfs_default.c	(revision 286413)
+++ sys/kern/vfs_default.c	(working copy)
@@ -731,12 +731,11 @@ vop_stdgetpages(ap)
 		struct vnode *a_vp;
 		vm_page_t *a_m;
 		int a_count;
-		int a_reqpage;
 	} */ *ap;
 {
 
 	return vnode_pager_generic_getpages(ap->a_vp, ap->a_m,
-	    ap->a_count, ap->a_reqpage, NULL, NULL);
+	    ap->a_count, NULL, NULL);
 }
 
 static int
@@ -744,8 +743,8 @@ vop_stdgetpages_async(struct vop_getpages_async_ar
 {
 	int error;
 
-	error = VOP_GETPAGES(ap->a_vp, ap->a_m, ap->a_count, ap->a_reqpage);
-	ap->a_iodone(ap->a_arg, ap->a_m, ap->a_reqpage, error);
+	error = VOP_GETPAGES(ap->a_vp, ap->a_m, ap->a_count);
+	ap->a_iodone(ap->a_arg, ap->a_m, ap->a_count, error);
 	return (error);
 }
 
Index: sys/kern/vnode_if.src
===================================================================
--- sys/kern/vnode_if.src	(revision 286413)
+++ sys/kern/vnode_if.src	(working copy)
@@ -472,7 +472,6 @@ vop_getpages {
 	IN struct vnode *vp;
 	IN vm_page_t *m;
 	IN int count;
-	IN int reqpage;
 };
 
 
@@ -482,7 +481,6 @@ vop_getpages_async {
 	IN struct vnode *vp;
 	IN vm_page_t *m;
 	IN int count;
-	IN int reqpage;
 	IN vop_getpages_iodone_t *iodone;
 	IN void *arg;
 };
Index: sys/sys/buf.h
===================================================================
--- sys/sys/buf.h	(revision 286413)
+++ sys/sys/buf.h	(working copy)
@@ -122,14 +122,9 @@ struct buf {
 	struct	ucred *b_rcred;		/* Read credentials reference. */
 	struct	ucred *b_wcred;		/* Write credentials reference. */
 	union {
-		TAILQ_ENTRY(buf) bu_freelist; /* (Q) */
-		struct {
-			void	(*pg_iodone)(void *, vm_page_t *, int, int);
-			int	pg_reqpage;
-		} bu_pager;
-	} b_union;
-#define	b_freelist	b_union.bu_freelist
-#define	b_pager         b_union.bu_pager
+		TAILQ_ENTRY(buf) b_freelist; /* (Q) */
+		void	(*b_pgiodone)(void *, vm_page_t *, int, int);
+	};
 	union	cluster_info {
 		TAILQ_HEAD(cluster_list_head, buf) cluster_head;
 		TAILQ_ENTRY(buf) cluster_entry;
Index: sys/vm/default_pager.c
===================================================================
--- sys/vm/default_pager.c	(revision 286413)
+++ sys/vm/default_pager.c	(working copy)
@@ -56,7 +56,7 @@ __FBSDID("$FreeBSD$");
 static vm_object_t default_pager_alloc(void *, vm_ooffset_t, vm_prot_t,
     vm_ooffset_t, struct ucred *);
 static void default_pager_dealloc(vm_object_t);
-static int default_pager_getpages(vm_object_t, vm_page_t *, int, int);
+static int default_pager_getpages(vm_object_t, vm_page_t *, int);
 static void default_pager_putpages(vm_object_t, vm_page_t *, int, 
 		boolean_t, int *);
 static boolean_t default_pager_haspage(vm_object_t, vm_pindex_t, int *, 
@@ -122,11 +122,10 @@ default_pager_dealloc(object)
  * see a vm_page with assigned swap here.
  */
 static int
-default_pager_getpages(object, m, count, reqpage)
+default_pager_getpages(object, m, count)
 	vm_object_t object;
 	vm_page_t *m;
 	int count;
-	int reqpage;
 {
 	return VM_PAGER_FAIL;
 }
Index: sys/vm/device_pager.c
===================================================================
--- sys/vm/device_pager.c	(revision 286413)
+++ sys/vm/device_pager.c	(working copy)
@@ -59,7 +59,7 @@ static void dev_pager_init(void);
 static vm_object_t dev_pager_alloc(void *, vm_ooffset_t, vm_prot_t,
     vm_ooffset_t, struct ucred *);
 static void dev_pager_dealloc(vm_object_t);
-static int dev_pager_getpages(vm_object_t, vm_page_t *, int, int);
+static int dev_pager_getpages(vm_object_t, vm_page_t *, int);
 static void dev_pager_putpages(vm_object_t, vm_page_t *, int, 
 		boolean_t, int *);
 static boolean_t dev_pager_haspage(vm_object_t, vm_pindex_t, int *,
@@ -259,33 +259,27 @@ dev_pager_dealloc(object)
 }
 
 static int
-dev_pager_getpages(vm_object_t object, vm_page_t *ma, int count, int reqpage)
+dev_pager_getpages(vm_object_t object, vm_page_t *ma, int count)
 {
-	int error, i;
+	int error;
 
+	/* Since our putpages reports zero after/before, the count is 1. */
+	KASSERT(count == 1, ("%s: count %d", __func__, count));
 	VM_OBJECT_ASSERT_WLOCKED(object);
 	error = object->un_pager.devp.ops->cdev_pg_fault(object,
-	    IDX_TO_OFF(ma[reqpage]->pindex), PROT_READ, &ma[reqpage]);
+	    IDX_TO_OFF(ma[0]->pindex), PROT_READ, &ma[0]);
 
 	VM_OBJECT_ASSERT_WLOCKED(object);
 
-	for (i = 0; i < count; i++) {
-		if (i != reqpage) {
-			vm_page_lock(ma[i]);
-			vm_page_free(ma[i]);
-			vm_page_unlock(ma[i]);
-		}
-	}
-
 	if (error == VM_PAGER_OK) {
 		KASSERT((object->type == OBJT_DEVICE &&
-		     (ma[reqpage]->oflags & VPO_UNMANAGED) != 0) ||
+		     (ma[0]->oflags & VPO_UNMANAGED) != 0) ||
 		    (object->type == OBJT_MGTDEVICE &&
-		     (ma[reqpage]->oflags & VPO_UNMANAGED) == 0),
-		    ("Wrong page type %p %p", ma[reqpage], object));
+		     (ma[0]->oflags & VPO_UNMANAGED) == 0),
+		    ("Wrong page type %p %p", ma[0], object));
 		if (object->type == OBJT_DEVICE) {
 			TAILQ_INSERT_TAIL(&object->un_pager.devp.devp_pglist,
-			    ma[reqpage], plinks.q);
+			    ma[0], plinks.q);
 		}
 	}
 
Index: sys/vm/phys_pager.c
===================================================================
--- sys/vm/phys_pager.c	(revision 286413)
+++ sys/vm/phys_pager.c	(working copy)
@@ -139,7 +139,7 @@ phys_pager_dealloc(vm_object_t object)
  * Fill as many pages as vm_fault has allocated for us.
  */
 static int
-phys_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage)
+phys_pager_getpages(vm_object_t object, vm_page_t *m, int count)
 {
 	int i;
 
@@ -154,13 +154,6 @@ static int
 		    ("phys_pager_getpages: partially valid page %p", m[i]));
 		KASSERT(m[i]->dirty == 0,
 		    ("phys_pager_getpages: dirty page %p", m[i]));
-		/* The requested page must remain busy, the others not. */
-		if (i == reqpage) {
-			vm_page_lock(m[i]);
-			vm_page_flash(m[i]);
-			vm_page_unlock(m[i]);
-		} else
-			vm_page_xunbusy(m[i]);
 	}
 	return (VM_PAGER_OK);
 }
Index: sys/vm/sg_pager.c
===================================================================
--- sys/vm/sg_pager.c	(revision 286413)
+++ sys/vm/sg_pager.c	(working copy)
@@ -49,7 +49,7 @@ __FBSDID("$FreeBSD$");
 static vm_object_t sg_pager_alloc(void *, vm_ooffset_t, vm_prot_t,
     vm_ooffset_t, struct ucred *);
 static void sg_pager_dealloc(vm_object_t);
-static int sg_pager_getpages(vm_object_t, vm_page_t *, int, int);
+static int sg_pager_getpages(vm_object_t, vm_page_t *, int);
 static void sg_pager_putpages(vm_object_t, vm_page_t *, int, 
 		boolean_t, int *);
 static boolean_t sg_pager_haspage(vm_object_t, vm_pindex_t, int *,
@@ -135,7 +135,7 @@ sg_pager_dealloc(vm_object_t object)
 }
 
 static int
-sg_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage)
+sg_pager_getpages(vm_object_t object, vm_page_t *m, int count)
 {
 	struct sglist *sg;
 	vm_page_t m_paddr, page;
@@ -145,11 +145,13 @@ static int
 	size_t space;
 	int i;
 
+	/* Since our putpages reports zero after/before, the count is 1. */
+	KASSERT(count == 1, ("%s: count %d", __func__, count));
 	VM_OBJECT_ASSERT_WLOCKED(object);
 	sg = object->handle;
 	memattr = object->memattr;
 	VM_OBJECT_WUNLOCK(object);
-	offset = m[reqpage]->pindex;
+	offset = m[0]->pindex;
 
 	/*
 	 * Lookup the physical address of the requested page.  An initial
@@ -178,7 +180,7 @@ static int
 	}
 
 	/* Return a fake page for the requested page. */
-	KASSERT(!(m[reqpage]->flags & PG_FICTITIOUS),
+	KASSERT(!(m[0]->flags & PG_FICTITIOUS),
 	    ("backing page for SG is fake"));
 
 	/* Construct a new fake page. */
@@ -185,17 +187,9 @@ static int
 	page = vm_page_getfake(paddr, memattr);
 	VM_OBJECT_WLOCK(object);
 	TAILQ_INSERT_TAIL(&object->un_pager.sgp.sgp_pglist, page, plinks.q);
-
-	/* Free the original pages and insert this fake page into the object. */
-	for (i = 0; i < count; i++) {
-		if (i == reqpage &&
-		    vm_page_replace(page, object, offset) != m[i])
-			panic("sg_pager_getpages: invalid place replacement");
-		vm_page_lock(m[i]);
-		vm_page_free(m[i]);
-		vm_page_unlock(m[i]);
-	}
-	m[reqpage] = page;
+	if (vm_page_replace(page, object, offset) != m[0])
+		panic("sg_pager_getpages: invalid place replacement");
+	m[0] = page;
 	page->valid = VM_PAGE_BITS_ALL;
 
 	return (VM_PAGER_OK);
Index: sys/vm/swap_pager.c
===================================================================
--- sys/vm/swap_pager.c	(revision 286413)
+++ sys/vm/swap_pager.c	(working copy)
@@ -359,8 +359,8 @@ static vm_object_t
 		swap_pager_alloc(void *handle, vm_ooffset_t size,
 		    vm_prot_t prot, vm_ooffset_t offset, struct ucred *);
 static void	swap_pager_dealloc(vm_object_t object);
-static int	swap_pager_getpages(vm_object_t, vm_page_t *, int, int);
-static int	swap_pager_getpages_async(vm_object_t, vm_page_t *, int, int,
+static int	swap_pager_getpages(vm_object_t, vm_page_t *, int);
+static int	swap_pager_getpages_async(vm_object_t, vm_page_t *, int,
     pgo_getpages_iodone_t, void *);
 static void	swap_pager_putpages(vm_object_t, vm_page_t *, int, boolean_t, int *);
 static boolean_t
@@ -415,16 +415,6 @@ static void swp_pager_meta_free(vm_object_t, vm_pi
 static void swp_pager_meta_free_all(vm_object_t);
 static daddr_t swp_pager_meta_ctl(vm_object_t, vm_pindex_t, int);
 
-static void
-swp_pager_free_nrpage(vm_page_t m)
-{
-
-	vm_page_lock(m);
-	if (m->wire_count == 0)
-		vm_page_free(m);
-	vm_page_unlock(m);
-}
-
 /*
  * SWP_SIZECHECK() -	update swap_pager_full indication
  *
@@ -1105,16 +1095,11 @@ swap_pager_unswapped(vm_page_t m)
  *	left busy, but the others adjusted.
  */
 static int
-swap_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage)
+swap_pager_getpages(vm_object_t object, vm_page_t *m, int count)
 {
 	struct buf *bp;
-	vm_page_t mreq;
-	int i;
-	int j;
 	daddr_t blk;
 
-	mreq = m[reqpage];
-
 	/*
 	 * Calculate range to retrieve.  The pages have already been assigned
 	 * their swapblks.  We require a *contiguous* range but we know it to
@@ -1124,45 +1109,18 @@ static int
 	 *
 	 * The swp_*() calls must be made with the object locked.
 	 */
-	blk = swp_pager_meta_ctl(mreq->object, mreq->pindex, 0);
+	blk = swp_pager_meta_ctl(m[0]->object, m[0]->pindex, 0);
 
-	for (i = reqpage - 1; i >= 0; --i) {
-		daddr_t iblk;
-
-		iblk = swp_pager_meta_ctl(m[i]->object, m[i]->pindex, 0);
-		if (blk != iblk + (reqpage - i))
-			break;
-	}
-	++i;
-
-	for (j = reqpage + 1; j < count; ++j) {
-		daddr_t jblk;
-
-		jblk = swp_pager_meta_ctl(m[j]->object, m[j]->pindex, 0);
-		if (blk != jblk - (j - reqpage))
-			break;
-	}
-
-	/*
-	 * free pages outside our collection range.   Note: we never free
-	 * mreq, it must remain busy throughout.
-	 */
-	if (0 < i || j < count) {
-		int k;
-
-		for (k = 0; k < i; ++k)
-			swp_pager_free_nrpage(m[k]);
-		for (k = j; k < count; ++k)
-			swp_pager_free_nrpage(m[k]);
-	}
-
-	/*
-	 * Return VM_PAGER_FAIL if we have nothing to do.  Return mreq
-	 * still busy, but the others unbusied.
-	 */
 	if (blk == SWAPBLK_NONE)
 		return (VM_PAGER_FAIL);
 
+#ifdef INVARIANTS
+	for (int i = 0; i < count; i++)
+		KASSERT(blk + i ==
+		    swp_pager_meta_ctl(m[i]->object, m[i]->pindex, 0),
+		    ("%s: range is not contiguous", __func__));
+#endif
+
 	/*
 	 * Getpbuf() can sleep.
 	 */
@@ -1177,21 +1135,16 @@ static int
 	bp->b_iodone = swp_pager_async_iodone;
 	bp->b_rcred = crhold(thread0.td_ucred);
 	bp->b_wcred = crhold(thread0.td_ucred);
-	bp->b_blkno = blk - (reqpage - i);
-	bp->b_bcount = PAGE_SIZE * (j - i);
-	bp->b_bufsize = PAGE_SIZE * (j - i);
-	bp->b_pager.pg_reqpage = reqpage - i;
+	bp->b_blkno = blk;
+	bp->b_bcount = PAGE_SIZE * count;
+	bp->b_bufsize = PAGE_SIZE * count;
+	bp->b_npages = count;
 
 	VM_OBJECT_WLOCK(object);
-	{
-		int k;
-
-		for (k = i; k < j; ++k) {
-			bp->b_pages[k - i] = m[k];
-			m[k]->oflags |= VPO_SWAPINPROG;
-		}
+	for (int i = 0; i < count; i++) {
+		bp->b_pages[i] = m[i];
+		m[i]->oflags |= VPO_SWAPINPROG;
 	}
-	bp->b_npages = j - i;
 
 	PCPU_INC(cnt.v_swapin);
 	PCPU_ADD(cnt.v_swappgsin, bp->b_npages);
@@ -1223,8 +1176,8 @@ static int
 	 * is set in the meta-data.
 	 */
 	VM_OBJECT_WLOCK(object);
-	while ((mreq->oflags & VPO_SWAPINPROG) != 0) {
-		mreq->oflags |= VPO_SWAPSLEEP;
+	while ((m[0]->oflags & VPO_SWAPINPROG) != 0) {
+		m[0]->oflags |= VPO_SWAPSLEEP;
 		PCPU_INC(cnt.v_intrans);
 		if (VM_OBJECT_SLEEP(object, &object->paging_in_progress, PSWP,
 		    "swread", hz * 20)) {
@@ -1235,16 +1188,14 @@ static int
 	}
 
 	/*
-	 * mreq is left busied after completion, but all the other pages
-	 * are freed.  If we had an unrecoverable read error the page will
-	 * not be valid.
+	 * If we had an unrecoverable read error pages will not be valid.
 	 */
-	if (mreq->valid != VM_PAGE_BITS_ALL) {
-		return (VM_PAGER_ERROR);
-	} else {
-		return (VM_PAGER_OK);
-	}
+	for (int i = 0; i < count; i++)
+		if (m[i]->valid != VM_PAGE_BITS_ALL)
+			return (VM_PAGER_ERROR);
 
+	return (VM_PAGER_OK);
+
 	/*
 	 * A final note: in a low swap situation, we cannot deallocate swap
 	 * and mark a page dirty here because the caller is likely to mark
@@ -1261,11 +1212,11 @@ static int
  */
 static int
 swap_pager_getpages_async(vm_object_t object, vm_page_t *m, int count,
-    int reqpage, pgo_getpages_iodone_t iodone, void *arg)
+    pgo_getpages_iodone_t iodone, void *arg)
 {
 	int r, error;
 
-	r = swap_pager_getpages(object, m, count, reqpage);
+	r = swap_pager_getpages(object, m, count);
 	VM_OBJECT_WUNLOCK(object);
 	switch (r) {
 	case VM_PAGER_OK:
@@ -1529,33 +1480,11 @@ swp_pager_async_iodone(struct buf *bp)
 			 */
 			if (bp->b_iocmd == BIO_READ) {
 				/*
-				 * When reading, reqpage needs to stay
-				 * locked for the parent, but all other
-				 * pages can be freed.  We still want to
-				 * wakeup the parent waiting on the page,
-				 * though.  ( also: pg_reqpage can be -1 and
-				 * not match anything ).
-				 *
-				 * We have to wake specifically requested pages
-				 * up too because we cleared VPO_SWAPINPROG and
-				 * someone may be waiting for that.
-				 *
 				 * NOTE: for reads, m->dirty will probably
 				 * be overridden by the original caller of
 				 * getpages so don't play cute tricks here.
 				 */
 				m->valid = 0;
-				if (i != bp->b_pager.pg_reqpage)
-					swp_pager_free_nrpage(m);
-				else {
-					vm_page_lock(m);
-					vm_page_flash(m);
-					vm_page_unlock(m);
-				}
-				/*
-				 * If i == bp->b_pager.pg_reqpage, do not wake
-				 * the page up.  The caller needs to.
-				 */
 			} else {
 				/*
 				 * If a write error occurs, reactivate page
@@ -1577,38 +1506,12 @@ swp_pager_async_iodone(struct buf *bp)
 			 * want to do that anyway, but it was an optimization
 			 * that existed in the old swapper for a time before
 			 * it got ripped out due to precisely this problem.
-			 *
-			 * If not the requested page then deactivate it.
-			 *
-			 * Note that the requested page, reqpage, is left
-			 * busied, but we still have to wake it up.  The
-			 * other pages are released (unbusied) by
-			 * vm_page_xunbusy().
 			 */
 			KASSERT(!pmap_page_is_mapped(m),
 			    ("swp_pager_async_iodone: page %p is mapped", m));
-			m->valid = VM_PAGE_BITS_ALL;
 			KASSERT(m->dirty == 0,
 			    ("swp_pager_async_iodone: page %p is dirty", m));
-
-			/*
-			 * We have to wake specifically requested pages
-			 * up too because we cleared VPO_SWAPINPROG and
-			 * could be waiting for it in getpages.  However,
-			 * be sure to not unbusy getpages specifically
-			 * requested page - getpages expects it to be
-			 * left busy.
-			 */
-			if (i != bp->b_pager.pg_reqpage) {
-				vm_page_lock(m);
-				vm_page_deactivate(m);
-				vm_page_unlock(m);
-				vm_page_xunbusy(m);
-			} else {
-				vm_page_lock(m);
-				vm_page_flash(m);
-				vm_page_unlock(m);
-			}
+			m->valid = VM_PAGE_BITS_ALL;
 		} else {
 			/*
 			 * For write success, clear the dirty
@@ -1729,7 +1632,7 @@ swp_pager_force_pagein(vm_object_t object, vm_pind
 		return;
 	}
 
-	if (swap_pager_getpages(object, &m, 1, 0) != VM_PAGER_OK)
+	if (swap_pager_getpages(object, &m, 1) != VM_PAGER_OK)
 		panic("swap_pager_force_pagein: read from swap failed");/*XXX*/
 	vm_object_pip_wakeup(object);
 	vm_page_dirty(m);
Index: sys/vm/vm_fault.c
===================================================================
--- sys/vm/vm_fault.c	(revision 286413)
+++ sys/vm/vm_fault.c	(working copy)
@@ -669,14 +669,20 @@ vnode_locked:
 			    fs.m, behind, ahead, marray, &reqpage);
 
 			rv = faultcount ?
-			    vm_pager_get_pages(fs.object, marray, faultcount,
-				reqpage) : VM_PAGER_FAIL;
+			    vm_pager_get_pages(fs.object, marray, faultcount) :
+			    VM_PAGER_FAIL;
 
 			if (rv == VM_PAGER_OK) {
 				/*
 				 * Found the page. Leave it busy while we play
-				 * with it.
-				 *
+				 * with it.  Unbusy companion pages.
+				 */
+				for (int i = 0; i < faultcount; i++) {
+					if (i == reqpage)
+						continue;
+					vm_page_readahead_finish(marray[i]);
+				}
+				/*
 				 * Pager could have changed the page.  Pager
 				 * is responsible for disposition of old page
 				 * if moved.
Index: sys/vm/vm_glue.c
===================================================================
--- sys/vm/vm_glue.c	(revision 286413)
+++ sys/vm/vm_glue.c	(working copy)
@@ -238,7 +238,7 @@ vm_imgact_hold_page(vm_object_t object, vm_ooffset
 	pindex = OFF_TO_IDX(offset);
 	m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL);
 	if (m->valid != VM_PAGE_BITS_ALL) {
-		rv = vm_pager_get_pages(object, &m, 1, 0);
+		rv = vm_pager_get_pages(object, &m, 1);
 		if (rv != VM_PAGER_OK) {
 			vm_page_lock(m);
 			vm_page_free(m);
@@ -567,37 +567,37 @@ vm_thread_swapin(struct thread *td)
 {
 	vm_object_t ksobj;
 	vm_page_t ma[KSTACK_MAX_PAGES];
-	int i, j, pages, rv;
+	int pages;
 
 	pages = td->td_kstack_pages;
 	ksobj = td->td_kstack_obj;
 	VM_OBJECT_WLOCK(ksobj);
-	for (i = 0; i < pages; i++)
+	for (int i = 0; i < pages; i++)
 		ma[i] = vm_page_grab(ksobj, i, VM_ALLOC_NORMAL |
 		    VM_ALLOC_WIRED);
-	for (i = 0; i < pages; i++) {
-		if (ma[i]->valid != VM_PAGE_BITS_ALL) {
-			vm_page_assert_xbusied(ma[i]);
-			vm_object_pip_add(ksobj, 1);
-			for (j = i + 1; j < pages; j++) {
-				if (ma[j]->valid != VM_PAGE_BITS_ALL)
-					vm_page_assert_xbusied(ma[j]);
-				if (ma[j]->valid == VM_PAGE_BITS_ALL)
-					break;
-			}
-			rv = vm_pager_get_pages(ksobj, ma + i, j - i, 0);
-			if (rv != VM_PAGER_OK)
-	panic("vm_thread_swapin: cannot get kstack for proc: %d",
-				    td->td_proc->p_pid);
-			/*
-			 * All pages in the array are in place, due to the
-			 * pager is always the swap pager, which doesn't
-			 * free or remove wired non-req pages from object.
-			 */
-			vm_object_pip_wakeup(ksobj);
+	for (int i = 0; i < pages;) {
+		int j, a, count, rv;
+
+		vm_page_assert_xbusied(ma[i]);
+		if (ma[i]->valid == VM_PAGE_BITS_ALL) {
 			vm_page_xunbusy(ma[i]);
-		} else if (vm_page_xbusied(ma[i]))
-			vm_page_xunbusy(ma[i]);
+			i++;
+			continue;
+		}
+		vm_object_pip_add(ksobj, 1);
+		for (j = i + 1; j < pages; j++)
+			if (ma[j]->valid == VM_PAGE_BITS_ALL)
+				break;
+		rv = vm_pager_has_page(ksobj, ma[i]->pindex, NULL, &a);
+		KASSERT(rv == 1, ("%s: missing page %p", __func__, ma[i]));
+		count = min(a + 1, j - i);
+		rv = vm_pager_get_pages(ksobj, ma + i, count);
+		KASSERT(rv == VM_PAGER_OK, ("%s: cannot get kstack for proc %d",
+		    __func__, td->td_proc->p_pid));
+		vm_object_pip_wakeup(ksobj);
+		for (j = i; j < i + count; j++)
+			vm_page_xunbusy(ma[j]);
+		i += count;
 	}
 	VM_OBJECT_WUNLOCK(ksobj);
 	pmap_qenter(td->td_kstack, ma, pages);
Index: sys/vm/vm_object.c
===================================================================
--- sys/vm/vm_object.c	(revision 286413)
+++ sys/vm/vm_object.c	(working copy)
@@ -2036,7 +2036,7 @@ vm_object_populate(vm_object_t object, vm_pindex_t
 	for (pindex = start; pindex < end; pindex++) {
 		m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL);
 		if (m->valid != VM_PAGE_BITS_ALL) {
-			rv = vm_pager_get_pages(object, &m, 1, 0);
+			rv = vm_pager_get_pages(object, &m, 1);
 			if (rv != VM_PAGER_OK) {
 				vm_page_lock(m);
 				vm_page_free(m);
Index: sys/vm/vm_page.c
===================================================================
--- sys/vm/vm_page.c	(revision 286413)
+++ sys/vm/vm_page.c	(working copy)
@@ -977,38 +977,28 @@ vm_page_free_zero(vm_page_t m)
 
 /*
  * Unbusy and handle the page queueing for a page from the VOP_GETPAGES()
- * array which is not the request page.
+ * array which was read ahead.
  */
 void
 vm_page_readahead_finish(vm_page_t m)
 {
 
-	if (m->valid != 0) {
-		/*
-		 * Since the page is not the requested page, whether
-		 * it should be activated or deactivated is not
-		 * obvious.  Empirical results have shown that
-		 * deactivating the page is usually the best choice,
-		 * unless the page is wanted by another thread.
-		 */
-		vm_page_lock(m);
-		if ((m->busy_lock & VPB_BIT_WAITERS) != 0)
-			vm_page_activate(m);
-		else
-			vm_page_deactivate(m);
-		vm_page_unlock(m);
-		vm_page_xunbusy(m);
-	} else {
-		/*
-		 * Free the completely invalid page.  Such page state
-		 * occurs due to the short read operation which did
-		 * not covered our page at all, or in case when a read
-		 * error happens.
-		 */
-		vm_page_lock(m);
-		vm_page_free(m);
-		vm_page_unlock(m);
-	}
+	/* We shouldn't put invalid pages on queues. */
+	KASSERT(m->valid != 0, ("%s: %p is invalid", __func__, m));
+
+	/*
+	 * Since the page is not the actually needed one, whether it should
+	 * be activated or deactivated is not obvious.  Empirical results
+	 * have shown that deactivating the page is usually the best choice,
+	 * unless the page is wanted by another thread.
+	 */
+	vm_page_lock(m);
+	if ((m->busy_lock & VPB_BIT_WAITERS) != 0)
+		vm_page_activate(m);
+	else
+		vm_page_deactivate(m);
+	vm_page_unlock(m);
+	vm_page_xunbusy(m);
 }
 
 /*
Index: sys/vm/vm_pager.c
===================================================================
--- sys/vm/vm_pager.c	(revision 286413)
+++ sys/vm/vm_pager.c	(working copy)
@@ -282,45 +282,45 @@ vm_pager_assert_in(vm_object_t object, vm_page_t *
  * The requested page must be fully valid on successful return.
  */
 int
-vm_pager_get_pages(vm_object_t object, vm_page_t *m, int count, int reqpage)
+vm_pager_get_pages(vm_object_t object, vm_page_t *m, int count)
 {
+#ifdef INVARIANTS
+	vm_pindex_t pindex = m[0]->pindex;
+#endif
 	int r;
 
 	vm_pager_assert_in(object, m, count);
 
-	r = (*pagertab[object->type]->pgo_getpages)(object, m, count, reqpage);
+	r = (*pagertab[object->type]->pgo_getpages)(object, m, count);
 	if (r != VM_PAGER_OK)
 		return (r);
 
-	/*
-	 * If pager has replaced the page, assert that it had
-	 * updated the array.  Also assert that page is still
-	 * busied.
-	 */
-	KASSERT(m[reqpage] == vm_page_lookup(object, m[reqpage]->pindex),
-	    ("%s: mismatch page %p pindex %ju", __func__,
-	    m[reqpage], (uintmax_t )m[reqpage]->pindex));
-	vm_page_assert_xbusied(m[reqpage]);
-
-	/*
-	 * Pager didn't fill up entire page.  Zero out
-	 * partially filled data.
-	 */
-	if (m[reqpage]->valid != VM_PAGE_BITS_ALL)
-		vm_page_zero_invalid(m[reqpage], TRUE);
-
+	for (int i = 0; i < count; i++) {
+		/*
+		 * If pager has replaced a page, assert that it had
+		 * updated the array.
+		 */
+		KASSERT(m[i] == vm_page_lookup(object, pindex++),
+		    ("%s: mismatch page %p pindex %ju", __func__,
+		    m[i], (uintmax_t )pindex - 1));
+		/*
+		 * Zero out partially filled data.
+		 */
+		if (m[i]->valid != VM_PAGE_BITS_ALL)
+			vm_page_zero_invalid(m[i], TRUE);
+	}
 	return (VM_PAGER_OK);
 }
 
 int
 vm_pager_get_pages_async(vm_object_t object, vm_page_t *m, int count,
-    int reqpage, pgo_getpages_iodone_t iodone, void *arg)
+    pgo_getpages_iodone_t iodone, void *arg)
 {
 
 	vm_pager_assert_in(object, m, count);
 
 	return ((*pagertab[object->type]->pgo_getpages_async)(object, m,
-	    count, reqpage, iodone, arg));
+	    count, iodone, arg));
 }
 
 /*
@@ -355,39 +355,6 @@ vm_pager_object_lookup(struct pagerlst *pg_list, v
 }
 
 /*
- * Free the non-requested pages from the given array.  To remove all pages,
- * caller should provide out of range reqpage number.
- */
-void
-vm_pager_free_nonreq(vm_object_t object, vm_page_t ma[], int reqpage,
-    int npages, boolean_t object_locked)
-{
-	enum { UNLOCKED, CALLER_LOCKED, INTERNALLY_LOCKED } locked;
-	int i;
-
-	if (object_locked) {
-		VM_OBJECT_ASSERT_WLOCKED(object);
-		locked = CALLER_LOCKED;
-	} else {
-		VM_OBJECT_ASSERT_UNLOCKED(object);
-		locked = UNLOCKED;
-	}
-	for (i = 0; i < npages; ++i) {
-		if (i != reqpage) {
-			if (locked == UNLOCKED) {
-				VM_OBJECT_WLOCK(object);
-				locked = INTERNALLY_LOCKED;
-			}
-			vm_page_lock(ma[i]);
-			vm_page_free(ma[i]);
-			vm_page_unlock(ma[i]);
-		}
-	}
-	if (locked == INTERNALLY_LOCKED)
-		VM_OBJECT_WUNLOCK(object);
-}
-
-/*
  * initialize a physical buffer
  */
 
Index: sys/vm/vm_pager.h
===================================================================
--- sys/vm/vm_pager.h	(revision 286413)
+++ sys/vm/vm_pager.h	(working copy)
@@ -50,9 +50,9 @@ typedef void pgo_init_t(void);
 typedef vm_object_t pgo_alloc_t(void *, vm_ooffset_t, vm_prot_t, vm_ooffset_t,
     struct ucred *);
 typedef void pgo_dealloc_t(vm_object_t);
-typedef int pgo_getpages_t(vm_object_t, vm_page_t *, int, int);
+typedef int pgo_getpages_t(vm_object_t, vm_page_t *, int);
 typedef void pgo_getpages_iodone_t(void *, vm_page_t *, int, int);
-typedef int pgo_getpages_async_t(vm_object_t, vm_page_t *, int, int,
+typedef int pgo_getpages_async_t(vm_object_t, vm_page_t *, int,
     pgo_getpages_iodone_t, void *);
 typedef void pgo_putpages_t(vm_object_t, vm_page_t *, int, int, int *);
 typedef boolean_t pgo_haspage_t(vm_object_t, vm_pindex_t, int *, int *);
@@ -106,14 +106,12 @@ vm_object_t vm_pager_allocate(objtype_t, void *, v
     vm_ooffset_t, struct ucred *);
 void vm_pager_bufferinit(void);
 void vm_pager_deallocate(vm_object_t);
-int vm_pager_get_pages(vm_object_t, vm_page_t *, int, int);
-int vm_pager_get_pages_async(vm_object_t, vm_page_t *, int, int,
+int vm_pager_get_pages(vm_object_t, vm_page_t *, int);
+int vm_pager_get_pages_async(vm_object_t, vm_page_t *, int,
     pgo_getpages_iodone_t, void *);
 static __inline boolean_t vm_pager_has_page(vm_object_t, vm_pindex_t, int *, int *);
 void vm_pager_init(void);
 vm_object_t vm_pager_object_lookup(struct pagerlst *, void *);
-void vm_pager_free_nonreq(vm_object_t object, vm_page_t ma[], int reqpage,
-    int npages, boolean_t object_locked);
 
 static __inline void
 vm_pager_put_pages(
Index: sys/vm/vnode_pager.c
===================================================================
--- sys/vm/vnode_pager.c	(revision 286413)
+++ sys/vm/vnode_pager.c	(working copy)
@@ -84,11 +84,9 @@ static int vnode_pager_addr(struct vnode *vp, vm_o
 static int vnode_pager_input_smlfs(vm_object_t object, vm_page_t m);
 static int vnode_pager_input_old(vm_object_t object, vm_page_t m);
 static void vnode_pager_dealloc(vm_object_t);
-static int vnode_pager_local_getpages0(struct vnode *, vm_page_t *, int, int,
+static int vnode_pager_getpages(vm_object_t, vm_page_t *, int);
+static int vnode_pager_getpages_async(vm_object_t, vm_page_t *, int,
     vop_getpages_iodone_t, void *);
-static int vnode_pager_getpages(vm_object_t, vm_page_t *, int, int);
-static int vnode_pager_getpages_async(vm_object_t, vm_page_t *, int, int,
-    vop_getpages_iodone_t, void *);
 static void vnode_pager_putpages(vm_object_t, vm_page_t *, int, int, int *);
 static boolean_t vnode_pager_haspage(vm_object_t, vm_pindex_t, int *, int *);
 static vm_object_t vnode_pager_alloc(void *, vm_ooffset_t, vm_prot_t,
@@ -673,7 +671,7 @@ vnode_pager_input_old(vm_object_t object, vm_page_
  * backing vp's VOP_GETPAGES.
  */
 static int
-vnode_pager_getpages(vm_object_t object, vm_page_t *m, int count, int reqpage)
+vnode_pager_getpages(vm_object_t object, vm_page_t *m, int count)
 {
 	int rtval;
 	struct vnode *vp;
@@ -681,7 +679,7 @@ static int
 
 	vp = object->handle;
 	VM_OBJECT_WUNLOCK(object);
-	rtval = VOP_GETPAGES(vp, m, bytes, reqpage);
+	rtval = VOP_GETPAGES(vp, m, bytes);
 	KASSERT(rtval != EOPNOTSUPP,
 	    ("vnode_pager: FS getpages not implemented\n"));
 	VM_OBJECT_WLOCK(object);
@@ -690,7 +688,7 @@ static int
 
 static int
 vnode_pager_getpages_async(vm_object_t object, vm_page_t *m, int count,
-    int reqpage, vop_getpages_iodone_t iodone, void *arg)
+    vop_getpages_iodone_t iodone, void *arg)
 {
 	struct vnode *vp;
 	int rtval;
@@ -697,8 +695,7 @@ vnode_pager_getpages_async(vm_object_t object, vm_
 
 	vp = object->handle;
 	VM_OBJECT_WUNLOCK(object);
-	rtval = VOP_GETPAGES_ASYNC(vp, m, count * PAGE_SIZE, reqpage,
-	    iodone, arg);
+	rtval = VOP_GETPAGES_ASYNC(vp, m, count * PAGE_SIZE, iodone, arg);
 	KASSERT(rtval != EOPNOTSUPP,
 	    ("vnode_pager: FS getpages_async not implemented\n"));
 	VM_OBJECT_WLOCK(object);
@@ -714,8 +711,8 @@ int
 vnode_pager_local_getpages(struct vop_getpages_args *ap)
 {
 
-	return (vnode_pager_local_getpages0(ap->a_vp, ap->a_m, ap->a_count,
-	    ap->a_reqpage, NULL, NULL));
+	return (vnode_pager_generic_getpages(ap->a_vp, ap->a_m, ap->a_count,
+	    NULL, NULL));
 }
 
 int
@@ -722,42 +719,10 @@ int
 vnode_pager_local_getpages_async(struct vop_getpages_async_args *ap)
 {
 
-	return (vnode_pager_local_getpages0(ap->a_vp, ap->a_m, ap->a_count,
-	    ap->a_reqpage, ap->a_iodone, ap->a_arg));
+	return (vnode_pager_generic_getpages(ap->a_vp, ap->a_m, ap->a_count,
+	    ap->a_iodone, ap->a_arg));
 }
 
-static int
-vnode_pager_local_getpages0(struct vnode *vp, vm_page_t *m, int bytecount,
-    int reqpage, vop_getpages_iodone_t iodone, void *arg)
-{
-	vm_page_t mreq;
-
-	mreq = m[reqpage];
-
-	/*
-	 * Since the caller has busied the requested page, that page's valid
-	 * field will not be changed by other threads.
-	 */
-	vm_page_assert_xbusied(mreq);
-
-	/*
-	 * The requested page has valid blocks.  Invalid part can only
-	 * exist at the end of file, and the page is made fully valid
-	 * by zeroing in vm_pager_get_pages().  Free non-requested
-	 * pages, since no i/o is done to read its content.
-	 */
-	if (mreq->valid != 0) {
-		vm_pager_free_nonreq(mreq->object, m, reqpage,
-		    round_page(bytecount) / PAGE_SIZE, FALSE);
-		if (iodone != NULL)
-			iodone(arg, m, reqpage, 0);
-		return (VM_PAGER_OK);
-	}
-
-	return (vnode_pager_generic_getpages(vp, m, bytecount, reqpage,
-	    iodone, arg));
-}
-
 /*
  * This is now called from local media FS's to operate against their
  * own vnodes if they fail to implement VOP_GETPAGES.
@@ -764,31 +729,47 @@ vnode_pager_local_getpages_async(struct vop_getpag
  */
 int
 vnode_pager_generic_getpages(struct vnode *vp, vm_page_t *m, int bytecount,
-    int reqpage, vop_getpages_iodone_t iodone, void *arg)
+    vop_getpages_iodone_t iodone, void *arg)
 {
 	vm_object_t object;
 	off_t foff;
-	int i, j, size, bsize, first, *freecnt;
-	daddr_t firstaddr, reqblock;
+	int error, count, bsize, i, after, secmask, *freecnt;
+	daddr_t reqblock;
 	struct bufobj *bo;
-	int runpg;
-	int runend;
 	struct buf *bp;
-	int count;
-	int error;
 
-	object = vp->v_object;
-	count = bytecount / PAGE_SIZE;
+	KASSERT(vp->v_type != VCHR && vp->v_type != VBLK,
+	    ("%s does not support devices", __func__));
+	KASSERT(bytecount > 0 && (bytecount & ~PAGE_MASK) == bytecount,
+	    ("%s: bytecount %d", __func__, bytecount));
 
-	KASSERT(vp->v_type != VCHR && vp->v_type != VBLK,
-	    ("vnode_pager_generic_getpages does not support devices"));
 	if (vp->v_iflag & VI_DOOMED)
 		return VM_PAGER_BAD;
 
+	object = vp->v_object;
+	foff = IDX_TO_OFF(m[0]->pindex);
+
+	KASSERT(foff < object->un_pager.vnp.vnp_size,
+	    ("%s: page %p offset beyond vp %p size", __func__, m[0], vp));
+
+	count = bytecount >> PAGE_SHIFT;
 	bsize = vp->v_mount->mnt_stat.f_iosize;
-	foff = IDX_TO_OFF(m[reqpage]->pindex);
 
 	/*
+	 * The last page has valid blocks.  Invalid part can only
+	 * exist at the end of file, and the page is made fully valid
+	 * by zeroing in vm_pager_get_pages().
+	 */
+	if (m[count - 1]->valid != 0) {
+		if ( --count == 0) {
+			if (iodone != NULL)
+				iodone(arg, m, 1, 0);
+			return (VM_PAGER_OK);
+		}
+		bytecount = count << PAGE_SHIFT;
+	}
+
+	/*
 	 * Synchronous and asynchronous paging operations use different
 	 * free pbuf counters.  This is done to avoid asynchronous requests
 	 * to consume all pbufs.
@@ -805,168 +786,55 @@ vnode_pager_generic_getpages(struct vnode *vp, vm_
 	 * If the file system doesn't support VOP_BMAP, use old way of
 	 * getting pages via VOP_READ.
 	 */
-	error = VOP_BMAP(vp, foff / bsize, &bo, &reqblock, NULL, NULL);
+	error = VOP_BMAP(vp, foff / bsize, &bo, &reqblock, &after, NULL);
 	if (error == EOPNOTSUPP) {
 		relpbuf(bp, freecnt);
 		VM_OBJECT_WLOCK(object);
-		for (i = 0; i < count; i++)
-			if (i != reqpage) {
-				vm_page_lock(m[i]);
-				vm_page_free(m[i]);
-				vm_page_unlock(m[i]);
-			}
-		PCPU_INC(cnt.v_vnodein);
-		PCPU_INC(cnt.v_vnodepgsin);
-		error = vnode_pager_input_old(object, m[reqpage]);
+		for (i = 0; i < count; i++) {
+			PCPU_INC(cnt.v_vnodein);
+			PCPU_INC(cnt.v_vnodepgsin);
+			error = vnode_pager_input_old(object, m[i]);
+			if (error)
+				break;
+		}
 		VM_OBJECT_WUNLOCK(object);
 		return (error);
 	} else if (error != 0) {
 		relpbuf(bp, freecnt);
-		vm_pager_free_nonreq(object, m, reqpage, count, FALSE);
 		return (VM_PAGER_ERROR);
-
-		/*
-		 * if the blocksize is smaller than a page size, then use
-		 * special small filesystem code.  NFS sometimes has a small
-		 * blocksize, but it can handle large reads itself.
-		 */
-	} else if ((PAGE_SIZE / bsize) > 1 &&
-	    (vp->v_mount->mnt_stat.f_type != nfs_mount_type)) {
-		relpbuf(bp, freecnt);
-		vm_pager_free_nonreq(object, m, reqpage, count, FALSE);
-		PCPU_INC(cnt.v_vnodein);
-		PCPU_INC(cnt.v_vnodepgsin);
-		return vnode_pager_input_smlfs(object, m[reqpage]);
 	}
 
 	/*
-	 * Since the caller has busied the requested page, that page's valid
-	 * field will not be changed by other threads.
+	 * If the blocksize is smaller than a page size, then use
+	 * special small filesystem code.  NFS sometimes has a small
+	 * blocksize, but it can handle large reads itself.
 	 */
-	vm_page_assert_xbusied(m[reqpage]);
-
-	/*
-	 * If we have a completely valid page available to us, we can
-	 * clean up and return.  Otherwise we have to re-read the
-	 * media.
-	 */
-	if (m[reqpage]->valid == VM_PAGE_BITS_ALL) {
+	if ((PAGE_SIZE / bsize) > 1 &&
+	    (vp->v_mount->mnt_stat.f_type != nfs_mount_type)) {
 		relpbuf(bp, freecnt);
-		vm_pager_free_nonreq(object, m, reqpage, count, FALSE);
-		return (VM_PAGER_OK);
-	} else if (reqblock == -1) {
-		relpbuf(bp, freecnt);
-		pmap_zero_page(m[reqpage]);
-		KASSERT(m[reqpage]->dirty == 0,
-		    ("vnode_pager_generic_getpages: page %p is dirty", m));
-		VM_OBJECT_WLOCK(object);
-		m[reqpage]->valid = VM_PAGE_BITS_ALL;
-		vm_pager_free_nonreq(object, m, reqpage, count, TRUE);
-		VM_OBJECT_WUNLOCK(object);
-		return (VM_PAGER_OK);
-	} else if (m[reqpage]->valid != 0) {
-		VM_OBJECT_WLOCK(object);
-		m[reqpage]->valid = 0;
-		VM_OBJECT_WUNLOCK(object);
-	}
-
-	/*
-	 * here on direct device I/O
-	 */
-	firstaddr = -1;
-
-	/*
-	 * calculate the run that includes the required page
-	 */
-	for (first = 0, i = 0; i < count; i = runend) {
-		if (vnode_pager_addr(vp, IDX_TO_OFF(m[i]->pindex), &firstaddr,
-		    &runpg) != 0) {
-			relpbuf(bp, freecnt);
-			/* The requested page may be out of range. */
-			vm_pager_free_nonreq(object, m + i, reqpage - i,
-			    count - i, FALSE);
-			return (VM_PAGER_ERROR);
+		for (i = 0; i < count; i++) {
+			PCPU_INC(cnt.v_vnodein);
+			PCPU_INC(cnt.v_vnodepgsin);
+			error = vnode_pager_input_smlfs(object, m[i]);
+			if (error)
+				break;
 		}
-		if (firstaddr == -1) {
-			VM_OBJECT_WLOCK(object);
-			if (i == reqpage && foff < object->un_pager.vnp.vnp_size) {
-				panic("vnode_pager_getpages: unexpected missing page: firstaddr: %jd, foff: 0x%jx%08jx, vnp_size: 0x%jx%08jx",
-				    (intmax_t)firstaddr, (uintmax_t)(foff >> 32),
-				    (uintmax_t)foff,
-				    (uintmax_t)
-				    (object->un_pager.vnp.vnp_size >> 32),
-				    (uintmax_t)object->un_pager.vnp.vnp_size);
-			}
-			vm_page_lock(m[i]);
-			vm_page_free(m[i]);
-			vm_page_unlock(m[i]);
-			VM_OBJECT_WUNLOCK(object);
-			runend = i + 1;
-			first = runend;
-			continue;
-		}
-		runend = i + runpg;
-		if (runend <= reqpage) {
-			VM_OBJECT_WLOCK(object);
-			for (j = i; j < runend; j++) {
-				vm_page_lock(m[j]);
-				vm_page_free(m[j]);
-				vm_page_unlock(m[j]);
-			}
-			VM_OBJECT_WUNLOCK(object);
-		} else {
-			if (runpg < (count - first)) {
-				VM_OBJECT_WLOCK(object);
-				for (i = first + runpg; i < count; i++) {
-					vm_page_lock(m[i]);
-					vm_page_free(m[i]);
-					vm_page_unlock(m[i]);
-				}
-				VM_OBJECT_WUNLOCK(object);
-				count = first + runpg;
-			}
-			break;
-		}
-		first = runend;
+		return (error);
 	}
 
 	/*
-	 * the first and last page have been calculated now, move input pages
-	 * to be zero based...
+	 * Truncate bytecount to vnode real size and round up physical size
+	 * for real devices.
 	 */
-	if (first != 0) {
-		m += first;
-		count -= first;
-		reqpage -= first;
-	}
+	if ((foff + bytecount) > object->un_pager.vnp.vnp_size)
+		bytecount = object->un_pager.vnp.vnp_size - foff;
+	secmask = bo->bo_bsize - 1;
+	KASSERT(secmask < PAGE_SIZE && secmask > 0,
+	    ("%s: sector size %d too large", __func__, secmask + 1));
+	bytecount = (bytecount + secmask) & ~secmask;
 
 	/*
-	 * calculate the file virtual address for the transfer
-	 */
-	foff = IDX_TO_OFF(m[0]->pindex);
-
-	/*
-	 * calculate the size of the transfer
-	 */
-	size = count * PAGE_SIZE;
-	KASSERT(count > 0, ("zero count"));
-	if ((foff + size) > object->un_pager.vnp.vnp_size)
-		size = object->un_pager.vnp.vnp_size - foff;
-	KASSERT(size > 0, ("zero size"));
-
-	/*
-	 * round up physical size for real devices.
-	 */
-	if (1) {
-		int secmask = bo->bo_bsize - 1;
-		KASSERT(secmask < PAGE_SIZE && secmask > 0,
-		    ("vnode_pager_generic_getpages: sector size %d too large",
-		    secmask + 1));
-		size = (size + secmask) & ~secmask;
-	}
-
-	/*
-	 * and map the pages to be read into the kva, if the filesystem
+	 * And map the pages to be read into the kva, if the filesystem
 	 * requires mapped buffers.
 	 */
 	if ((vp->v_mount->mnt_kern_flag & MNTK_UNMAPPED_BUFS) != 0 &&
@@ -978,38 +846,33 @@ vnode_pager_generic_getpages(struct vnode *vp, vm_
 		pmap_qenter((vm_offset_t)bp->b_data, m, count);
 	}
 
-	/* build a minimal buffer header */
+	/* Build a minimal buffer header. */
 	bp->b_iocmd = BIO_READ;
 	KASSERT(bp->b_rcred == NOCRED, ("leaking read ucred"));
 	KASSERT(bp->b_wcred == NOCRED, ("leaking write ucred"));
 	bp->b_rcred = crhold(curthread->td_ucred);
 	bp->b_wcred = crhold(curthread->td_ucred);
-	bp->b_blkno = firstaddr;
+	bp->b_blkno = reqblock + ((foff % bsize) / DEV_BSIZE);
 	pbgetbo(bo, bp);
 	bp->b_vp = vp;
-	bp->b_bcount = size;
-	bp->b_bufsize = size;
-	bp->b_runningbufspace = bp->b_bufsize;
+	bp->b_bcount = bp->b_bufsize = bp->b_runningbufspace = bytecount;
 	for (i = 0; i < count; i++)
 		bp->b_pages[i] = m[i];
 	bp->b_npages = count;
-	bp->b_pager.pg_reqpage = reqpage;
+	bp->b_iooffset = dbtob(bp->b_blkno);
+
 	atomic_add_long(&runningbufspace, bp->b_runningbufspace);
-
 	PCPU_INC(cnt.v_vnodein);
 	PCPU_ADD(cnt.v_vnodepgsin, count);
 
-	/* do the input */
-	bp->b_iooffset = dbtob(bp->b_blkno);
-
 	if (iodone != NULL) { /* async */
-		bp->b_pager.pg_iodone = iodone;
+		bp->b_pgiodone = iodone;
 		bp->b_caller1 = arg;
 		bp->b_iodone = vnode_pager_generic_getpages_done_async;
 		bp->b_flags |= B_ASYNC;
 		BUF_KERNPROC(bp);
 		bstrategy(bp);
-		/* Good bye! */
+		return (VM_PAGER_OK);
 	} else {
 		bp->b_iodone = bdone;
 		bstrategy(bp);
@@ -1020,9 +883,8 @@ vnode_pager_generic_getpages(struct vnode *vp, vm_
 		bp->b_vp = NULL;
 		pbrelbo(bp);
 		relpbuf(bp, &vnode_pbuf_freecnt);
+		return (error != 0 ? VM_PAGER_ERROR : VM_PAGER_OK);
 	}
-
-	return (error != 0 ? VM_PAGER_ERROR : VM_PAGER_OK);
 }
 
 static void
@@ -1031,8 +893,7 @@ vnode_pager_generic_getpages_done_async(struct buf
 	int error;
 
 	error = vnode_pager_generic_getpages_done(bp);
-	bp->b_pager.pg_iodone(bp->b_caller1, bp->b_pages,
-	  bp->b_pager.pg_reqpage, error);
+	bp->b_pgiodone(bp->b_caller1, bp->b_pages, bp->b_npages, error);
 	for (int i = 0; i < bp->b_npages; i++)
 		bp->b_pages[i] = NULL;
 	bp->b_vp = NULL;
@@ -1095,9 +956,6 @@ vnode_pager_generic_getpages_done(struct buf *bp)
 			    object->un_pager.vnp.vnp_size - tfoff)) == 0,
 			    ("%s: page %p is dirty", __func__, mt));
 		}
-		
-		if (i != bp->b_pager.pg_reqpage)
-			vm_page_readahead_finish(mt);
 	}
 	VM_OBJECT_WUNLOCK(object);
 	if (error != 0)
Index: sys/vm/vnode_pager.h
===================================================================
--- sys/vm/vnode_pager.h	(revision 286413)
+++ sys/vm/vnode_pager.h	(working copy)
@@ -41,7 +41,7 @@
 #ifdef _KERNEL
 
 int vnode_pager_generic_getpages(struct vnode *vp, vm_page_t *m,
-    int count, int reqpage, vop_getpages_iodone_t iodone, void *arg);
+    int count, vop_getpages_iodone_t iodone, void *arg);
 int vnode_pager_generic_putpages(struct vnode *vp, vm_page_t *m,
 					  int count, boolean_t sync,
 					  int *rtvals);

--sLx0z+5FKKtIVDwd--

From owner-freebsd-arch@freebsd.org  Sat Aug  8 08:41:28 2015
Return-Path: <owner-freebsd-arch@freebsd.org>
Delivered-To: freebsd-arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id B88289B3396
 for <freebsd-arch@mailman.ysv.freebsd.org>;
 Sat,  8 Aug 2015 08:41:28 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org
 [IPv6:2001:1900:2254:206a::50:5])
 by mx1.freebsd.org (Postfix) with ESMTP id 9D0AB1997
 for <freebsd-arch@freebsd.org>; Sat,  8 Aug 2015 08:41:28 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: by mailman.ysv.freebsd.org (Postfix)
 id 9BF379B3395; Sat,  8 Aug 2015 08:41:28 +0000 (UTC)
Delivered-To: arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 81EED9B3394
 for <arch@mailman.ysv.freebsd.org>; Sat,  8 Aug 2015 08:41:28 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 29B7A1994;
 Sat,  8 Aug 2015 08:41:27 +0000 (UTC)
 (envelope-from kostikbel@gmail.com)
Received: from tom.home (kostik@localhost [127.0.0.1])
 by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id t788fLuN081072
 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO);
 Sat, 8 Aug 2015 11:41:22 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua t788fLuN081072
Received: (from kostik@localhost)
 by tom.home (8.15.2/8.15.2/Submit) id t788fLuD081071;
 Sat, 8 Aug 2015 11:41:21 +0300 (EEST)
 (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com
 using -f
Date: Sat, 8 Aug 2015 11:41:21 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Gleb Smirnoff <glebius@FreeBSD.org>
Cc: arch@FreeBSD.org, alc@freebsd.org
Subject: Re: more strict KPI for vm_pager_get_pages()
Message-ID: <20150808084121.GX2072@kib.kiev.ua>
References: <20150430142408.GS546@nginx.com> <20150807133844.GS889@FreeBSD.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150807133844.GS889@FreeBSD.org>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00,
 DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no
 autolearn_force=no version=3.4.1
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 08 Aug 2015 08:41:28 -0000

On Fri, Aug 07, 2015 at 04:38:44PM +0300, Gleb Smirnoff wrote:
>   Hi!
> 
>   This is followup on older email:
> 
> https://lists.freebsd.org/pipermail/freebsd-arch/2015-April/017154.html
> 
> The preparatory commits were already checked in, and I'm going
> to push next week the rest, since the whole story is already
> several months due.
> 
> Planned changes:
> 
> o vm_pager_get_pages() accepts array of pages, and treats all pages
>   equally. Notion of reqpage goes away.
> o The array span validity must be checked before with vm_pager_has_page().
> o All pages must be xbusied on enter.
> o All pages will be left xbusied on exit. This closes possible races,
>   allows to pass in wired pages (for any pager). And it leaves
>   the caller to decide what to do with pages: vm_page_active,
>   vm_page_deactivate, vm_page_flash or just vm_page_free them.
> 
> The patch has been tested by me and pho@ with his stress2 test.
> 
> I know, that there are two comments from kib@ on the patch.
These were not comments, but objections.

> 
> 1) There could be a user of KPI who would be fine with partial success.
> 
> My answer: right now there is none, and if one emerges, the code can
> be easily adopted to return VM_PAGER_ERROR, but still mark validated
> pages as valid. The user of KPI then can scan the array and take valid
> pages. So, the patch doesn't put any obstacles on appearance of such
> user.
The vm_fault.c is already fine with the partial success, it only cares
that the requested page was validated and no error from pager is
returned.

> 
> 2) Filesystems can do short reads by design, and thus fail to validate
>    the entire array.
> 
> My answer: yes, that's true. By design NFS, SMBFS and FUSE should be
> able to return short reads. However, the VOP_GETPAGES methods for all
> three FSes right now do not have any code that would support that. So,
> it looks like there is an open issue with these filesystems, not related
> to my patch. When this issue is addressed in any of aforementioned FSes,
> the VOP_GETPAGES should be fixed to do several I/Os in case of short
> reads.

And this is a bug in the networking filesystems (most likely).  Rick was
asked about NFS, but he did not responded.  You are proposing to make the
bug a part of the interface.

I object against this change. It is wrong philosophically, and it
encodes the incomplete or accidental behaviour of several filesystems at
the interface level.

In fact, you are taking some rather secondary feature of the current
interface, that the non-requested pages are made busy for the duration
of the paging request, to the level of the fundamental property (would
the non-mreq pages be not busied, proposed change immediately causes
applications segfault on parallel file truncation, or makes user data
corrupted, for example).

All this rototilling is because you do not want to code the proper FSA
in your reworked sendfile patch.

I object against the change and against the reasoning behind it.