From: John Baldwin <jhb@freebsd.org>
To: Laurie Jennings
Cc: freebsd-net@freebsd.org
Subject: Re: Locking Memory Question
Date: Wed, 29 Jul 2015 19:03:31 -0700
Message-ID: <179784785.yfa5UNM2qp@ralph.baldwin.cx>
In-Reply-To: <1438208806.66724.YahooMailBasic@web141505.mail.bf1.yahoo.com>
List-Id: Networking and TCP/IP with FreeBSD

On Wednesday, July 29, 2015 03:26:46 PM Laurie Jennings wrote:
> I have a problem and I can't quite figure out where to look.  This is
> what I'm doing:
>
> I have an IOCTL to read a block of data, but the data is too large to
> return via ioctl.
> So to get the data, I allocate a block in a kernel module:
>
>     foo = malloc(1024000, M_DEVBUF, M_WAITOK);
>
> I pass up a pointer and in user space map it using /dev/kmem:
>
>     fd = open("/dev/kmem", O_RDWR);
>     if (fd > 0) {
>         memp = mmap(0, 1024000, PROT_READ, MAP_SHARED, fd, p);
>
> and then grab the data from memp (it's a stringified object).
>
> The problem is that sometimes it's garbage.  95% of the time it works
> fine.  I figured that the memory wasn't wired, and I've been trying to
> wire it, but not having much success.  kmem_alloc() and kmem_malloc()
> panic in vm_map_lock, so I'm guessing that you can't do this in an
> IOCTL call?
>
> So my questions:
>
> 1) Shouldn't kmem mapped memory be wired?  How else could you reliably
> read kernel memory?

The memory from malloc() is wired, so this should be fine, even if a bit
hackish in requiring access to /dev/kmem.

However, mmap for character devices is a bit special.  In particular,
character devices cache mappings established via d_mmap forever.  This is
particularly bad for /dev/kmem.  Specifically, once some process mmap's
offset X in /dev/kmem, the associated physical address P for X is cached
forever and will not be updated, even if the kernel memory at address X is
freed and later reallocated for something else using a different physical
page.

So if your driver created one of these objects at boot and never freed it,
then /dev/kmem mappings will probably work fine.  However, if your driver
creates and frees these buffers and then creates another one later, and
the second object reuses the same kernel addresses (or some of them) but
with different backing pages, your second userland mapping will end up
using the physical pages from the first mapping.  /dev/kmem isn't really
well suited to mmap() for this reason.

> 5) What does MAP_PREFAULT_READ do and would it solve this problem?

It will not help with this.
All that does is pre-create PTEs in the user process for any physical
pages that are already in RAM, though I'm not sure it actually does
anything for character device mappings (e.g. via /dev/kmem).

There are a few ways to do what you want that should work more reliably.
In general they all consist of creating a VM object to describe the
buffer you care about.

One option (used by the nvidia driver, and one I've used in other
drivers) is to allocate wired memory in the kernel, either via
contigmalloc() or malloc(), and then create an sglist that describes the
buffer.  You can then create an OBJT_SG VM object that is "backed" by the
sglist and allow userland to map this object via a d_mmap_single()
callback on a character device.  For example:

/* Error handling elided for simplicity. */

struct foo_userbuf {
	void *mem;
	vm_object_t obj;
};

int
foo_create_userbuf(struct foo_userbuf *ub, size_t len)
{
	struct sglist *sg;

	/* M_ZERO to not leak anything to userland. */
	ub->mem = malloc(len, M_DEVBUF, M_WAITOK | M_ZERO);
	sg = sglist_build(ub->mem, len, M_WAITOK);
	ub->obj = vm_pager_allocate(OBJT_SG, sg, len,
	    VM_PROT_READ | VM_PROT_WRITE, 0);
	/* ub->obj now "owns" the sglist via an internal reference. */
}

int
foo_destroy_userbuf(struct foo_userbuf *ub)
{
	/*
	 * Note well: this does _not_ block waiting for other
	 * references to be dropped, etc.
	 */
	vm_object_deallocate(ub->obj);

	/*
	 * Specifically, this next step is only safe if you
	 * _know_ that there are no other mappings, which
	 * might be quite hard to do.
	 */
	free(ub->mem, M_DEVBUF);
}

int
foo_ioctl(....)
{

	switch (cmd) {
	case GIVE_ME_A_BUFFER:
		....
		foo_create_userbuf(ub, len);
		/*
		 * Return some sort of "key" identifying "ub" to
		 * userland.
		 */
	}
}

int
foo_mmap_single(struct cdev *dev, vm_ooffset_t *offset, vm_size_t size,
    vm_object_t *object, int nprot)
{

	/*
	 * You will need some sort of way to identify different
	 * buffers if you use more than one.  For example, you
	 * might use the offset passed to mmap as the identifier.
	 * Keep in mind that the address passed to this routine
	 * is page aligned, so you cannot "see" any low bits in
	 * the address and can't use those as part of your key.
	 */
	ub = foo_lookup_userbuf(*offset);

	/*
	 * Clear the offset to zero as it will now be relative to
	 * the object we are returning.
	 */
	*offset = 0;
	vm_object_reference(ub->obj);
	*object = ub->obj;
}

A second option is to instead have userland allocate a VM object via
shm_open() and then map that into the kernel in your driver.  There are
helper routines in uipc_shm.c to facilitate this that I've used in some
out-of-tree code before.  Something like this:

Header:

struct foo_mapbuf {
	int fd;
	size_t len;
};

User code:

	struct foo_mapbuf mb;
	int devfd;

	devfd = open("/dev/foo", ....);
	ioctl(devfd, BUFFER_SIZE, &mb.len);
	mb.fd = shm_open(SHM_ANON, O_RDWR, 0600);
	ftruncate(mb.fd, mb.len);
	ioctl(devfd, MAP_BUFFER, &mb);
	p = mmap(..., mb.fd, ...);

Driver code:

struct foo_userbuf {
	struct file *fp;
	void *mem;
	size_t size;
};

int
foo_ioctl(...)
{
	struct foo_mapbuf *mb;
	struct foo_userbuf *ub;

	switch (cmd) {
	case BUFFER_SIZE:
		/* Return the desired size. */
	case MAP_BUFFER:
		mb = (struct foo_mapbuf *)data;
		ub = ...;
		/* fget() takes a few more parameters you'll need to work out. */
		ub->fp = fget(mb->fd);
		ub->size = mb->len;
		/* This assumes a starting offset of 0. */
		shm_map(ub->fp, ub->size, 0, &ub->mem);
		/*
		 * Can now access the buffer in the kernel via the
		 * 'ub->mem' pointer, and its pages are wired until
		 * released by a call to shm_unmap().
		 */
	}
}

You will want some way to handle unmapping the buffer, either via a
devfs_priv dtor method, d_close, or something else, to avoid leaking the
kernel mappings.  This approach has the advantage over the first that it
keeps the pages around until all mappings are gone, though once all the
kernel mappings are gone the pages will no longer be wired (they will be
swap-backed instead).

-- 
John Baldwin