From: John Baldwin <jhb@freebsd.org>
To: Laurie Jennings
Cc: freebsd-net@freebsd.org
Subject: Re: Locking Memory Question
Date: Wed, 29 Jul 2015 19:03:31 -0700
Message-ID: <179784785.yfa5UNM2qp@ralph.baldwin.cx>
In-Reply-To: <1438208806.66724.YahooMailBasic@web141505.mail.bf1.yahoo.com>
List-Id: Networking and TCP/IP with FreeBSD

On Wednesday, July 29, 2015 03:26:46 PM Laurie Jennings wrote:
> I have a problem and I can't quite figure out where to look.  This is
> what I'm doing:
>
> I have an IOCTL to read a block of data, but the data is too large to
> return via ioctl.
> So to get the data, I allocate a block in a kernel module:
>
>     foo = malloc(1024000, M_DEVBUF, M_WAITOK);
>
> I pass up a pointer and in user space map it using /dev/kmem:
>
>     fd = open("/dev/kmem", O_RDWR);
>     if (fd > 0) {
>         memp = mmap(0, 1024000, PROT_READ, MAP_SHARED, fd, p);
>
> and then grab the data from memp (it's a stringified object).
>
> The problem is that sometimes it's garbage.  95% of the time it works
> fine.  I figured that the memory wasn't wired, and I've been trying to
> wire it, but not having much success.  kmem_alloc() and kmem_malloc()
> panic in vm_map_lock, so I'm guessing that you can't do this in an
> IOCTL call?
>
> So my questions:
>
> 1) Shouldn't kmem mapped memory be wired?  How else could you reliably
> read kernel memory?

The memory from malloc() is wired, so this should be fine, even if a bit
hackish in requiring access to /dev/kmem.

However, mmap for character devices is a bit special.  In particular,
character devices cache mappings established via d_mmap forever.  This is
particularly bad for /dev/kmem.  Specifically, once some process mmap's
offset X in /dev/kmem, the associated physical address P for X is cached
forever and will not be updated, even if the kernel memory at address X is
freed and later reallocated for something else using a different physical
page.

So if your driver created one of these objects at boot and never freed it,
then /dev/kmem mappings will probably work fine.  However, if your driver
creates and frees these buffers and then creates another one later, and
the second object reuses the same kernel addresses (or some of them) but
with different backing pages, your second userland mapping will end up
using the physical pages from the first mapping.  /dev/kmem isn't really
well suited to mmap() for this reason.

> 5) What does MAP_PREFAULT_READ do and would it solve this problem?

It will not help with this.
All that does is pre-create PTEs in the user process for any physical
pages that are already in RAM, though I'm not sure it actually does
anything for character device mappings (e.g. via /dev/kmem).

There are a few ways to do what you want that should work more reliably.
In general they all consist of creating a VM object to describe the
buffer you care about.

One option (used by the nvidia driver, and one I've used in other
drivers) is to allocate wired memory in the kernel, either via
contigmalloc() or malloc(), and then create an sglist that describes the
buffer.  You can then create an OBJT_SG VM object that is "backed" by the
sglist and allow userland to map this object via a d_mmap_single()
callback on a character device.  For example:

/* Error handling elided for simplicity. */

struct foo_userbuf {
	void *mem;
	vm_object_t obj;
};

int
foo_create_userbuf(struct foo_userbuf *ub, size_t len)
{
	struct sglist *sg;

	/* M_ZERO to not leak anything to userland. */
	ub->mem = malloc(len, M_DEVBUF, M_WAITOK | M_ZERO);
	sg = sglist_build(ub->mem, len, M_WAITOK);
	ub->obj = vm_pager_allocate(OBJT_SG, sg, len,
	    VM_PROT_READ | VM_PROT_WRITE, 0);
	/* ub->obj now "owns" the sglist via an internal reference. */
}

int
foo_destroy_userbuf(struct foo_userbuf *ub)
{
	/*
	 * Note well: this does _not_ block waiting for other
	 * references to be dropped, etc.
	 */
	vm_object_deallocate(ub->obj);

	/*
	 * Specifically, this next step is only safe if you
	 * _know_ that there are no other mappings, which
	 * might be quite hard to do.
	 */
	free(ub->mem, M_DEVBUF);
}

int
foo_ioctl(....)
{

	switch (cmd) {
	case GIVE_ME_A_BUFFER:
		....
		foo_create_userbuf(ub, len);
		/*
		 * Return some sort of "key" identifying "ub" to
		 * userland.
		 */
	}
}

int
foo_mmap_single(struct cdev *dev, vm_ooffset_t *offset, vm_size_t size,
    vm_object_t *object, int nprot)
{

	/*
	 * You will need some sort of way to identify different
	 * buffers if you use more than one.  For example, you
	 * might use the offset passed to mmap as the identifier.
	 * Keep in mind that the address passed to this routine
	 * is page aligned, so you cannot "see" any low bits in
	 * the address and can't use those as part of your key.
	 */
	ub = foo_lookup_userbuf(*offset);

	/*
	 * Clear the offset to zero as it will now be relative to
	 * the object we are returning.
	 */
	*offset = 0;
	vm_object_reference(ub->obj);
	*object = ub->obj;
}

A second option is to instead have userland allocate a VM object via
shm_open() and then map that into the kernel in your driver.  There are
helper routines in uipc_shm.c to facilitate this that I've used in some
out-of-tree code before.  Something like this:

Header:

struct foo_mapbuf {
	int fd;
	size_t len;
};

User code:

	struct foo_mapbuf mb;
	int devfd;

	devfd = open("/dev/foo", ....);
	ioctl(devfd, BUFFER_SIZE, &mb.len);
	mb.fd = shm_open(SHM_ANON, O_RDWR, 0600);
	ftruncate(mb.fd, mb.len);
	ioctl(devfd, MAP_BUFFER, &mb);
	p = mmap(..., mb.fd, ...);

Driver code:

struct foo_userbuf {
	struct file *fp;
	void *mem;
	size_t size;
};

int
foo_ioctl(...)
{
	struct foo_mapbuf *mb;
	struct foo_userbuf *ub;

	switch (cmd) {
	case BUFFER_SIZE:
		/* Return the desired size. */
	case MAP_BUFFER:
		mb = (struct foo_mapbuf *)data;
		ub = ...;
		/* fget() takes a few more parameters you'll need to work out. */
		ub->fp = fget(mb->fd);
		ub->size = mb->len;
		/* This assumes a starting offset of 0. */
		shm_map(ub->fp, ub->size, 0, &ub->mem);
		/*
		 * Can now access the buffer in the kernel via the
		 * 'ub->mem' pointer, and its pages are wired until
		 * released by a call to shm_unmap().
		 */
	}
}

You will want some way to handle unmapping the buffer, either via a
devfs_priv dtor method, d_close, or something else, to avoid leaking the
kernel mappings.  This approach has the advantage over the first that it
keeps the pages around until all mappings are gone, though once all the
kernel mappings are gone the pages will no longer be wired (they will be
swap-backed instead).

-- 
John Baldwin