Date: Thu, 08 Nov 2001 14:42:31 -0800
From: Terry Lambert
Reply-To: tlambert2@mindspring.com
To: Jason Mawdsley
Cc: mark tinguely, bright@mu.org, freebsd-hackers@FreeBSD.ORG
Subject: Re: mmap/madvise

Jason Mawdsley wrote:
> > > Jason Mawdsley asks:
> >
> > > > I am looking for a way to reserve memory, without actually
> > > > allocating the swap space.
> >
> > Alfred Perlstein answers:
> >
> > > Just proceed normally, freebsd does overcommit such that you really
> > > don't need to do anything special to get the results you desire.
> >
> > I assume Jason is writing a userland application, but I cannot tell
> > how he was using the allocated memory. Alfred is correct in that
> > allocated memory is not even physical until needed and only paged back
> > if modified AND space becomes low.
> >
> > Without information of what he was doing, I was trying to read between
> > the lines of his message and wonder if he needs the memory physically
> > there and wired (using mprotect) to prevent the memory from being
> > released.
>
> I am creating a virtual memory manager.
>
> Currently I am doing a
> mmap(...PROT_NONE, MAP_ANON ) to reserve the memory.
> then when committing the memory I am using mprotect( ...PROT_READ |
> PROT_WRITE )

Up front, given your company web pages, I suspect that you are
trying to port Windows code to Linux, but that the Linux community
has been less than helpful, so you are turning to us. Hopefully,
if we are helpful, you will seriously consider supporting FreeBSD
as a platform for your product.

---

First, you aren't doing what you think you are doing, even under
Windows, since the function you are using doesn't really work
completely like you think it works. However inefficient it is,
though, you can at least get work done with it, so whatever.

Here is the nasty function:

    LPVOID VirtualAlloc(
        LPVOID lpAddress,        // region to reserve or commit
        SIZE_T dwSize,           // size of region
        DWORD flAllocationType,  // type of allocation
        DWORD flProtect          // type of access protection
    );

---

I started to write a glue function that looked like VirtualAlloc,
but when I got about 3/4 of the way done, I realized that the
performance penalty would be large enough that it would be a
really (, really, really) bad idea.

Instead, you will want to relook at the problem you are attempting
to solve a bit, and come up with a better solution. Right now, you
are having to look at the code that calls VirtualAlloc in your
product, so you are on the right track for doing that, so it is no
more hairy to (mostly) do the right thing than to write a glue
function.

Since I don't know how you are _precisely_ using VirtualAlloc, I'm
going to cover the UNIX bases for you...
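
For reference, here is a minimal sketch in C of the reserve-then-commit
sequence quoted above (mmap with PROT_NONE, then mprotect to
PROT_READ | PROT_WRITE). The names vm_reserve, vm_commit and vm_release
are hypothetical, and MAP_PRIVATE and the fd of -1 are assumptions that
the quoted call does not show. Keep in mind that the protection change
only makes the pages accessible; it does not by itself commit any
backing store.

```c
#define _DEFAULT_SOURCE /* glibc: expose MAP_ANON; harmless elsewhere */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve len bytes of address space with no access rights.
 * (Hypothetical helper; returns NULL on failure.) */
static void *vm_reserve(size_t len)
{
    void *p = mmap(NULL, len, PROT_NONE, MAP_ANON | MAP_PRIVATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Make [addr, addr + len) readable and writable.
 * addr must be page aligned. Returns 0 on success, -1 on error. */
static int vm_commit(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert((uintptr_t)addr % page == 0);
    return mprotect(addr, len, PROT_READ | PROT_WRITE);
}

/* Give the whole region back to the system. */
static int vm_release(void *addr, size_t len)
{
    return munmap(addr, len);
}
```

A caller would reserve a large range once with vm_reserve, then call
vm_commit on page-aligned sub-ranges as they are needed.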
---

The manual pages you want to look at are:

mmap        For reservation of memory; you should mmap the fd for
            /dev/zero, or use MAP_ANON (with an fd of -1), to grab
            pages initially.

munmap      Needed for PAGE_NOACCESS; implicated in PAGE_GUARD
            and MEM_WRITE_WATCH (see below).

msync       For MEM_COMMIT. The Windows documentation is actually
            misleading, since a MEM_COMMIT of previously allocated
            memory does not result in it being zeroed, like it will
            be if you use MEM_COMMIT on a region that has not been
            previously MEM_RESERVE or MEM_PHYSICAL flagged.

            Note: Unfortunately FreeBSD msync() will write more
            data than it needs to, since the dirty pages in mapped
            regions are not tracked to the granularity necessary to
            be able to write only them, without an expensive
            reverse page table lookup. So this will generally be
            more expensive than it should be.

madvise     Use to get MEM_RESET functionality.

mprotect    Used to change the protections. Changing the
            protections _DOES NOT COMMIT_ dirty blocks, as you
            appear to be assuming.

Other caveats:

o   The round down will be to the next 4k, not the next 64k
    boundary; FreeBSD doesn't have to deal with segments, like
    Windows 98/ME, so there is no real efficiency reason for
    doing the rounding, as in Windows. If you depend on this,
    you will need to manually take care of this (note: dwSize
    will need to go up if you make the address go down, so be
    prepared).

o   FreeBSD does not support AWE, so if you are trying to use
    this to get more than 4G of physical memory, you aren't going
    to be able to do it.

o   The MEM_TOP_DOWN (NT/2000/XP specific) flag can only be
    simulated by putting logic into your allocation, such that
    a NULL address (system picks) is translated to a high
    non-NULL address, in order to force the behaviour.

o   The MEM_WRITE_WATCH (98/ME specific) flag can only be
    simulated by mapping the memory read-only, taking the fault
    in a signal handler, remapping the region to permit the
    write, redoing the write, remapping the region read-only,
    and then logging the information into a page map, so that
    you can write your own GetWriteWatch and ResetWriteWatch
    functions to deal with the logged data.

It's also likely that, given the way Windows swap files work, you
would be a hell of a lot better off allocating disk space for the
memory regions, mmaping real files, instead of anonymous pages,
and then touching each of the blocks to ensure that the allocated
region had disk allocated to it as well. Doing this will use file
space instead of swap space for the backing store, and give you
better control over the data, as well as persistence, should you
need it later.

In the case where you are trying to simulate the MEM_PHYSICAL
case, or the MEM_COMMIT allocation case, you will want to touch
every new page to make sure it will be committed by msync, so you
will need to make it "dirty" by writing the first byte (all newly
allocated pages are zero'ed, so you can just write a zero into the
first byte to do this). Dirty pages containing data only need to
be written if they are truly dirty -- if the backing store does
not contain the same data as memory. This is automatic, so all
you should need for them is to msync() the region.

If you don't dirty the pages, then the swap (or the backing files
you create, if you failed to touch each block when you created
them) will end up being sparse, and you may run into a situation
where the amount of available backing store is not as large as the
dirty region when it comes time to write dirty pages out. If that
happens, then you will segfault when the system runs out of memory
(this is the overcommit that Alfred talked about).
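
As a rough illustration of the file-backed approach just described,
here is a sketch in C that creates a backing file, maps it shared,
writes a zero into the first byte of every page to dirty it, and then
calls msync(). The helper name map_backed and the choice to create or
truncate the file are assumptions made for the example; whether the
disk blocks really get allocated at msync() time depends on the
filesystem.

```c
#define _DEFAULT_SOURCE /* glibc: expose ftruncate under strict modes */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map len bytes backed by a freshly created file at path, dirty
 * every page, and flush. Hypothetical helper; returns NULL on
 * failure. Any existing file contents are discarded (O_TRUNC).
 */
static void *map_backed(const char *path, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert(len > 0 && len % page == 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) != 0) {   /* file is sparse here */
        close(fd);
        return NULL;
    }

    char *p = (char *)mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
    close(fd);              /* the mapping stays valid after close */
    if (p == (char *)MAP_FAILED)
        return NULL;

    /* New pages read as zero, so writing a zero into the first
     * byte dirties each page without changing its contents. */
    for (size_t off = 0; off < len; off += page)
        ((volatile char *)p)[off] = 0;

    /* Push the dirty pages out to the backing file. */
    if (msync(p, len, MS_SYNC) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```

Doing the touch-and-sync step at allocation time means a shortage of
disk space shows up then, rather than later when dirty pages are
written out.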

If you want to precommit all your memory (and it sounds like you
do), then you will want to dirty every page before someone else
tries to allocate memory, so that they run out instead of you
running out. Again, the use of files where you dirty every block
to ensure they are not sparse, is the easiest way to go.

As far as PAGE_EXECUTE, the Intel processors make no distinction
between executable and not readable and executable and readable;
you can get this behaviour through some serious gymnastics, but
the last time I was in the Windows VM system, they didn't make
this distinction either. So don't expect to be able to make
executable pages non-readable.

The PAGE_GUARD behaviour is to set up an unallocated page to
ensure that the page does not become valid, so that if you run off
the end of a region, it signals an exception instead of merrily
writing memory it ought not to. UNIX systems do not have the
concept of a mapped guard page. You could do this with a
read-only page, but that would only guard against writes, not read
attempts. The best way to do this in UNIX is not to map it. So
if you are doing a general routine for doing this, you will want
to explicitly track allocations -- _and_ any non-allocations --
and then use a non-allocation to indicate that subsequent calls
will not result in an allocation.

Again, you will then have to deal with the segmentation fault
signal handler to detect this (note: you will need to reset the
handler each time it happens!). This will take the place of your
STATUS_GUARD_PAGE system exception; since guard pages are one-shot
in Windows, it implies that they are applied to mapped regions.
This means that you will have to put something in place of the
page, or the next time the access occurs, if you did not reset the
signal handler, it will result in a fault.
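
To make the one-shot guard idea concrete, here is a small sketch in C.
It is a variation, not exactly the scheme above: instead of leaving the
guard page unmapped, it keeps the page mapped with PROT_NONE (which
traps reads as well as writes, and keeps the address from being handed
out to another mapping), and the fault handler opens the page up on the
first hit. The names alloc_with_guard, on_fault and guard_hits are
hypothetical, only a single guard page is tracked, and calling
mprotect() from a signal handler is an assumption that works on common
systems but is not promised by POSIX.

```c
#define _DEFAULT_SOURCE /* glibc: expose MAP_ANON; harmless elsewhere */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static char *guard_page;                 /* start of the guard page */
static size_t guard_len;                 /* its length (one page)   */
static volatile sig_atomic_t guard_hits; /* guard faults seen       */

/* Fault handler: a hit inside the guard page is counted and the
 * page is made accessible, so the faulting access is restarted
 * and succeeds (one-shot, like STATUS_GUARD_PAGE). */
static void on_fault(int sig, siginfo_t *si, void *ctx)
{
    char *addr = (char *)si->si_addr;
    (void)ctx;
    if (guard_page != NULL && addr >= guard_page &&
        addr < guard_page + guard_len) {
        guard_hits++;
        if (mprotect(guard_page, guard_len,
                     PROT_READ | PROT_WRITE) == 0)
            return;
    }
    /* Not our page: restore the default action; the access
     * faults again on return and terminates the process. */
    signal(sig, SIG_DFL);
}

/* Map len usable bytes followed by one inaccessible guard page. */
static void *alloc_with_guard(size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    struct sigaction sa;
    char *base;

    assert(len > 0 && len % page == 0);
    base = (char *)mmap(NULL, len + page, PROT_READ | PROT_WRITE,
                        MAP_ANON | MAP_PRIVATE, -1, 0);
    if (base == (char *)MAP_FAILED)
        return NULL;
    if (mprotect(base + len, page, PROT_NONE) != 0) {
        munmap(base, len + page);
        return NULL;
    }
    guard_page = base + len;
    guard_len = page;

    sa.sa_sigaction = on_fault;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL); /* some systems raise SIGBUS */
    return base;
}
```

A handler installed with sigaction() and without SA_RESETHAND stays
installed after it runs, so the re-arming step only matters for
signal()-style handlers that revert to the default action.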

For this to work, you actually need to do the fixup described
earlier (under MEM_WRITE_WATCH), and you will need to map a real
page in, and replace it, and restart the action that caused the
"guard" fault.

The main problem with this is multiple consecutive write attempts
in a region, with the fault handler being reset each time.
Windows can't do this quickly enough to trap all attempts, so it's
probably OK to have this on a page boundary. Again, since the
page will have to come from backing store somewhere (or be
written), and you are attempting to avoid overcommit, you are best
served by using the file, and having the page doubly mapped, so
that it's in a hidden "guard" region, but not mapped to the area,
so that when you need to map it, it's actually there, and you
don't get a resource related error.

I recommend against PAGE_NOCACHE. You can use madvise(2) to get
the same effect, but only on anonymous memory backed by swap. For
file backed regions, you can't make a file sparse once it is
non-sparse: FreeBSD doesn't support the fcntl(2) method of
ftruncate(2), which would be capable of releasing regions of disk
blocks, and replacing the direct or indirect block references with
zero, making them zero'ed pages on read, and non-sparse on write.
Linux, Solaris, and SVR4 support this, so if you are porting to
multiple platforms, you might want to use files anyway, and just
note that FreeBSD files will never become sparse once non-sparse,
and will continue to consume chunks of disk space, as a result.

My personal recommendation is that you use this information to
redesign, in part or in whole, your memory management code, such
that it is more suited to efficient operation on UNIX platforms.
If all you need is to get it running, however, then this should be
enough information for you to get there.

Too bad there isn't an "Advanced UNIX programming for Windows
Programmers" book. 8-(.

Good luck with your project...

-- Terry