From owner-freebsd-hackers  Thu Nov  8 14:41:58 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from raven.mail.pas.earthlink.net (raven.mail.pas.earthlink.net [207.217.120.39])
	by hub.freebsd.org (Postfix) with ESMTP id 879DD37B41B
	for <freebsd-hackers@freebsd.org>; Thu,  8 Nov 2001 14:41:49 -0800 (PST)
Received: from dialup-209.245.143.27.dial1.sanjose1.level3.net ([209.245.143.27] helo=mindspring.com)
	by raven.mail.pas.earthlink.net with esmtp (Exim 3.33 #1)
	id 161xri-0000RZ-00; Thu, 08 Nov 2001 14:41:43 -0800
Message-ID: <3BEB0A57.3C510C49@mindspring.com>
Date: Thu, 08 Nov 2001 14:42:31 -0800
From: Terry Lambert <tlambert2@mindspring.com>
Reply-To: tlambert2@mindspring.com
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD {Sony}  (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Jason Mawdsley <jason@macadamian.com>
Cc: mark tinguely <tinguely@web.cs.ndsu.nodak.edu>, bright@mu.org,
	freebsd-hackers@FreeBSD.ORG
Subject: Re: mmap/madvise
References: <200111081947.fA8JlAe03457@web.cs.ndsu.nodak.edu> <02ae01c16891$4c1f4970$2a64a8c0@macadamian.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-hackers.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-hackers>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-hackers>
X-Loop: FreeBSD.ORG

Jason Mawdsley wrote:
> 
> > Jason Mawdsley <jason@macadamian.com> asks:
> >
> > >  > I am looking for a way to reserve memory, without actually
> > >  > allocating the swap space.
> >
> > Alfred Perlstein answers:
> >
> > >  Just proceed normally, freebsd does overcommit such that you really
> > >  don't need to do anything special to get the results you desire.
> >
> > I assume Jason is writing a userland application, but I cannot tell
> > how he was using the allocated memory. Alfred is correct in that
> > allocated memory is not even physical until needed and only paged back
> > if modified AND space becomes low.
> >
> > Without information of what he was doing, I was trying to read between
> > the lines of his message and wonder if he needs the memory physically
> > there and wired (using mprotect) to prevent the memory from being
> > released.
> 
> I am creating a virtual memory manager.
> 
> Currently I am doing a
> mmap(...PROT_NONE, MAP_ANON ) to reserve the memory.
> then when committing the memory I am using mprotect( ...PROT_READ |
> PROT_WRITE )


Up front, given your company web pages, I suspect that you are
trying to port Windows code to Linux, but that the Linux community
has been less than helpful, so you are turning to us.  Hopefully,
if we are helpful, you will seriously consider supporting FreeBSD
as a platform for your product.

---

First, you aren't doing what you think you are doing, even under
Windows, since the function you are using doesn't really work
completely like you think it works.  However inefficient it is,
though, you can at least get work done with it, so whatever.

Here is the nasty function:

LPVOID VirtualAlloc(
  LPVOID lpAddress,        // region to reserve or commit
  SIZE_T dwSize,           // size of region
  DWORD flAllocationType,  // type of allocation
  DWORD flProtect          // type of access protection
);

---

I started to write a glue function that looked like VirtualAlloc,
but when I got about 3/4 of the way done, I realized that the
performance penalty would be large enough that it would be a
really (, really, really) bad idea.

Instead, you will want to take another look at the problem you are
attempting to solve, and come up with a better solution.  You
already have to look at the code that calls VirtualAlloc in your
product, so it is no more hairy to (mostly) do the right thing
than to write a glue function.

Since I don't know how you are _precisely_ using VirtualAlloc, I'm
going to cover the UNIX bases for you...

---

The manual pages you want to look at are:

	mmap		For reservation of memory; you should
			either mmap the fd for /dev/zero, or use
			MAP_ANON with an fd of -1, to grab pages
			initially.

	munmap		Needed for PAGE_NOACCESS; implicated in
			PAGE_GUARD and MEM_WRITE_WATCH (see below).

	msync		For MEM_COMMIT.  The Windows documentation
			is actually misleading, since a MEM_COMMIT
			of previously allocated memory does not
			result in it being zeroed, like it will be
			if you use MEM_COMMIT on a region that has
			not previously been flagged MEM_RESERVE or
			MEM_PHYSICAL.

			Note:	Unfortunately FreeBSD msync() will
				write more data than it needs to,
				since the dirty pages in mapped
				regions are not tracked to the
				granularity necessary to be able to
				write only them, without an expensive
				reverse page table lookup.  So this
				will generally be more expensive than
				it should be.

	madvise		Use to get MEM_RESET functionality.

	mprotect	Use to change the protections.  Changing the
			protections _DOES NOT COMMIT_ dirty blocks,
			as you appear to be assuming.

Other caveats:

	o	The round down will be to the next 4k, not the next
		64k boundary; FreeBSD doesn't have to deal with
		segments, like Windows 98/ME, so there is no real
		efficiency reason for doing the rounding, as in
		Windows.  If you depend on this, you will need to
		manually take care of this (note: dwSize will need
		to go up if you make the address go down, so be
		prepared).

	o	FreeBSD does not support AWE, so if you are trying to
		use this to get more than 4G of physical memory, you
		aren't going to be able to do it.

	o	The MEM_TOP_DOWN (NT/2000/XP specific) flag can only
		be simulated by putting logic into your allocation,
		such that a NULL address (system picks) is translated
		to a high non-NULL address, in order to force the
		behaviour.

	o	The MEM_WRITE_WATCH (98/ME specific) flag can only
		be simulated by mapping the memory read-only, taking
		the fault in a signal handler, remapping the region
		to permit the write, redoing the write, remapping the
		region read-only, and then logging the information
		into a page map, so that you can write your own
		GetWriteWatch and ResetWriteWatch functions to deal
		with the logged data.
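
For the first caveat, the manual rounding might look something
like the sketch below.  The helper name and the WIN_GRANULARITY
constant are invented for this example; it only shows the
arithmetic of moving the address down to a 64k boundary and
growing the size to match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WIN_GRANULARITY ((uintptr_t)0x10000)    /* 64k, as on Windows */

/* Hypothetical helper: round *addr down to a 64k boundary and grow *size
 * by the same amount, so the end of the requested range stays covered. */
static void round_to_64k(uintptr_t *addr, size_t *size)
{
    uintptr_t start = *addr & ~(WIN_GRANULARITY - 1);

    assert(*addr + *size >= *addr);     /* the range must not wrap around */
    *size += (size_t)(*addr - start);   /* dwSize goes up ... */
    *addr = start;                      /* ... because the address went down */
}
```

The adjusted values would then be what you pass on to mmap().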

It's also likely that, given the way Windows swap files work, you
would be a hell of a lot better off allocating disk space for the
memory regions, mmaping real files, instead of anonymous pages, and
then touching each of the blocks to ensure that the allocated region
had disk allocated to it as well.  Doing this will use file space
instead of swap space for the backing store, and give you better
control over the data, as well as persistence, should you need it
later.

In the case where you are trying to simulate the MEM_PHYSICAL case,
or the MEM_COMMIT allocation case, you will want to touch every new
page to make sure it will be committed by msync, so you will need
to make it "dirty" by writing the first byte (all newly allocated
pages are zero'ed, so you can just write a zero into the first byte
to do this).  Dirty pages containing data only need to be written
if they are truly dirty -- if the backing store does not contain
the same data as memory.  This is automatic, so all you should need
for them is to msync() the region.

If you don't dirty the pages, then the swap (or the backing files
you create, if you failed to touch each block when you created
them) will end up being sparse, and you may run into a situation
where the amount of available backing store is not as large as the
dirty region when it comes time to write dirty pages out.  If that
happens, then you will segfault when the system runs out of memory
(this is the overcommit that Alfred talked about).  If you want to
precommit all your memory (and it sounds like you do), then you
will want to dirty every page before someone else tries to allocate
memory, so that they run out instead of you running out.  Again,
the use of files where you dirty every block to ensure they are not
sparse, is the easiest way to go.
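
A minimal sketch of the file-backed approach, under a few
assumptions: the helper name is made up, the file is mapped
MAP_SHARED, and each page is dirtied by rewriting its own first
byte (so existing data survives) before an msync().  If the disk
fills up while the pages are being touched you will still take a
signal, so treat this as an outline rather than a finished
routine.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of the file at `path` (created if
 * needed), dirty every page, and push the pages out with msync() so the
 * file is not sparse.  Returns the mapping, or NULL on failure. */
static void *map_precommitted(const char *path, size_t len)
{
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *p;
    void *m;
    size_t off;
    int fd;

    assert(len > 0);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }
    m = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (m == MAP_FAILED)
        return NULL;

    /* Rewrite the first byte of each page with its own value: the page
     * becomes dirty without its contents changing. */
    p = m;
    for (off = 0; off < len; off += pagesize)
        p[off] = p[off];

    if (msync(m, len, MS_SYNC) != 0) {
        munmap(m, len);
        return NULL;
    }
    return m;
}
```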

As far as PAGE_EXECUTE, the Intel processors make no distinction
between executable and not readable, and executable and readable;
you can get this behaviour through some serious gymnastics, but the
last time I was in the Windows VM system, they didn't make this
distinction either.  So don't expect to be able to make executable
pages non-readable.

The PAGE_GUARD behaviour is to set up an unallocated page to
ensure that the page does not become valid, so that if you run
off the end of a region, it signals an exception instead of
merrily writing memory it ought not to.  UNIX systems do not
have the concept of a mapped guard page.  You could do this
with a read-only page, but that would only guard against writes,
not read attempts.  The best way to do this in UNIX is not to
map it.  So if you are doing a general routine for doing this,
you will want to explicitly track allocations -- _and_ any
non-allocations -- and then use a non-allocation to indicate
that subsequent calls will not result in an allocation.  Again,
you will then have to deal with the segmentation fault signal
handler to detect this (note: you will need to reset the handler
each time it happens!).  This will take the place of your
STATUS_GUARD_PAGE system exception; since guard pages are
one-shot in Windows, it implies that they are applied to mapped
regions.  This means that you will have to put something in
place of the page, or the next time the access occurs, if you
did not reset the signal handler, it will result in a fault.

For this to work, you actually need to do the fixup described
earlier (under MEM_WRITE_WATCH), and you will need to map a
real page in, and replace it, and restart the action that
caused the "guard" fault.  The main problem with this is
multiple consecutive write attempts in a region, with the
fault handler being reset each time.  Windows can't do this
quickly enough to trap all attempts, so it's probably OK to
have this on a page boundary.  Again, since the page will
have to come from backing store somewhere (or be written),
and you are attempting to avoid overcommit, you are best served
by using the file, and having the page doubly mapped, so that
it's in a hidden "guard" region, but not mapped to the area, so
that when you need to map it, it's actually there, and you
don't get a resource related error.

I recommend against PAGE_NOCACHE.  You can use madvise(2) to
get the same effect, but only on anonymous memory backed by
swap.  For file backed regions, you can't make a file sparse
once it is non-sparse: FreeBSD doesn't support the fcntl(2)
method of ftruncate(2), which would be capable of releasing
regions of disk blocks, and replacing the direct or indirect
block references with zero, making the pages zero'ed on read,
and non-sparse on write.  Linux, Solaris, and SVR4 support
this, so if you are porting to multiple platforms, you might
want to use files anyway, and just note that FreeBSD files
will never become sparse once non-sparse, and will continue to
consume chunks of disk space, as a result.


My personal recommendation is that you use this information
to redesign, in part or in whole, your memory management code,
such that it is more suited to efficient operation on UNIX
platforms.  If all you need is to get it running, however,
then this should be enough information for you to get there.


Too bad there isn't an "Advanced UNIX Programming for Windows
Programmers" book.  8-(.

Good luck with your project...

-- Terry
