From owner-freebsd-arch  Tue Apr 10 19:35:15 2001
Delivered-To: freebsd-arch@freebsd.org
Received: from earth.backplane.com (earth-nat-cw.backplane.com [208.161.114.67])
	by hub.freebsd.org (Postfix) with ESMTP id 4363937B422
	for <freebsd-arch@FreeBSD.ORG>; Tue, 10 Apr 2001 19:35:00 -0700 (PDT)
	(envelope-from dillon@earth.backplane.com)
Received: (from dillon@localhost)
	by earth.backplane.com (8.11.2/8.11.2) id f3B2Ysj97756;
	Tue, 10 Apr 2001 19:34:54 -0700 (PDT)
	(envelope-from dillon)
Date: Tue, 10 Apr 2001 19:34:54 -0700 (PDT)
From: Matt Dillon <dillon@earth.backplane.com>
Message-Id: <200104110234.f3B2Ysj97756@earth.backplane.com>
To: Peter Jeremy <peter.jeremy@alcatel.com.au>
Cc: freebsd-arch@FreeBSD.ORG
Subject: Re: mmap(2) vs read(2)/write(2)
References:  <20010411095233.P66243@gsmx07.alcatel.com.au>
Sender: owner-freebsd-arch@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


:It is my understanding that it is more efficient to access a file
:via mmap rather than read/write, because the former needs one less
:memory-memory copy.

    Yes and No.  If the file is already in the cache, then mmap()
    is much faster because the program doesn't have to take any
    VM faults to access the data.  But if the file is not in the
    cache the program winds up taking VM faults to map the pages
    in, and this is expensive enough that it goes a long ways towards
    making up for the copy overhead you would get with read().

    This can be demonstrated with a test program.  Create two test files,
    test1 and test2, using dd.  One should be much larger than main memory,
    the other should be about 1/4 of main memory.  Run the program a couple
    of times before recording the results so that prior runs and cache
    state do not interfere with the test results.  Here is an example:

	# this assumes you have around 128M of ram
	dd if=/dev/zero of=test1 bs=1m count=1024	# create 1G file
	dd if=/dev/zero of=test2 bs=1m count=32		# create 32M file


% ./rf -f test1
% ./rf -f test1
% ./rf -f test1
cksum 0 read 1073741824 bytes in 41.311 seconds, 24.788 MB/sec cpu 15.167 sec
% ./rf -m test1
% ./rf -m test1
% ./rf -m test1
cksum 0 read 1073741824 bytes in 48.371 seconds, 21.170 MB/sec cpu 12.130 sec

% ./rf -f test2
% ./rf -f test2
% ./rf -f test2
cksum 0 read 33554432 bytes in 0.367 seconds, 87.295 MB/sec cpu 0.368 sec
% ./rf -m test2
% ./rf -m test2
% ./rf -m test2
cksum 0 read 33554432 bytes in 0.271 seconds, 117.958 MB/sec cpu 0.273 sec


    For the big file, mmap() has lower performance (21.1MB/sec versus
    24.7MB/sec) but actually eats fewer cpu cycles.  In this case it
    is obvious that read() has a higher copy overhead, but the overhead
    is not interfering with the transfer rate.  mmap()'s VM fault overhead,
    on the other hand, is interfering with the transfer rate.  It might be
    possible for me to fix this -- it has to do with the way VM fault does
    lookahead reads (it doesn't start the next lookahead read until it gets
    halfway through the previous lookahead read).  But the gist is that
    if the data is not in the cache, read() could very well be faster than
    mmap().

    For the small file, mmap() wins hands down (118MB/sec versus 87MB/sec)
    and takes less cpu as well (0.273 versus 0.368 seconds).

    If you comment out the madvise() for the small-file tests, performance
    goes down to around 114MB/sec in my test -- the cost of taking 2788
    VM faults in the cached case.  Still better than read().

    So what is the final answer?  mmap() will be significantly faster
    for small cached files but the benefits are minimal or even
    possibly detrimental when used on large uncached files. 

    You also have to consider the effect on the process's VM space.  If
    a program is depending on there being 3G of mmapable space in its
    address space and you start mmap()ing files for stdio functions, and
    the program happens to also use a lot of stdio (fopen() and such),
    you could very well be polluting the mmapable space so much that
    the program fails.

    If we were to implement mmap() for stdio, it would have to be done very,
    very carefully to avoid unwanted side effects.  I remember NeXT using
    mmap() for stdio, and I also remember hitting up against all sorts of
    weird side effects that caused me to want to tear my hair out.  Ultimately
    I think the best solution is to add a setvbuf() mode #define to set
    a 'use mmap' mode, e.g. something like _IOMBF, and not have it do it by
    default.  Then programs using the feature could be made portable with
    a simple #ifdef _IOMBF around the setvbuf call.

						-Matt


/*
 * readfile [-f/-m] filename
 *
 * cc -O2 readfile.c -o rf
 */
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int
main(int ac, char **av)
{
    int i;
    int memMode = -1;
    int fd;
    int cksum = 0;
    const char *path = NULL;
    struct timeval tv1;
    struct timeval tv2;
    struct stat st;
    struct rusage ru;

    for (i = 1; i < ac; ++i) {
	char *ptr = av[i];
	if (*ptr != '-') {
	    path = av[i];
	    continue;
	}
	switch(ptr[1]) {
	case 'f':
	    memMode = 0;
	    break;
	case 'm':
	    memMode = 1;
	    break;
	default:
	    fprintf(stderr, "Bad option: %s\n", ptr);
	    exit(1);
	}
    }
    if (memMode < 0) {
	fprintf(stderr, "Specify mode -f or -m\n");
	exit(1);
    }
    if (path == NULL) {
	fprintf(stderr, "Specify file to read\n");
	exit(1);
    }
    if (stat(path, &st) < 0 || !S_ISREG(st.st_mode)) {
	fprintf(stderr, "bad filespec: %s\n", path);
	exit(1);
    }
    if ((fd = open(path, O_RDONLY)) < 0) {
	perror("open");
	exit(1);
    }

    gettimeofday(&tv1, NULL);
    if (memMode) {
	/* map the whole file read-only; it is summed as an array of ints */
	int *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	int n = st.st_size / sizeof(int);

	if (base == MAP_FAILED) {
	    fprintf(stderr, "unable to mmap file\n");
	    exit(1);
	}
	madvise(base, st.st_size, MADV_WILLNEED);
	for (i = 0; i < n; ++i)
	    cksum += base[i];
    } else {
	int buf[32768 / sizeof(int)];	/* int-typed so both modes sum the same data */
	int n;
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
	    n = n / sizeof(int);
	    for (i = 0; i < n; ++i)
		cksum += buf[i];
	}
    }
    gettimeofday(&tv2, NULL);
    getrusage(RUSAGE_SELF, &ru);

    {
	double usec = (tv2.tv_usec + 1000000 - tv1.tv_usec) +
			(tv2.tv_sec - tv1.tv_sec - 1) * 1000000.0;
	printf("cksum %d read %lld bytes in %4.3f seconds, %4.3f MB/sec cpu %4.3f sec\n",
	    cksum,	/* so compiler does not optimize it out */
	    (long long)st.st_size,
	    usec / 1000000.0,
	    (double)st.st_size / (usec * 1024.0 * 1024.0 / 1000000.0),
	    ((ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) +
	    (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1.0E6) / 1.0E6
	);
    }
    return(0);
}
