From owner-freebsd-hackers Tue Sep 24 01:10:45 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id BAA11864 for hackers-outgoing; Tue, 24 Sep 1996 01:10:45 -0700 (PDT)
Received: from labinfo.iet.unipi.it (labinfo.iet.unipi.it [131.114.9.5]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id BAA11829; Tue, 24 Sep 1996 01:10:38 -0700 (PDT)
Received: from localhost (luigi@localhost) by labinfo.iet.unipi.it (8.6.5/8.6.5) id JAA00165; Tue, 24 Sep 1996 09:38:24 +0200
From: Luigi Rizzo
Message-Id: <199609240738.JAA00165@labinfo.iet.unipi.it>
Subject: Optimizing bzero()
To: hackers@freebsd.org, bde@freebsd.org, asami@freebsd.org
Date: Tue, 24 Sep 1996 09:38:24 +0200 (MET DST)
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

During some discussion on his great phkmalloc(), Poul pointed out to me
the existence of madvise(..., MADV_FREE), and I thought it could be
possible to build an optimized bzero() on top of it. I'll try to sum up
the result of the discussion.

If, for whatever reason:

 * madvise(..., MADV_FREE) causes the next access to the page to see a
   zeroed page;
 * that behaviour is not going to change;
 * it is a faster way to zero a page than writing zeroes to it;

then I have a faster bzero(). Even if it were not portable across
architectures (it is!), or exploited an architectural feature (it does
not!), so what? After all, the current Pentium-optimized bcopy() has
many more architectural dependencies.

Now, a bit about performance. Let's say, just to set a number, that you
can write zeroes to memory at 200MB/s; then you need some 20us to clear
a 4KB page, and the time should scale roughly linearly with the number
of pages. It is just a matter of measuring, _on the same system_, how
much it would cost to use madvise() instead (I expect a fairly high
fixed overhead for the system call, plus a modest per-page cost to free
the entries), plus the cost of a page fault the first time each of these
bzero()ed pages is touched again.

To sum up, the pseudo code for the optimized bzero() would be as
follows (a fleshed-out sketch in C is given further below):

	if (len < LOW_THRESHOLD)
		zero_using_rep_stosb();
	else if (len < N_PAGES * 4096)
		zero_taking_care_of_alignment_and_pentium_opt_etc();
	else {
		bzero from the beginning to the first page boundary;
		bzero from the last page boundary to the end;
		call madvise(..., MADV_FREE) on the remaining pages;
	}

When do I gain performance? Of course, when there is enough spare CPU
and/or free memory that by the time the page is accessed again the
kernel already has a zeroed page available.

But I see other advantages. Consider that memory is often overallocated
(e.g. for hash tables) and, as you correctly say, malloc() gives no
guarantee that it is zeroed. So you have to bzero() malloc'ed pages,
and this makes them all mapped. By writing your code differently, e.g.
by declaring a large bss array and using only the amount of memory you
actually need, you have the guarantee that it is zeroed (a nice
security side effect) and you use the zero-fill-on-demand mechanism
which is already in the kernel. Now, why should a "careful" programmer
(who uses malloc() and bzero() to get memory only when he needs it) be
penalized in performance with respect to a less careful one who simply
asks for a large chunk of memory at program startup and then uses only
a few bits of it?
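To make the pseudo code above a bit more concrete, here is a minimal
user-level sketch of the idea. The names (fast_bzero, LOW_THRESHOLD,
MADV_THRESHOLD) and the cutoff values are invented for illustration;
the real thresholds would come from the measurements discussed above,
and the two small-size branches just call memset() where the real code
would use "rep stosb" and the alignment-aware Pentium loop.

/*
 * Sketch of the optimized bzero() described above.  Thresholds and
 * names are placeholders, not measured values.
 */
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>

#define PAGE_SIZE	4096UL
#define LOW_THRESHOLD	64UL			/* made-up cutoff */
#define MADV_THRESHOLD	(16 * PAGE_SIZE)	/* made-up cutoff */

void
fast_bzero(void *buf, size_t len)
{
	char *p = buf;
	char *first_pg, *last_pg;

	if (len < LOW_THRESHOLD) {
		memset(p, 0, len);	/* stand-in for rep stosb */
		return;
	}
	if (len < MADV_THRESHOLD) {
		memset(p, 0, len);	/* stand-in for the Pentium-tuned loop */
		return;
	}

	/* first page boundary at or above p, last one at or below p+len */
	first_pg = (char *)(((unsigned long)p + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1));
	last_pg  = (char *)(((unsigned long)p + len) & ~(PAGE_SIZE - 1));

	memset(p, 0, first_pg - p);		/* head fragment */
	memset(last_pg, 0, p + len - last_pg);	/* tail fragment */

	/*
	 * Hand the whole pages back to the VM system.  This is only a
	 * correct bzero() if MADV_FREE guarantees a demand-zero page on
	 * the next access, which is exactly the open question here.
	 */
	madvise(first_pg, last_pg - first_pg, MADV_FREE);
}

Whether the last branch ever wins depends on the system-call and
page-fault costs mentioned above, and on MADV_FREE behaving the same
way regardless of what backs the mapping.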
That less careful style is a common situation: not long ago, when I was
studying memory compression, I looked at swap pages and found that a
fully-zeroed page almost never went to the swap area -- such pages are
unmapped, and if you access one, chances are that you actually write to
it. If you force a core dump of some running program (I tried with
many: shells, sendmail, ftpd, ... all the usual programs that run by
default on a system), these fully-zero pages do go to the core file,
and if you look at these core files you will always find some 20
fully-zero pages. What does this mean?

 1. some code (perhaps libc itself) overallocates memory;

 2. the ZFOD mechanism works very well, saving some 80KB per process;
    considering that I have, on average, 50 processes running, this
    amounts to a 4MB saving. On a busy server such as freefall, with
    1000 or so concurrent processes, the saving is substantial;

 3. if the above code were rewritten to use malloc() and bzero() to
    make more careful use of the memory, much of that advantage would
    instantly disappear.

> madvise(MADV_FREE) doesn't zero anything. It merely tell the kernel
> that nobody loves this page, so if it is convenient, just hand me
                                  ^^^^^^^^^^^^^^^^^^^
> any page instead next time I ask for it.

The only doubt I have is the following:

 * does madvise(..., MADV_FREE) ALWAYS unmap the selected pages, or
   does it do so only when it is convenient?

The following code fragment in vm_object_madvise() is a bit unclear to
me (a small test program sketch follows at the end of this message).

	...
	} else if ((advise == MADV_DONTNEED) ||
		   ((advise == MADV_FREE) &&
		    ((object->type != OBJT_DEFAULT) &&
		     (object->type != OBJT_SWAP)))) {
		vm_page_deactivate(m);
	} else if (advise == MADV_FREE) {
		/*
		 * Force a demand-zero on next ref
		 */
		if (object->type == OBJT_SWAP)
			swap_pager_dmzspace(object, m->pindex, 1);
		vm_page_protect(m, VM_PROT_NONE);
		vm_page_free(m);
	}
	...

	Luigi
====================================================================
Luigi Rizzo                    Dip. di Ingegneria dell'Informazione
email: luigi@iet.unipi.it      Universita' di Pisa
tel: +39-50-568533             via Diotisalvi 2, 56126 PISA (Italy)
fax: +39-50-568522             http://www.iet.unipi.it/~luigi/
====================================================================
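A quick way to probe the doubt above from user level would be something
along these lines. It is only a sketch: it exercises a single anonymous
MAP_ANON mapping (so it says nothing about the OBJT_SWAP case in the
kernel fragment above) and only shows what one particular run of one
particular kernel happens to do, not what is guaranteed.

/*
 * Minimal test: after madvise(MADV_FREE), does the next read of the
 * page see zeroes again, or the old contents?
 */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

int
main(void)
{
	char *p;

	p = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(p, 0xaa, PAGE_SIZE);		/* dirty the page */

	if (madvise(p, PAGE_SIZE, MADV_FREE) < 0) {
		perror("madvise");
		return 1;
	}

	/*
	 * If MADV_FREE really forces a demand-zero page on the next
	 * reference, this prints 0; if the kernel kept the page, it
	 * prints 0xaa.  "Sometimes one, sometimes the other" would mean
	 * the call is only a hint, which is the worrying case for a
	 * bzero() built on top of it.
	 */
	printf("first byte after MADV_FREE: %#x\n", (unsigned char)p[0]);
	return 0;
}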