From owner-freebsd-hackers Tue Sep 24 01:10:45 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id BAA11864 for hackers-outgoing; Tue, 24 Sep 1996 01:10:45 -0700 (PDT)
Received: from labinfo.iet.unipi.it (labinfo.iet.unipi.it [131.114.9.5]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id BAA11829; Tue, 24 Sep 1996 01:10:38 -0700 (PDT)
Received: from localhost (luigi@localhost) by labinfo.iet.unipi.it (8.6.5/8.6.5) id JAA00165; Tue, 24 Sep 1996 09:38:24 +0200
From: Luigi Rizzo
Message-Id: <199609240738.JAA00165@labinfo.iet.unipi.it>
Subject: Optimizing bzero()
To: hackers@freebsd.org, bde@freebsd.org, asami@freebsd.org
Date: Tue, 24 Sep 1996 09:38:24 +0200 (MET DST)
X-Mailer: ELM [version 2.4 PL23]
Content-Type: text
Sender: owner-hackers@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

During some discussion on his great phkmalloc(), Poul pointed out to me
the existence of madvise(..., MADV_FREE), and I thought it could be
possible to build an optimized bzero() on top of it. I'll try to sum up
the result of the discussion.

If, for whatever reason:

 * madvise(..., MADV_FREE) causes the next access to the page to see a
   zeroed page;
 * that behaviour is not going to change;
 * it is a faster way to zero a page than writing zeroes to it;

then I have a faster bzero(). Even if it were not portable across
architectures (it is!), or exploited an architectural feature (it does
not!), so what? After all, the current Pentium-optimized bcopy() has
many more architectural dependencies.

Now, a bit about performance. Let's say, just to set a number, that you
can write zeroes to memory at 200MB/s; then you need some 20us to clear
a 4KB page, and the time should scale roughly linearly with the number
of pages. It is just a matter of measuring, _on the same system_, how
much it would cost to use madvise() instead (I expect a fairly high
fixed overhead for the system call, plus a modest per-page cost to free
the entries), plus the cost of a page fault the first time each of these
bzero()ed pages is touched again.

To sum up, the pseudo code for the optimized bzero() would be as
follows (a fleshed-out sketch in C is given further below):

	if (len < LOW_THRESHOLD)
		zero_using_rep_stosb();
	else if (len < N_PAGES * 4096)
		zero_taking_care_of_alignment_and_pentium_opt_etc();
	else {
		bzero from the beginning to the first page boundary;
		bzero from the last page boundary to the end;
		call madvise(..., MADV_FREE) on the remaining pages;
	}

When do I gain performance? Of course, when there is enough spare CPU
and/or free memory that by the time the page is accessed again the
kernel already has a zeroed page available.

But I see other advantages. Consider that memory is often overallocated
(e.g. for hash tables) and, as you correctly say, malloc() gives no
guarantee that it is zeroed. So you have to bzero() malloc'ed pages,
and this makes them all mapped. By writing your code differently, e.g.
by declaring a large bss array and using only the amount of memory you
actually need, you have the guarantee that it is zeroed (a nice
security side effect) and you use the zero-fill-on-demand mechanism
which is already in the kernel. Now, why should a "careful" programmer
(who uses malloc() and bzero() to get memory only when he needs it) be
penalized in performance with respect to a less careful one who simply
asks for a large chunk of memory at program startup and then uses only
a few bits of it?
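To make the pseudo code above a bit more concrete, here is a minimal
user-level sketch of the idea. The names (fast_bzero, LOW_THRESHOLD,
MADV_THRESHOLD) and the cutoff values are invented for illustration;
the real thresholds would come from the measurements discussed above,
and the two small-size branches just call memset() where the real code
would use "rep stosb" and the alignment-aware Pentium loop.

/*
 * Sketch of the optimized bzero() described above.  Thresholds and
 * names are placeholders, not measured values.
 */
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>

#define PAGE_SIZE	4096UL
#define LOW_THRESHOLD	64UL			/* made-up cutoff */
#define MADV_THRESHOLD	(16 * PAGE_SIZE)	/* made-up cutoff */

void
fast_bzero(void *buf, size_t len)
{
	char *p = buf;
	char *first_pg, *last_pg;

	if (len < LOW_THRESHOLD) {
		memset(p, 0, len);	/* stand-in for rep stosb */
		return;
	}
	if (len < MADV_THRESHOLD) {
		memset(p, 0, len);	/* stand-in for the Pentium-tuned loop */
		return;
	}

	/* first page boundary at or above p, last one at or below p+len */
	first_pg = (char *)(((unsigned long)p + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1));
	last_pg  = (char *)(((unsigned long)p + len) & ~(PAGE_SIZE - 1));

	memset(p, 0, first_pg - p);		/* head fragment */
	memset(last_pg, 0, p + len - last_pg);	/* tail fragment */

	/*
	 * Hand the whole pages back to the VM system.  This is only a
	 * correct bzero() if MADV_FREE guarantees a demand-zero page on
	 * the next access, which is exactly the open question here.
	 */
	madvise(first_pg, last_pg - first_pg, MADV_FREE);
}

Whether the last branch ever wins depends on the system-call and
page-fault costs mentioned above, and on MADV_FREE behaving the same
way regardless of what backs the mapping.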
That less careful style is a common situation: not long ago, when I was
studying memory compression, I looked at swap pages and found that a
fully-zeroed page almost never went to the swap area -- such pages are
unmapped, and if you access one, chances are that you actually write to
it. If you force a core dump of some running program (I tried with
many: shells, sendmail, ftpd, ... all the usual programs that run by
default on a system), these fully-zero pages do go to the core file,
and if you look at these core files you will always find some 20
fully-zero pages. What does this mean?

 1. some code (perhaps libc itself) overallocates memory;

 2. the ZFOD mechanism works very well, saving some 80KB per process;
    considering that I have, on average, 50 processes running, this
    amounts to a 4MB saving. On a busy server such as freefall, with
    1000 or so concurrent processes, the saving is substantial;

 3. if the above code were rewritten to use malloc() and bzero() to
    make more careful use of the memory, much of that advantage would
    instantly disappear.

> madvise(MADV_FREE) doesn't zero anything. It merely tell the kernel
> that nobody loves this page, so if it is convenient, just hand me
                                  ^^^^^^^^^^^^^^^^^^^
> any page instead next time I ask for it.

The only doubt I have is the following:

 * does madvise(..., MADV_FREE) ALWAYS unmap the selected pages, or
   does it do so only when it is convenient?

The following code fragment in vm_object_madvise() is a bit unclear to
me (a small test program sketch follows at the end of this message).

	...
	} else if ((advise == MADV_DONTNEED) ||
		   ((advise == MADV_FREE) &&
		    ((object->type != OBJT_DEFAULT) &&
		     (object->type != OBJT_SWAP)))) {
		vm_page_deactivate(m);
	} else if (advise == MADV_FREE) {
		/*
		 * Force a demand-zero on next ref
		 */
		if (object->type == OBJT_SWAP)
			swap_pager_dmzspace(object, m->pindex, 1);
		vm_page_protect(m, VM_PROT_NONE);
		vm_page_free(m);
	}
	...

	Luigi
====================================================================
Luigi Rizzo                    Dip. di Ingegneria dell'Informazione
email: luigi@iet.unipi.it      Universita' di Pisa
tel: +39-50-568533             via Diotisalvi 2, 56126 PISA (Italy)
fax: +39-50-568522             http://www.iet.unipi.it/~luigi/
====================================================================
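A quick way to probe the doubt above from user level would be something
along these lines. It is only a sketch: it exercises a single anonymous
MAP_ANON mapping (so it says nothing about the OBJT_SWAP case in the
kernel fragment above) and only shows what one particular run of one
particular kernel happens to do, not what is guaranteed.

/*
 * Minimal test: after madvise(MADV_FREE), does the next read of the
 * page see zeroes again, or the old contents?
 */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

int
main(void)
{
	char *p;

	p = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(p, 0xaa, PAGE_SIZE);		/* dirty the page */

	if (madvise(p, PAGE_SIZE, MADV_FREE) < 0) {
		perror("madvise");
		return 1;
	}

	/*
	 * If MADV_FREE really forces a demand-zero page on the next
	 * reference, this prints 0; if the kernel kept the page, it
	 * prints 0xaa.  "Sometimes one, sometimes the other" would mean
	 * the call is only a hint, which is the worrying case for a
	 * bzero() built on top of it.
	 */
	printf("first byte after MADV_FREE: %#x\n", (unsigned char)p[0]);
	return 0;
}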