Date:      Sat, 22 Sep 2001 01:17:44 -0400
From:      Chuck Cranor <chuck@research.att.com>
To:        freebsd-stable@FreeBSD.ORG
Cc:        Chuck Cranor <chuck@research.att.com>
Subject:   why my 4.4-RELEASE kernel deadlocks
Message-ID:  <20010922011744.B109536@chips.research.att.com>

hi-

i have debugged the problem, here is the scoop:

Background:

   every allocated FFS vnode has a private memory area allocated as 
malloc type "FFS node"...  each FFS node allocation takes a 512 byte 
block out of the kernel memory allocator.   this allocation occurs in 
the function ffs_vget() (see file sys/ufs/ffs/ffs_vfsops.c, look for
MALLOC).
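
   for reference, the allocation in ffs_vget() looks roughly like this
(a paraphrased sketch from memory, not the verbatim 4.4 source; the real
code allocates a struct inode under the M_FFSNODE malloc type, which is
the type that shows up as "FFS node"):

/* sketch of the per-vnode allocation in ffs_vget() (paraphrased) */
struct inode *ip;

MALLOC(ip, struct inode *, sizeof(struct inode), M_FFSNODE, M_WAITOK);
/* sizeof(struct inode) rounds up to the 512 byte malloc bucket, so each
   FFS vnode charges 512 bytes against the "FFS node" limit */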

   the kernel memory allocator caps the total amount of memory allocated
to each type at 102400K.   if you try to allocate more than that, the
kernel may sleep waiting for memory of that type to become free.   you
can see the current allocation with "vmstat -m" (note MemUse and Limit):

Memory statistics by type                          Type  Kern
        Type  InUse MemUse HighUse  Limit Requests Limit Limit Size(s)
     FFS node   334   167K    167K102400K     1170    0     0  512

if each "FFS node" allocation takes 512 bytes, then you can have at
most 102400K/512 (i.e. 204800) nodes allocated before the kernel malloc
refuses to allocate any more nodes.


   now consider when vnodes are allocated and freed.   looking at
file sys/kern/vfs_subr.c, the function getnewvnode() allocates new
vnodes.   it attempts to recycle a free vnode that has already been
allocated before it allocates a new one.   free vnodes are stored
on the global list vnode_free_list and counted by the global "freevnodes".
you can see the current value of freevnodes using the command 
"sysctl -a | grep freevnodes"...    if there are not enough free vnodes,
then the system allocates a new vnode using zalloc(vnode_zone).
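
    in rough outline, the path described above looks like this (a
restatement as code, not the verbatim 4.4 getnewvnode(); the actual
free-list test is more involved, and "enough_free_vnodes" is just a
placeholder, not a real kernel symbol):

/* paraphrase of getnewvnode()'s allocation path (not verbatim 4.4 code) */
struct vnode *vp = NULL;

if (enough_free_vnodes)                       /* placeholder for the real test */
        vp = TAILQ_FIRST(&vnode_free_list);   /* recycle off the free list */

if (vp == NULL) {
        vp = (struct vnode *) zalloc(vnode_zone);   /* brand new vnode */
        numvnodes++;
}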

    vnodes are freed by the vrele() function in the same file, but
only if their v_usecount is going to drop to zero and VSHOULDFREE() is
true.   looking at sys/sys/vnode.h, VSHOULDFREE is true if neither the
VFREE nor VDOOMED flag is set, the hold count is zero, the use count is
zero, and the vm_object's reference count and resident page count are
both zero (if there is a vm_object allocated).
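
   written out as a macro, that test is along these lines (reconstructed
from the description above, not copied verbatim from 4.4 vnode.h; the
field names v_flag, v_holdcnt, ref_count and resident_page_count are
from memory):

#define VSHOULDFREE(vp)                                                 \
        (!((vp)->v_flag & (VFREE | VDOOMED)) &&                         \
         (vp)->v_holdcnt == 0 && (vp)->v_usecount == 0 &&               \
         ((vp)->v_object == NULL ||                                     \
          ((vp)->v_object->ref_count == 0 &&                            \
           (vp)->v_object->resident_page_count == 0)))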


Problem:
 - vrele() does not free vnodes that have resident pages of memory
	associated with them (VSHOULDFREE will be false).

 - there is a global limit on the number of allocated vnodes on the
	system.   this is in the global int "desiredvnodes" which 
	shows up as "kern.maxvnodes" in "sysctl -a" output.
	** FreeBSD kernel basically ignores kern.maxvnodes **

 - the kernel will free an inactive vnode if all its pages of memory
	in its vm_object become non-resident (e.g. paged out).

-> if you have a system with a large amount of RAM (e.g. >800MB)
	it is possible for any user to create enough vnodes to fill
	the "FFS node" kernel malloc area and deadlock the system.


the key is to create a large number of inactive vnodes, each of which
has a very small number of pages associated with it (ideal: 1 page per
vnode).   [hint: think about "cvs co -AP ports"]



here is an example program that creates a specified number of small files:

/*
 * try.c  chuck@research.att.com
 * will deadlock FreeBSD kernel if system has lots of RAM and mx is large
 * enough (e.g. >204800).   tested on system with 2GB RAM.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <unistd.h>

int
main(int argc, char **argv) {
  int psz = getpagesize();
  char *buf;
  char fn[512];
  int fd, lcv, mx;

  if (argc != 2) errx(1, "usage: try number-of-files");
  mx = atoi(argv[1]);
  if (mx < 1) errx(1, "usage: try number-of-files");

  buf = malloc(psz);
  if (!buf) err(1, "malloc");
  bzero(buf, psz);

  strcpy(buf, "hi there!\n");

  /* create directories for a lot of files (1000 files per directory) */
  for (lcv = 0 ; lcv < mx ; lcv += 1000) {
    sprintf(fn, "%d.d", lcv/1000);
    if (mkdir(fn, 0777) < 0) err(1, "mkdir");
  }

  /* create a lot of 1-page files */
  for (lcv = 0 ; lcv < mx ; lcv++) {
    if ((lcv % 1000) == 0) {
      printf("%dk ", lcv / 1000);
      fflush(stdout);
    }
    sprintf(fn, "%d.d/%d.dat", lcv/1000, lcv%1000);
    fd = open(fn, O_CREAT|O_RDWR, 0666);
    if (fd < 0) err(1, "open");
    if (write(fd, buf, psz) != psz) err(1, "write");
    close(fd);
  }
  return (0);
}

now here is a script of me using it.    note that before i run it,
debug.numvnodes is less than kern.maxvnodes.   when it finishes,
note that debug.numvnodes is much greater than kern.maxvnodes and
the kernel memory allocation for "FFS node" reported by "vmstat -m"
is quite large (i didn't run it all the way to deadlock):


Script started on Sat Sep 22 00:10:05 2001
cdn3> sysctl -a | egrep 'maxvnodes|numvnodes|freevnodes'
kern.maxvnodes: 129183
debug.numvnodes: 340
debug.wantfreevnodes: 25
debug.freevnodes: 24
cdn3> ./try 150000
0k 1k 2k 3k 4k 5k 6k 7k 8k 9k 10k 11k 12k 13k 14k 15k 16k 17k 18k 19k 20k 21k 22k 23k 24k 25k 26k 27k 28k 29k 30k 31k 32k 33k 34k 35k 36k 37k 38k 39k 40k 41k 42k 43k 44k 45k 46k 47k 48k 49k 50k 51k 52k 53k 54k 55k 56k 57k 58k 59k 60k 61k 62k 63k 64k 65k 66k 67k 68k 69k 70k 71k 72k 73k 74k 75k 76k 77k 78k 79k 80k 81k 82k 83k 84k 85k 86k 87k 88k 89k 90k 91k 92k 93k 94k 95k 96k 97k 98k 99k 100k 101k 102k 103k 104k 105k 106k 107k 108k 109k 110k 111k 112k 113k 114k 115k 116k 117k 118k 119k 120k 121k 122k 123k 124k 125k 126k 127k 128k 129k 130k 131k 132k 133k 134k 135k 136k 137k 138k 139k 140k 141k 142k 143k 144k 145k 146k 147k 148k 149k cdn3> sysctl -a | egrep 'maxvnodes|numvnodes|freevnodes'
kern.maxvnodes: 129183
debug.numvnodes: 150334
debug.wantfreevnodes: 25
debug.freevnodes: 25
cdn3> vmstat -m | grep 'FFS no'
 512  ATA generic, UFS mount, FFS node, ifaddr, mount, BIO buffer, USBdev,
     FFS node150331 75166K  75166K102400K   151811    0     0  512
cdn3> 
cdn3> 


now, if there are enough other things trying to get RAM on the system, then
they will cause the RAM for the small files we've created to be paged out 
and reallocated.  if all the pages are removed from a vnode, then it gets 
freed.  thus, an active system is less likely to deadlock because other 
users will be pushing these vnodes out of RAM.  here is a simple program 
that allocates 1GB of RAM:

/*
 * mzero.c  chuck@research.att.com
 * touch 1GB of anonymous memory so other pages get pushed out of RAM.
 */
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>

#define GB	(1*1024*1024*1024)

int
main(void) {
  void *p;

  p = malloc(GB);
  if (!p) err(1, "malloc");
  printf("malloc done, bzeroing....\n");
  bzero(p, GB);		/* touch every page so it becomes resident */
  printf("bzero done\n");
  printf("hit <cr> ... ");
  fflush(stdout);
  (void) getchar();
  exit(0);
}

watch what happens to debug.freevnodes when i run this program:

cdn3> ./mzero &
[1] 298
cdn3> malloc done, bzeroing....
bzero done
hit <cr> ... 
[1]  + Suspended (tty input)         ./mzero
cdn3> sysctl -a | egrep 'maxvnodes|numvnodes|freevnodes'
kern.maxvnodes: 129183
debug.numvnodes: 150336
debug.wantfreevnodes: 25
debug.freevnodes: 43800
cdn3> ./mzero &
[2] 301
cdn3> malloc done, bzeroing....
bzero done
hit <cr> ... 
[2]  + Suspended (tty input)         ./mzero
cdn3> sysctl -a | egrep 'maxvnodes|numvnodes|freevnodes'
kern.maxvnodes: 129183
debug.numvnodes: 150336
debug.wantfreevnodes: 25
debug.freevnodes: 146820
cdn3> vmstat -m | grep 'FFS no'
 512  ATA generic, UFS mount, FFS node, ifaddr, mount, BIO buffer, USBdev,
     FFS node150331 75166K  75166K102400K   151811    0     0  512
cdn3> 


the vmstat shows the final memory usage in this test.   most of the
vnodes behind that 75166K are now sitting on the free list (note
debug.freevnodes=146820), but their "FFS node" memory stays allocated
until they are actually recycled.



Fix:  
you could hack around the problem by increasing the size of the FFS 
node malloc area.   however, i believe it is wrong for the FreeBSD 
kernel to ignore the value of kern.maxvnodes.   the kernel needs to 
be smarter about how it recycles vnodes when it reaches the 
kern.maxvnodes limit.   specifically, it should reclaim some of the 
inactive vnodes that have pages of memory associated with them.



chuck


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message



