Date: Tue, 29 Jul 2003 23:11:30 +0200
From: Poul-Henning Kamp <phk@phk.freebsd.dk>
To: jeffr@freebsd.org
Cc: current@freebsd.org
Subject: HEADSUP: UMA not reentrant / possible memory leak
Message-ID: <88569.1059513090@critter.freebsd.dk>

[I'm CC'ing current because this seems to have a significant negative
impact on -current kernel stability, and we can use some more data,
in particular on non-i386 SMP machines]

Thanks to Lukas Ertl and Bosko we have found a clear indication that
UMA is in fact not reentrant (enough).

The indication of this is that the g_bio zone does not return to
zero USED as it should.

The attached patch adds an atomic counter in GEOM to count the number
of actually used items in the sysctl variable debug.ngbio.

Here is a typical output from my SMP box:

	bang# sh a.sh
	g_bio:            144,        0,     35,     77,     4281
	debug.ngbio: 0
	10:58PM  up 36 secs, 1 user, load averages: 0.65, 0.20, 0.07
	g_bio:            144,        0,     66,    102,     5917
	debug.ngbio: 0
	10:58PM  up 56 secs, 3 users, load averages: 0.46, 0.18, 0.07
	g_bio:            144,        0,     69,     99,    12352
	debug.ngbio: 0
	10:59PM  up 1 min, 3 users, load averages: 0.56, 0.22, 0.09
	g_bio:            144,        0,    185,    123,    20023
	debug.ngbio: 0
	10:59PM  up 2 mins, 3 users, load averages: 0.62, 0.25, 0.10
	g_bio:            144,        0,    227,     81,    28259
	debug.ngbio: 0
	10:59PM  up 2 mins, 3 users, load averages: 0.64, 0.28, 0.11
	g_bio:            144,        0,    222,     86,    32256
	debug.ngbio: 0
	11:00PM  up 2 mins, 3 users, load averages: 0.74, 0.33, 0.13

Notice that the USED column fluctuates both up and down.

Other machines are able to reproduce negative USED counts.

As you can see in the patch I have added a mutex around the zone
operations in order to see if that solved the issue, and it doesn't
seem to make any difference at all.

I am unable to tell if it is just the UMA zone statistics which are
f**ked up, or if the "important" data structures in UMA are also
victims of this.

The machines which Lukas and Bosko work on seem to die after some
short period of time, and this could indicate that this is not
just statistics being b0rked.

We see this problem also on GCC 3.2.2 machines.

HELP!

Poul-Henning

Index: geom_io.c
===================================================================
RCS file: /home/ncvs/src/sys/geom/geom_io.c,v
retrieving revision 1.44
diff -u -r1.44 geom_io.c
--- geom_io.c	18 Jun 2003 10:33:09 -0000	1.44
+++ geom_io.c	29 Jul 2003 20:51:55 -0000
@@ -39,6 +39,7 @@
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
+#include <sys/sysctl.h>
 #include <sys/malloc.h>
 #include <sys/bio.h>
 
@@ -55,6 +56,12 @@
 static u_int pace;
 static uma_zone_t biozone;
 
+struct mtx gbiomutex;
+static int ngbio;
+SYSCTL_INT(_debug, OID_AUTO, ngbio, CTLFLAG_RD,
+	&ngbio, 0, "");
+
+
 #include <machine/atomic.h>
 
 static void
@@ -116,15 +123,26 @@
 {
 	struct bio *bp;
 
+	mtx_lock(&gbiomutex);
 	bp = uma_zalloc(biozone, M_NOWAIT | M_ZERO);
+	mtx_unlock(&gbiomutex);
+	if (bp != NULL)
+		atomic_add_int(&ngbio, 1);
 	return (bp);
 }
 
 void
 g_destroy_bio(struct bio *bp)
 {
-
+	if (bp == NULL) {
+		printf("g_destroy_bio(NULL)");
+		Debugger("foo");
+		return;
+	}
+	mtx_lock(&gbiomutex);
 	uma_zfree(biozone, bp);
+	mtx_unlock(&gbiomutex);
+	atomic_add_int(&ngbio, -1);
 }
 
 struct bio *
@@ -132,8 +150,11 @@
 {
 	struct bio *bp2;
 
+	mtx_lock(&gbiomutex);
 	bp2 = uma_zalloc(biozone, M_NOWAIT | M_ZERO);
+	mtx_unlock(&gbiomutex);
 	if (bp2 != NULL) {
+		atomic_add_int(&ngbio, 1);
 		bp2->bio_parent = bp;
 		bp2->bio_cmd = bp->bio_cmd;
 		bp2->bio_length = bp->bio_length;
@@ -304,6 +325,7 @@
 
 	bzero(&mymutex, sizeof mymutex);
 	mtx_init(&mymutex, "g_xdown", MTX_DEF, 0);
+	mtx_init(&gbiomutex, "gbio", MTX_DEF, 0);
 
 	for(;;) {
 		g_bioq_lock(&g_bio_run_down);

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
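
For readers who want to try the cross-check idea outside the kernel, here is a minimal userland sketch of the same technique: keep an independent atomic count of live objects next to the allocator and compare it with whatever the allocator itself reports. It is only an illustration under stated assumptions. The names tracked_alloc(), tracked_free(), churn() and live_objects are hypothetical, calloc()/free() stand in for uma_zalloc()/uma_zfree(), and C11 atomics stand in for atomic_add_int(). It is not the GEOM code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Independent count of live objects, kept outside the allocator. */
static atomic_int live_objects;

/* Hypothetical wrapper: allocate zeroed memory, count only on success. */
static void *
tracked_alloc(size_t size)
{
	void *p = calloc(1, size);

	if (p != NULL)
		atomic_fetch_add(&live_objects, 1);
	return (p);
}

/* Hypothetical wrapper: a NULL argument is treated as a caller bug. */
static void
tracked_free(void *p)
{
	assert(p != NULL);
	free(p);
	atomic_fetch_sub(&live_objects, 1);
}

/*
 * Allocate and release `rounds` objects, then return the live count.
 * With balanced calls this is 0, whatever the allocator's own
 * statistics claim.
 */
static int
churn(int rounds)
{
	for (int i = 0; i < rounds; i++) {
		void *p = tracked_alloc(144);	/* arbitrary object size */

		if (p != NULL)
			tracked_free(p);
	}
	return (atomic_load(&live_objects));
}
```

If churn() returns 0 while the allocator's own statistics show a nonzero in-use count, the callers are balanced and the discrepancy lies inside the allocator or its bookkeeping. Because the counter is atomic, the wrappers can be called from several threads without an extra lock.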