Date:      Fri, 17 Dec 2004 08:48:29 +0100
From:      Peter Holm <peter@holm.cc>
To:        John Baldwin <jhb@FreeBSD.org>
Cc:        jroberson@chesapeake.net
Subject:   Re: Freeze
Message-ID:  <20041217074829.GA46675@peter.osted.lan>
In-Reply-To: <200412161645.05379.jhb@FreeBSD.org>
References:  <20041112123343.GA12048@peter.osted.lan> <200412161521.44026.jhb@FreeBSD.org> <20041216213157.GA41605@peter.osted.lan> <200412161645.05379.jhb@FreeBSD.org>

On Thu, Dec 16, 2004 at 04:45:05PM -0500, John Baldwin wrote:
> On Thursday 16 December 2004 04:31 pm, Peter Holm wrote:
> > On Thu, Dec 16, 2004 at 03:21:44PM -0500, John Baldwin wrote:
> > > On Monday 06 December 2004 08:59 am, Peter Holm wrote:
> > > > On Fri, Nov 19, 2004 at 05:10:19PM -0500, John Baldwin wrote:
> > > > > On Friday 19 November 2004 02:59 am, Peter Holm wrote:
> > > > > > On Mon, Nov 15, 2004 at 03:46:15PM -0500, John Baldwin wrote:
> > > > > > > On Friday 12 November 2004 07:33 am, Peter Holm wrote:
> > > > > > > > GENERIC HEAD from Nov 11 08:05 UTC
> > > > > > > >
> > > > > > > > The following stack traces etc. were done before my first
> > > > > > > > cup of coffee, so they're not as informative as they could
> > > > > > > > have been :-(
> > > > > > > >
> > > > > > > > The test box appeared to have been frozen for more than 6
> > > > > > > > hours, but was pingable.
> > > > > > > >
> > > > > > > > http://www.holm.cc/stress/log/cons86.html
> > > > > > >
> > > > > > > A weak guess is that you have the system in some sort of livelock
> > > > > > > due to fork()?  Have you tried running with 'debug.mpsafevm=1'
> > > > > > > set from the loader?
> > > > > > >
> > > > > > > --
> > > > > > > John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
> > > > > > > "Power Users Use the Power to Serve"  =  http://www.FreeBSD.org
> > > > > >
> > > > > > OK, I've got some more info:
> > > > > >
> > > > > > http://www.holm.cc/stress/log/cons88.html
> > > > > >
> > > > > > Looks like a spin in uma_zone_slab() when slab_zalloc() fails?
> > > > >
> > > > > Yes, I think if you specify M_WAITOK, then that might happen.
> > > > > slab_zalloc() can fail if any of the init functions fail for example,
> > > > > in which case it would loop forever.  You can try this hack (though
> > > > > it may very well be wrong) to return failure if that is what is
> > > > > triggering:
> > > > >
> > > > > Index: uma_core.c
> > > > > ===================================================================
> > > > > RCS file: /usr/cvs/src/sys/vm/uma_core.c,v
> > > > > retrieving revision 1.110
> > > > > diff -u -r1.110 uma_core.c
> > > > > --- uma_core.c	6 Nov 2004 11:43:30 -0000	1.110
> > > > > +++ uma_core.c	19 Nov 2004 22:08:26 -0000
> > > > > @@ -1998,6 +1998,10 @@
> > > > >  		 */
> > > > >  		if (flags & M_NOWAIT)
> > > > >  			flags |= M_NOVM;
> > > > > +
> > > > > +		/* XXXHACK */
> > > > > +		if (flags & M_WAITOK)
> > > > > +			break;
> > > > >  	}
> > > > >  	return (slab);
> > > > >  }
> > > > >
> > > > > --
> > > > > John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
> > > > > "Power Users Use the Power to Serve"  =  http://www.FreeBSD.org
> > > >
> > > > I instrumented the code with this:
> > > > $ cvs diff -u
> > > > cvs diff: Diffing .
> > > > Index: uma_core.c
> > > > ===================================================================
> > > > RCS file: /home/ncvs/src/sys/vm/uma_core.c,v
> > > > retrieving revision 1.110
> > > > diff -u -r1.110 uma_core.c
> > > > --- uma_core.c  6 Nov 2004 11:43:30 -0000       1.110
> > > > +++ uma_core.c  6 Dec 2004 13:49:36 -0000
> > > > @@ -1926,6 +1926,7 @@
> > > >  {
> > > >         uma_slab_t slab;
> > > >         uma_keg_t keg;
> > > > +       int i;
> > > >
> > > >         keg = zone->uz_keg;
> > > >
> > > > @@ -1943,7 +1944,8 @@
> > > >
> > > >         slab = NULL;
> > > >
> > > > -       for (;;) {
> > > > +       for (i = 0;;i++) {
> > > > +               KASSERT(i < 10000, ("uma_zone_slab is looping"));
> > > >                 /*
> > > >                  * Find a slab with some space.  Prefer slabs that are partially
> > > >                  * used over those that are totally full.  This helps to reduce
> > > >
> > > > and now during test of Jeff Roberson's "SMP FFS" patch the assert
> > > > triggered: http://www.holm.cc/stress/log/cons92.html
> > >
> > > Hmm.  Does the hack patch above make the hang go away or does it just
> > > break things worse?
> >
> > How would an assert make a problem go away? It was meant as a tool
> > to figure out the source of the problem: the freeze.
> 
> I was referring to my earlier patch that breaks out of the loop if M_WAITOK is 
> set so that it shouldn't spin at all in that case.  Do you have that hackish 
> patch already applied and it's spinning anyway?
> 

Oh, dear. Communicating (especially via email) is so hard!

And no, I never applied your patch. 

It now seems I'm able to reproduce the freeze more often, that is,
within 12 to 15 hours of testing. I have a freeze right now on my
test box: I can ping it and the console is active, but I cannot
log in.

When I break into the debugger I enter different processes,
but the stack traces all end up in:

uma_zalloc_internal(102,c1064dc0,102,c1052dc0,c1052dc0) at uma_zalloc_internal+0x23
slab_zalloc(c1052dc0,8,c1064dc0,cf78fc5c,c0774af9) at slab_zalloc+0x33b
uma_zone_slab(c1064dc8,8,c084375c,877) at uma_zone_slab+0x7c
uma_zalloc_internal(102,0,102) at uma_zalloc_internal+0x2d
malloc(acc,c0888d00,102,131ae,cf78fccc) at malloc+0x6b
I'll go ahead and apply your patch to see if it alleviates the
freeze problem.

Regards,

- Peter

> -- 
> John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
> "Power Users Use the Power to Serve"  =  http://www.FreeBSD.org

-- 
Peter Holm
