Date:      Wed, 6 Jul 2011 01:15:44 +0000
From:      John <jwd@slowblink.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>, freebsd-fs@freebsd.org
Subject:   Re: New NFS server stress test hang
Message-ID:  <20110706011544.GA69706@FreeBSD.org>
In-Reply-To: <20110610125939.GA69616@FreeBSD.org>
References:  <20110609133805.GA78874@FreeBSD.org> <1069270455.338453.1307636209760.JavaMail.root@erie.cs.uoguelph.ca> <20110610125939.GA69616@FreeBSD.org>

----- John's Original Message -----
> ----- Rick Macklem's Original Message -----
> > John De wrote:
> > > ----- Rick Macklem's Original Message -----
> > > > John De wrote:
> > > > > Hi,
> > > > >
> > > > > We've been running some stress tests of the new nfs server.
> > > > > The system is at r222531 (head), 9 clients, two mounts each
> > > > > to the server:
> > > > >
> > > > > mount_nfs -o
> > > > > udp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=2
> > > > > ${servera}:/vol/datsrc /c/$servera/vol/datsrc
> > > > > mount_nfs -o
> > > > > udp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=0
> > > > > ${servera}:/vol/datgen /c/$servera/vol/datgen
> > > > >
> > > > >
> > > > > The system is still up & responsive, simply no nfs services
> > > > > are working. All (200) threads appear to be active, but not
> > > > > doing anything. The debugger is not compiled into this kernel.
> > > > > We can run any other tracing commands desired. We can also
> > > > > rebuild the kernel with the debugger enabled for any kernel
> > > > > debugging needed.
> > > > >
> > > > > --- long logs deleted ---
> > > >
> > > > How about a:
> > > >  ps axHlww <-- With the "H" we'll see what the nfsd server threads
> > > >  are up to
> > > >  procstat -kka
> > > >
> > > > Oh, and a couple of nfsstats a few seconds apart. It's what the
> > > > counts
> > > > are changing by that might tell us what is going on. (You can use
> > > > "-z"
> > > > to zero them out, if you have an nfsstat built from recent sources.)
> > > >
> > > > Also, does a new NFS mount attempt against the server do anything?
> > > >
> > > > Thanks in advance for help with this, rick
> > > 
> > > Hi Rick,
> > > 
> > > Here's the output. In general, the nfsd processes appear to be in
> > > either nfsrvd_getcache(35 instances) or nfsrvd_updatecache(164)
> > > sleeping on
> > > "nfssrc". The server numbers don't appear to be moving. A showmount
> > > from a
> > > client system works, but a mount does not (see below).
> >
> > Please try the attached patch and let me know if it helps. When I looked
> > I found several places where the rc_flag variable was being fiddled without the
> > mutex held. I suspect one of these resulted in the RC_LOCKED flag not
> > getting cleared, so all the threads got stuck waiting on it.
> > 
> > The patch is at:
> >   http://people.freebsd.org/~rmacklem/cache.patch
> > in case it gets eaten by the list handler.
> > Thanks for digging into this, rick
> 
> Hi Rick,
> 
>    Patch applied. The system has been up and running for about
> 16 hours now and so far it's still handling the load quite nicely.
> 
> last pid: 15853;  load averages:  5.36,  4.64,  4.48          up 0+16:08:16
> 08:48:07
> 72 processes:  7 running, 65 sleeping
> CPU:     % user,     % nice,     % system,     % interrupt,     % idle
> Mem: 22M Active, 3345M Inact, 79G Wired, 9837M Buf, 11G Free
> Swap: 
> 
>   PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
>  2049 root         26  52    0 10052K  1712K CPU3    3  97:21 942.24% nfsd
> 
>    I'll followup again in 24 hours with another status.
> 
>    Any performance related numbers/knobs we can provide that might
> be of interest?
> 
>    Thanks Rick.
> 
> -John

Hi Rick,

   We've run the nfs share patches and produced some numbers; no errors
were seen. A few questions about the times.

Four sets of repeated runs, each set run for about 2 days, with the following nfs options:

tcp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=0
tcp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=2
udp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=0
udp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=2
    
negnametimeo=2 for /dat (src), 0 for /datgen (obj).  Switched out tcp and udp, otherwise left options the same.

stock = stock system
nolock = share patches

Run             Median build time (min)         Max build time (min)
Tcp.nolock:     11.2                            11.6
Tcp.stock:      13.6                            20.8  (Some of these runs were affected by cooling issues, i.e. additional heat)
Udp.nolock:     14.9                            15.3
Udp.stock:      20.6                            20.7
    
Average nfsd cpu usage:
    
Tcp.nolock:     197.46
Tcp.stock:      164.872
Udp.nolock:     374.656
Udp.stock:      339.156
    
These cpu numbers seem a bit confusing. Udp seems to have more overhead. The share patches
seem faster in wall-clock time, but use more cpu.
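
One rough way to read the figures above is to multiply the average nfsd cpu usage by the median build time, which estimates the nfsd cpu-minutes spent per build. The sketch below does that arithmetic. It assumes the cpu figures are top-style percentages of one core and that they were averaged over the same period as the builds; neither is stated above, and the names in the code are made up for illustration.

```python
# Figures copied from the two tables above.
median_build_min = {
    "Tcp.nolock": 11.2,
    "Tcp.stock": 13.6,
    "Udp.nolock": 14.9,
    "Udp.stock": 20.6,
}
avg_nfsd_cpu = {
    "Tcp.nolock": 197.46,
    "Tcp.stock": 164.872,
    "Udp.nolock": 374.656,
    "Udp.stock": 339.156,
}


def cpu_minutes_per_build(run):
    """Estimate nfsd cpu-minutes consumed over one median-length build.

    ASSUMPTION: the cpu figure is a percentage of one core (as top reports
    WCPU), averaged over the same interval as the builds.
    """
    return avg_nfsd_cpu[run] / 100.0 * median_build_min[run]


def summarize(proto):
    """Compare the patched (nolock) and stock runs for one transport."""
    patched, stock = proto + ".nolock", proto + ".stock"
    return {
        # How much shorter the median build is with the patches.
        "build_time_ratio": median_build_min[patched] / median_build_min[stock],
        "patched_cpu_min": cpu_minutes_per_build(patched),
        "stock_cpu_min": cpu_minutes_per_build(stock),
    }


if __name__ == "__main__":
    for proto in ("Tcp", "Udp"):
        s = summarize(proto)
        print("%s: build time ratio %.2f, cpu-min per build %.1f (patched) vs %.1f (stock)"
              % (proto, s["build_time_ratio"], s["patched_cpu_min"], s["stock_cpu_min"]))
```

Under those assumptions the estimate comes out to roughly 22.1 vs 22.4 cpu-minutes per build for tcp and 55.8 vs 69.9 for udp (patched vs stock). If that reading is right, the higher average usage with the patches may just be about the same amount of work done in less time, rather than extra overhead.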

Thoughts?

Thanks,
John




