Date:      Wed, 6 Jul 2011 10:32:18 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        John <jwd@slowblink.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: New NFS server stress test hang
Message-ID:  <1060425320.262543.1309962738973.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20110706011544.GA69706@FreeBSD.org>

John De wrote:
----- Original Message -----
> ----- John's Original Message -----
> > ----- Rick Macklem's Original Message -----
> > > John De wrote:
> > > > ----- Rick Macklem's Original Message -----
> > > > > John De wrote:
> > > > > > Hi,
> > > > > >
> > > > > > We've been running some stress tests of the new nfs server.
> > > > > > The system is at r222531 (head), 9 clients, two mounts each
> > > > > > to the server:
> > > > > >
> > > > > > mount_nfs -o
> > > > > > udp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=2
> > > > > > ${servera}:/vol/datsrc /c/$servera/vol/datsrc
> > > > > > mount_nfs -o
> > > > > > udp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=0
> > > > > > ${servera}:/vol/datgen /c/$servera/vol/datgen
> > > > > >
> > > > > >
> > > > > > The system is still up & responsive, simply no nfs services
> > > > > > are working. All (200) threads appear to be active, but not
> > > > > > doing anything. The debugger is not compiled into this
> > > > > > kernel.
> > > > > > We can run any other tracing commands desired. We can also
> > > > > > rebuild the kernel with the debugger enabled for any kernel
> > > > > > debugging needed.
> > > > > >
> > > > > > --- long logs deleted ---
> > > > >
> > > > > How about a:
> > > > >  ps axHlww     <-- With the "H" we'll see what the nfsd server threads are up to
> > > > >  procstat -kka
> > > > >
> > > > > Oh, and a couple of nfsstats a few seconds apart. It's what the counts
> > > > > are changing by that might tell us what is going on. (You can use "-z"
> > > > > to zero them out, if you have an nfsstat built from recent sources.)
> > > > >
> > > > > Also, does a new NFS mount attempt against the server do anything?
> > > > >
> > > > > Thanks in advance for help with this, rick
> > > >
> > > > Hi Rick,
> > > >
> > > > Here's the output. In general, the nfsd processes appear to be in either
> > > > nfsrvd_getcache (35 instances) or nfsrvd_updatecache (164), sleeping on
> > > > "nfssrc". The server numbers don't appear to be moving. A showmount from a
> > > > client system works, but a mount does not (see below).
> > >
> > > Please try the attached patch and let me know if it helps. When I looked,
> > > I found several places where the rc_flag variable was being fiddled
> > > without the mutex held. I suspect one of these resulted in the RC_LOCKED
> > > flag not getting cleared, so all the threads got stuck waiting on it.
> > >
> > > The patch is at:
> > >   http://people.freebsd.org/~rmacklem/cache.patch
> > > in case it gets eaten by the list handler.
> > > Thanks for digging into this, rick
> >
> > Hi Rick,
> >
> >    Patch applied. The system has been up and running for about
> > 16 hours now and so far it's still handling the load quite nicely.
> >
> > last pid: 15853; load averages: 5.36, 4.64, 4.48 up 0+16:08:16
> > 08:48:07
> > 72 processes: 7 running, 65 sleeping
> > CPU: % user, % nice, % system, % interrupt, % idle
> > Mem: 22M Active, 3345M Inact, 79G Wired, 9837M Buf, 11G Free
> > Swap:
> >
> >   PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
> >  2049 root 26 52 0 10052K 1712K CPU3 3 97:21 942.24% nfsd
> >
> >    I'll followup again in 24 hours with another status.
> >
> >    Any performance related numbers/knobs we can provide that might
> > be of interest?
> >
> >    Thanks Rick.
> >
> > -John
> 
> Hi Rick,
> 
> We've run the nfs share patches and produced some numbers, no errors
> seen. A few questions about times.
> 
> Four sets of repeated runs, each set run for about 2 days, with the
> following nfs options:
> 
> tcp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=0
> tcp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=2
> udp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=0
> udp,nfsv3,rsize=32768,wsize=32768,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=2
> 
> negnametimeo=2 for /datsrc (src), 0 for /datgen (obj). Switched between tcp
> and udp, otherwise left the options the same.
> 
> stock = stock system
> nolock = share patches
> 
I assume this means without/with the patch I sent you. If not, the following
will be completely bogus.

> Run          Median build time (min)   Max build time (min)
> Tcp.nolock:  11.2                      11.6
> Tcp.stock:   13.6                      20.8  (some of these runs hit cooling
>                                              issues, i.e. additional heat)
> Udp.nolock:  14.9                      15.3
> Udp.stock:   20.6                      20.7
> 
> Average nfsd CPU usage (%):
> 
> Tcp.nolock: 197.46
> Tcp.stock: 164.872
> Udp.nolock: 374.656
> Udp.stock: 339.156
> 
> These CPU numbers seem a bit confusing. UDP seems to have more
> overhead. The share patches seem faster wall-clock-wise, but use more CPU.
> 
Ok, well, without the patch you were simply getting incorrect/buggy
behaviour, so I don't think those stats mean much. (I won't try to
guess what the buggy behaviour really was. The fun part of any caching
mechanism is that most bugs just affect performance. :-)
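
To illustrate what I mean (made-up names, a rough userland sketch rather than
the actual sys/fs/nfsserver code), the locking rule the patch enforces is that
every change to the entry's flag word, including clearing the LOCKED bit and
waking waiters, happens with the mutex held:

#include <pthread.h>

#define	CE_LOCKED	0x01	/* entry is in use by a thread */
#define	CE_WANTED	0x02	/* some thread is sleeping on it */

struct cache_entry {
	pthread_mutex_t	ce_mtx;
	pthread_cond_t	ce_cv;
	int		ce_flag;
};

/* Acquire exclusive use of the entry, sleeping if another thread has it. */
static void
cache_entry_lock(struct cache_entry *ce)
{
	pthread_mutex_lock(&ce->ce_mtx);
	while (ce->ce_flag & CE_LOCKED) {
		ce->ce_flag |= CE_WANTED;
		pthread_cond_wait(&ce->ce_cv, &ce->ce_mtx);
	}
	ce->ce_flag |= CE_LOCKED;
	pthread_mutex_unlock(&ce->ce_mtx);
}

/*
 * Release the entry. Clearing CE_LOCKED without ce_mtx held (the kind of
 * thing the patch removes) can race with a sleeper above and leave every
 * nfsd thread waiting forever on an entry that nobody holds.
 */
static void
cache_entry_unlock(struct cache_entry *ce)
{
	pthread_mutex_lock(&ce->ce_mtx);
	ce->ce_flag &= ~CE_LOCKED;
	if (ce->ce_flag & CE_WANTED) {
		ce->ce_flag &= ~CE_WANTED;
		pthread_cond_broadcast(&ce->ce_cv);
	}
	pthread_mutex_unlock(&ce->ce_mtx);
}

That matches the hang you saw: threads piled up in nfsrvd_getcache()/
nfsrvd_updatecache() sleeping on "nfssrc", waiting for a lock bit that was
never going to be cleared.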

W.r.t. UDP vs TCP... for TCP, the reply cache only needs to keep the last
few (often only the last one) requests seen on the TCP connection, since the
only client-side RPC retry will occur much later, for the case where a TCP
connection is broken by a long network partition or similar.

For UDP, the server must keep all the replies for at least 1 sec after they
are sent, since UDP can drop requests/replies at any time. This is going
to be more work and use more mbufs, etc.
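
In other words (again with made-up names, just a sketch of the retention rule
described above, not the real nfsrvd_*cache code), the test for whether a
cached reply can be thrown away looks something like:

#include <stdbool.h>
#include <time.h>

struct drc_entry {
	bool	de_tcp;		/* reply went out on a TCP connection */
	time_t	de_senttime;	/* when the reply was handed to the socket */
	int	de_newer;	/* newer requests seen since on this connection */
};

#define	DRC_UDP_HOLD	1	/* seconds a UDP reply must be retained */
#define	DRC_TCP_DEPTH	4	/* newer TCP requests before a reply can go */

static bool
drc_can_discard(const struct drc_entry *de, time_t now)
{
	/* TCP clients only retry after the connection itself has broken. */
	if (de->de_tcp)
		return (de->de_newer > DRC_TCP_DEPTH);
	/* UDP can drop a request/reply at any time, so hold the reply a while. */
	return (now - de->de_senttime >= DRC_UDP_HOLD);
}

So every UDP reply has to be held (and its mbufs with it) for at least a
second, while a TCP reply can usually be dropped almost right away.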

> Thoughts?
> 
One final comment. For TCP, rsize/wsize larger than 32768 should work.
If you don't specify them as mount arguments, the mount will use the largest
size supported by both the client and server. Currently this is MAXBSIZE == 64K
for FreeBSD, but I hope to try cranking that up soon. For Solaris 10, it
is 1Mbyte.
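
For example, if you just drop rsize/wsize from your tcp mounts, something like:

  mount_nfs -o tcp,nfsv3,noatime,nolockd,acregmin=1,acregmax=2,acdirmin=1,acdirmax=2,negnametimeo=2 ${servera}:/vol/datsrc /c/$servera/vol/datsrc

the client and server will negotiate the largest size both support (64K today).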

The only situation I am aware of where larger rsize/wsize will result in
poorer performance is when the network fabric can't handle the larger bursts
of data traffic. (Since you are running 32K UDP, I don't think that will
be an issue for you. If it happens, it is usually obvious, in that performance
drops by an order of magnitude --> reads or writes take so long you think
the server has crashed.)

rick


