Date: Sat, 9 Mar 2013 12:34:50 -0500
From: Garrett Wollman <wollman@freebsd.org>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: freebsd-net@freebsd.org
Subject: Re: Limits on jumbo mbuf cluster allocation
Message-ID: <20795.29370.194678.963351@hergotha.csail.mit.edu>
In-Reply-To: <1700261042.3728432.1362847830447.JavaMail.root@erie.cs.uoguelph.ca>
References: <20794.37617.822910.93537@hergotha.csail.mit.edu> <1700261042.3728432.1362847830447.JavaMail.root@erie.cs.uoguelph.ca>
<<On Sat, 9 Mar 2013 11:50:30 -0500 (EST), Rick Macklem <rmacklem@uoguelph.ca> said:

> I suspect this indicates that it isn't mutex contention, since the
> threads would block waiting for the mutex for that case, I think?

No, because our mutexes are adaptive, so each thread spins for a while
before blocking.  With the current implementation, all of them end up
doing this in pretty close to lock-step.

> (Bumping up NFSRVCACHE_HASHSIZE can't hurt if/when you get the chance.)

I already have it set to 129 (up from 20); I could see putting it up
to, say, 1023.  It would be nice to have a sysctl for maximum chain
length to see how bad it's getting (and if the hash function is
actually effective); a rough sketch of such a sysctl is appended
below.

> I've thought about this. My concern is that the separate thread might
> not keep up with the trimming demand. If that occurred, the cache would
> grow veryyy laarrggge, with effects like running out of mbuf clusters.

At a minimum, once one nfsd thread is committed to doing the cache
trim, a flag should be set to discourage other threads from trying to
do it (also sketched below).  Having them all spinning their wheels
punishes the clients much too much.

> By having the nfsd threads do it, they slow down, which provides feedback
> to the clients (slower RPC replies -> generate fewer requests -> less to
> cache).  (I think you are probably familiar with the generic concept that
> a system needs feedback to remain stable.  An M/M/1 queue with open
> arrivals and no feedback to slow the arrival rate explodes when the
> arrival rate approaches the service rate, etc and so on...)

Unfortunately, the feedback channel that I have is: one user starts
500 virtual machines accessing a filesystem on the server -> other
users of this server see their goodput go to zero -> everyone sends
in angry trouble tickets -> I increase the DRC size manually.  It
would be nice if, by the time I next want to take a vacation, I have
this figured out.  I'm OK with throwing memory at the problem -- these
servers have 96 GB and can hold up to 144 GB -- so long as I can find
a tuning that provides stability and consistent, reasonable
performance for the users.

> The nfs server does soreserve(so, sb_max_adj, sb_max_adj); I can't
> recall exactly why it is that way, except that it needs to be large
> enough to handle the largest RPC request a client might generate.
> I should take another look at this, in case sb_max_adj is now
> too large?

It probably shouldn't be larger than net.inet.tcp.{send,recv}buf_max,
and the read and write sizes that are negotiated should be chosen so
that a whole RPC can fit in that space.  If that's too hard for
whatever reason, nfsd should at least log a message saying "hey, your
socket buffer limits are too small, I'm going to ignore them" (see
the last sketch below).

-GAWollman
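To make the chain-length sysctl idea above concrete, here is a rough
sketch using the stock sysctl(9) machinery.  Only NFSRVCACHE_HASHSIZE
comes from the discussion; the table and linkage names (nfsrvhashtbl,
rc_hash) and the vfs.nfsd parent node are stand-ins for whatever the
DRC actually uses, and a real version would hold the cache mutex
while it walks the chains:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>
    #include <sys/queue.h>

    /* "nfsrvhashtbl" and "rc_hash" are stand-ins for the real DRC names. */
    static int
    sysctl_drc_maxchain(SYSCTL_HANDLER_ARGS)
    {
            struct nfsrvcache *rp;
            int i, len, maxlen = 0;

            /* A real version would hold the DRC mutex across this walk. */
            for (i = 0; i < NFSRVCACHE_HASHSIZE; i++) {
                    len = 0;
                    LIST_FOREACH(rp, &nfsrvhashtbl[i], rc_hash)
                            len++;
                    if (len > maxlen)
                            maxlen = len;
            }
            return (sysctl_handle_int(oidp, &maxlen, 0, req));
    }
    SYSCTL_PROC(_vfs_nfsd, OID_AUTO, drc_maxchain,
        CTLTYPE_INT | CTLFLAG_RD, NULL, 0, sysctl_drc_maxchain, "I",
        "Longest observed DRC hash chain");

Watching that number while the 500-VM workload runs would show
directly whether going to 1023 buckets (or a better hash function) is
worth it.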
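The "only one trimmer at a time" flag could be as simple as a
compare-and-swap gate around the trim call.  The atomic(9) operations
are the stock ones; the function and flag names are made up for
illustration:

    #include <sys/types.h>
    #include <machine/atomic.h>

    static void nfsrc_trimcache(void);      /* stand-in for the real trim routine */
    static volatile u_int nfsrc_trimflag;   /* 0 = idle, 1 = a trim is in progress */

    static void
    nfsrc_maybe_trim(void)
    {
            /* Only the thread that wins the CAS pays for the trim. */
            if (atomic_cmpset_acq_int(&nfsrc_trimflag, 0, 1)) {
                    nfsrc_trimcache();
                    atomic_store_rel_int(&nfsrc_trimflag, 0);
            }
            /* Losers go straight back to serving RPCs instead of piling up. */
    }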
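For anyone who hasn't seen the queueing result Rick alludes to, the
standard M/M/1 mean-time-in-system formula (arrival rate \lambda,
service rate \mu) is where the "explodes" comes from:

    W = \frac{1}{\mu - \lambda}

so at \lambda = 0.9\mu the mean time in system is 10/\mu, at
\lambda = 0.99\mu it is 100/\mu, and it diverges as \lambda \to \mu;
without some feedback path, nothing ever slows \lambda down.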
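And here is roughly what the suggested soreserve() change would look
like.  soreserve(), ulmin()/ulmax(), log() and sb_max_adj are the
stock kernel interfaces; tcp_sndbuf_limit/tcp_rcvbuf_limit are
placeholders for whatever variables back net.inet.tcp.{send,recv}buf_max,
and maxrpc stands for the negotiated maximum RPC size:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/socket.h>
    #include <sys/socketvar.h>
    #include <sys/syslog.h>

    extern u_long tcp_sndbuf_limit, tcp_rcvbuf_limit;   /* placeholders */

    static int
    nfsrv_soreserve(struct socket *so, u_long maxrpc)
    {
            u_long sndcc, rcvcc;

            /* Clamp to the TCP buffer ceilings instead of sb_max_adj alone. */
            sndcc = ulmin(sb_max_adj, tcp_sndbuf_limit);
            rcvcc = ulmin(sb_max_adj, tcp_rcvbuf_limit);
            if (sndcc < maxrpc || rcvcc < maxrpc)
                    log(LOG_WARNING, "nfsd: socket buffer limits (%lu/%lu) "
                        "smaller than the maximum RPC (%lu); ignoring them\n",
                        sndcc, rcvcc, maxrpc);
            /* Never reserve less than one full RPC's worth of space. */
            return (soreserve(so, ulmax(sndcc, maxrpc), ulmax(rcvcc, maxrpc)));
    }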