Date: Sat, 23 Feb 2013 10:22:06 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Daniel Braniss <danny@cs.huji.ac.il>
Cc: freebsd-stable@freebsd.org
Subject: Re: zfs/nfs/proftpd problem
Message-ID: <807546569.3234119.1361632926179.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <E1U9AjU-0005ZX-Js@kabab.cs.huji.ac.il>

Daniel Braniss wrote:
> > Daniel Braniss wrote:
> > > > Daniel Braniss wrote:
> > > > > after upgrading the 'ftp storage' from 8.3 to 9.1-stable, our ftp
> > > > > server is stuck.
> > > > >
> > > > > the old, (ProFTPD Version 1.3.2) and working till before the upgrade
> > > > > is stuck in nlmrcv:
> > > > > ...
> > > > > 10000 1213 992 0 44 0 7340 3692 nlmrcv D ?? 0:08.07 proftpd: ftp - crawl-66-249-73-193.googlebot.com: anonymous/googlebot@google.com: RETR 00690145.JPG (proftpd)
> > > > > ...
> > > > >
> > > > I suspect you know that this is waiting for a reply from some
> > > > rpc.lockd.
> > > >
> > > > > so we upgraded the ftp server too, to 9.1/ProFTPD Version 1.3.4b
> > > > > and this one is stuck in rpccwnd:
> > > > > 10000 1197 984 0 20 0 32292 4792 rpccwnd D ?? 0:00.01 proftpd: ftp - mbpro.cs.huji.ac.il: anonymous/mozilla@example.com: LIST (proftpd)
> > > > >
> > > > This one is stuck in the client side of UDP for the krpc, in the
> > > > primitive congestion control stuff that is there.
> > > may be it's too primitive?
> > >
> > Yes, but the only alternative is no congestion avoidance at all. The
> > RPC RTT includes round trip time for the messages plus the delay for
> > processing the RPC at the server. The latter is highly variable and
> > depends greatly on what the RPC is and how heavily loaded the server
> > is. (The pre-krpc NFS client could do a little better, since it "knew"
> > what the RPC was and could assume "writes" would take a lot longer
> > than a Getattr. A generic krpc implementation can't know anything
> > about what the RPC does.)
> >
> > If your network fabric needs congestion control to achieve low
> > loss of packets, then TCP is the way to go. Remember that, if any
> > packet in a request/response is lost, the entire RPC must be retried
> > when running over UDP.
> >
> > > > > any wise suggestions :-)
> > > > >
> > > > Well, maybe not wise, but you may already be aware that NFS etc over
> > > > UDP and the NLM are two of my favourite things (especially the NLM).
> > > >
> > > > Basically, it appears to be having difficulties doing RPCs over UDP,
> > > > at least for the NLM (rpc.lockd), suggesting some transport related
> > > > issue.
> > > >
> > > > First, make sure rpc.statd and rpc.lockd are running on the NFS server
> > > > and all clients (or disable use of it via the "nolockd" mount option).
> > > all are running both rpc.statd and rpc.lockd
> > > >
> > > > You can also do a "netstat -s" and see if there is a non-zero count
> > > > for "fragments dropped due to timeout" in the IP section. (This happens
> > > > when your network fabric can't handle the burst of IP fragments
> > > > generated by a large RPC message over UDP.)
> > > >
> > >
> > > there are none on the client (the ftp server)
> > >
> > > > Things you could try:
> > > > - If you are using a udp mount for NFS...
> > > >   - reduce your rsize and wsize (especially if "fragments dropped
> > > >     due to timeout" is non-zero)
> > > >   or
> > > >   - switch to TCP
> > > >
> > > > If you are not using udp mounts, then the NLM (rpc.lockd) is using
> > > > UDP anyhow. If you don't need multiple NFS clients to see the file
> > > > locks, add "nolockd" to your mount(s).
> > > >
> > > > Beyond that, you'll need to capture packets and look at them in
> > > > wireshark, to see what is going on.
> > > >
> > > the mount is tcp.
> > > I have been staring at the tcpdump and nothing sticks out, but it's
> > > been a while since I looked at rpc traffic.
> > >
> > I'll assume you are looking at it using wireshark (tcpdump doesn't
> > understand these protocols).
> > You would be looking for repeated RPC request messages without
> > a corresponding RPC reply from the other end. They would be NLM or
> > NSM protocol RPCs. (Since you mentioned that your NFS mount was TCP,
> > it must be the NLM and/or associated NSM stuff that is using UDP,
> > I think.)
> >
> yup, wireshark, i only used tcpdump to capture since the link is slow
> to run wireshark over it.
>
> > > some facts:
> > > it happens every time, with any ftp command, it gets stuck on either
> > > nlmrcv or rpccwnd, mostly the latter.
> > > I will try to disable the lock stuff, but isn't it avoiding the
> > > issue?
> > >
> > If you mean "avoiding the use of a fundamentally flawed protocol
> > designed in the 1980s for a handful of locally connected machines that
> > always remain up with the same fixed hostname/ip address" then, yes,
> > you are trying to avoid the issue. Further to that, the protocol was
> > never well published, so implementations "guessed" w.r.t. the semantics
> > for things like "how many times should the NSM try to ping another
> > machine before assuming the other machine has crashed and lost the
> > file lock".
>
> ah, where are the days of ND - keep forgetting how old NFS is (and me
> too :-)
>
> >
> > One of the main parts of NFSv4 was an effort to fix file
> > locking. Although it isn't widely adopted yet, it is now a 10 year old
> > protocol. (RFC3530 is dated April 2003, if I recall correctly.)
> >
> > Since rpc.lockd and rpc.statd haven't changed much in a long time and
> > are essentially the same in 8.3 as stable/9, I'd suspect that something
> > else has broken this (assuming it worked fine for 8.3). I would suspect
> > the network device driver for your hardware and I'd suggest trying
> > things like disabling checksum offload options, TSO and anything else
> > you can try via ifconfig.
> > If you happen to have a different kind of network hardware port,
> > I'd try switching to that as well.
>
> I will have to sniff the packets on the ftp server too, it's
> complaining the lockd is not responding, so maybe something is lost on
> the way.
>
> setting nolockd on the mount solved the problem!
>
> now, if you are willing to help, I can continue
> experimenting/debugging, since it's very easy to cause the problem,
> and I think there are more issues here, since other servers are
> complaining too, but there it's more difficult to debug,
> intermittent not responding etc.
>
Well, I won't be of much help. I am away from home, so I don't have
wireshark available and I'm not that familiar with the NLM and NSM
protocols (plus you've already figured out what I think of them;-).

They seem to work ok when all the machines stay up and the network
delivers packets reliably between them. I'd suspect some sort of
network layer issue. A couple of possibilities:
- UDP checksum problems
- IP broadcast problems (I'm pretty sure the NSM and maybe NLM depend
  on broadcast working.)

Good luck with it, rick

> thanks and cheers,
> danny
>
> > rick
> >
> > > > Good luck with it, rick
> > > thanks,
> > > danny
> > >
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
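
The advice in the thread to reduce rsize/wsize on UDP mounts follows from the point that losing any one packet forces the whole RPC to be retried. A rough sketch of that arithmetic is below. It is not from the thread: the 1500-byte MTU, the 20-byte IPv4 header without options, independent per-packet loss, and the function names are all assumptions for illustration, and RPC/NFS header overhead is ignored.

```python
import math


def udp_fragments(rpc_bytes, mtu=1500, ip_header=20, udp_header=8):
    """Estimate how many IP fragments one UDP datagram of rpc_bytes needs.

    Assumes IPv4 with no options. Fragment payloads must be a multiple
    of 8 bytes, so the usable size per fragment is rounded down.
    """
    per_fragment = (mtu - ip_header) // 8 * 8
    return math.ceil((rpc_bytes + udp_header) / per_fragment)


def rpc_delivery_probability(rpc_bytes, packet_loss, mtu=1500):
    """Chance that every fragment of one RPC message arrives.

    Assumes each fragment is lost independently with probability
    packet_loss; a single lost fragment means the RPC is retried.
    """
    return (1.0 - packet_loss) ** udp_fragments(rpc_bytes, mtu)


if __name__ == "__main__":
    # Compare a large and a small transfer size at 1% packet loss.
    for size in (32768, 8192):
        frags = udp_fragments(size)
        prob = rpc_delivery_probability(size, 0.01)
        print(f"{size} bytes: {frags} fragments, P(delivered) = {prob:.3f}")
```

Under these assumptions a 32768-byte message needs 23 fragments and an 8192-byte one needs 6, so the smaller size is noticeably less likely to trigger a full retry on a lossy path.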