Date: Wed, 17 Mar 2021 21:37:20 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Jason Breitman <jbreitman@tildenparkcapital.com>, "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>
Subject: Re: NFS Mount Hangs
Message-ID: <YQXPR0101MB0968DC18E00833DE2969C636DD6A9@YQXPR0101MB0968.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <33693DE3-7FF8-4FAB-9A75-75576B88A566@tildenparkcapital.com>
References: <C643BB9C-6B61-4DAC-8CF9-CE04EA7292D0@tildenparkcapital.com> <3750001D-3F1C-4D9A-A9D9-98BCA6CA65A4@tildenparkcapital.com>, <33693DE3-7FF8-4FAB-9A75-75576B88A566@tildenparkcapital.com>

Jason Breitman wrote:
>Please review the details below and let me know if there is a setting that I should
>apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my
>issue.
>I shared this information with the linux-nfs mailing list and they believe the issue is
>on the server side.
I actually lurk there and saw your post. I'll admit I smiled when Trond argued
that a hung Linux system is the result of a server failing to send a fin/ack for
a closing TCP connection. But, here's a few comments..

>Issue
>NFSv4 mounts periodically hang on the NFS Client.
>
>During this time, it is possible to manually mount from another NFS Server on the
>NFS Client having issues.
>Also, other NFS Clients are successfully mounting from the NFS Server in question.
>Rebooting the NFS Client appears to be the only solution.
>
>Environment
>NFS Server
>OS: FreeBSD 12.1-RELEASE-p5
>
>NFS Client
>OS: Debian Buster 10.8
>Kernel: 4.19.171-2
>Protocol: NFSv4 with Kerberos Security
>Mount Options: nfs-server.domain.com:/data /mnt/data nfs4 lookupcache=pos,noresvport,sec=krb5,hard,rsize=1048576,wsize=1048576 00
The maximum I/O size supported by FreeBSD is 128K.
The client should acquire the attributes that indicate that and set rsize/wsize
to that. "# nfsstat -m" on the client should show you what the client
is actually using. If it is larger than 128K, set both rsize and wsize to 128K.

>Output from the NFS Client when the issue occurs
># netstat -an | grep NFS.Server.IP.X
>tcp 0 0 NFS.Client.IP.X:46896 NFS.Server.IP.X:2049 FIN_WAIT2
I'm no TCP guy. Hopefully others might know why the client would be
stuck in FIN_WAIT2 (I vaguely recall this means it is waiting for a fin/ack,
but could be wrong?)

># cat /sys/kernel/debug/sunrpc/rpc_xprt/*/info
>netid: tcp
>addr: NFS.Server.IP.X
>port: 2049
>state: 0x51
>
>syslog
>Mar 4 10:29:27 hostname kernel: [437414.131978] -pid- flgs status -client- --rqstp- -timeout ---ops--
>Mar 4 10:29:27 hostname kernel: [437414.133158] 57419 40a1 0 9b723c73 143cfadf 30000 4ca953b5 nfsv4 OPEN_NOATTR a:call_connect_status [sunrpc] q:xprt_pending
I don't know what OPEN_NOATTR means, but I assume it is some variant
of NFSv4 Open operation.
[stuff snipped]
>Mar 4 10:29:30 hostname kernel: [437417.110517] RPC: 57419 xprt_connect_status: connect attempt timed out
>Mar 4 10:29:30 hostname kernel: [437417.112172] RPC: 57419 call_connect_status (status -110)
I have no idea what status -110 means?
>Mar 4 10:29:30 hostname kernel: [437417.113337] RPC: 57419 call_timeout (major)
>Mar 4 10:29:30 hostname kernel: [437417.114385] RPC: 57419 call_bind (status 0)
>Mar 4 10:29:30 hostname kernel: [437417.115402] RPC: 57419 call_connect xprt 00000000e061831b is not connected
>Mar 4 10:29:30 hostname kernel: [437417.116547] RPC: 57419 xprt_connect xprt 00000000e061831b is not connected
>Mar 4 10:30:31 hostname kernel: [437478.551090] RPC: 57419 xprt_connect_status: connect attempt timed out
>Mar 4 10:30:31 hostname kernel: [437478.552396] RPC: 57419 call_connect_status (status -110)
>Mar 4 10:30:31 hostname kernel: [437478.553417] RPC: 57419 call_timeout (minor)
>Mar 4 10:30:31 hostname kernel: [437478.554327] RPC: 57419 call_bind (status 0)
>Mar 4 10:30:31 hostname kernel: [437478.555220] RPC: 57419 call_connect xprt 00000000e061831b is not connected
>Mar 4 10:30:31 hostname kernel: [437478.556254] RPC: 57419 xprt_connect xprt 00000000e061831b is not connected
Is it possible that the client is trying to (re)connect using the same client port#?
I would normally expect the client to create a new TCP connection using a
different client port# and then retry the outstanding RPCs.
--> Capturing packets when this happens would show us what is going on.

If there is a problem on the FreeBSD end, it is most likely a broken
network device driver.
--> Try disabling TSO , LRO.
--> Try a different driver for the net hardware on the server.
--> Try a different net chip on the server.
If you can capture packets when (not after) the hang
occurs, then you can look at them in wireshark and see
what is actually happening. (Ideally on both client and
server, to check that your network hasn't dropped anything.)
--> I know, if the hangs aren't easily reproducible, this isn't
    easily done.
--> Try a newer Linux kernel and see if the problem persists.
    The Linux folk will get more interested if you can reproduce
    the problem on 5.12. (Recent bakeathon testing of the 5.12
    kernel against the FreeBSD server did not find any issues.)

Hopefully the network folk have some insight w.r.t. why
the TCP connection is sitting in FIN_WAIT2.

rick

Jason Breitman

_______________________________________________
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
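
The rsize/wsize advice in the reply (cap both at 128K when the client shows something larger) can be sketched as a small Python helper. This is only an illustration under stated assumptions: `clamp_io_sizes` and `MAX_IO_SIZE` are hypothetical names that do not come from the thread, and the 128K figure is simply the limit quoted in the reply for the FreeBSD server. The helper rewrites an option string and nothing more; what the client actually negotiated still has to be checked with "nfsstat -m".

```python
# Hypothetical helper (not from the thread): cap rsize/wsize in a
# comma-separated NFS mount option string.
MAX_IO_SIZE = 128 * 1024  # 131072 bytes, the server limit quoted in the reply


def clamp_io_sizes(options: str, limit: int = MAX_IO_SIZE) -> str:
    """Return the option string with rsize/wsize reduced to at most `limit`."""
    clamped = []
    for opt in options.split(","):
        # Options are either "key=value" or a bare flag such as "hard".
        key, sep, value = opt.partition("=")
        if key in ("rsize", "wsize") and sep and value.isdigit():
            value = str(min(int(value), limit))
        clamped.append(f"{key}{sep}{value}")
    return ",".join(clamped)


if __name__ == "__main__":
    # Option list modelled on the one quoted in the message.
    opts = "lookupcache=pos,noresvport,sec=krb5,hard,rsize=1048576,wsize=1048576"
    print(clamp_io_sizes(opts))
```

Values already at or below the limit are left alone, so the helper is safe to run on option strings that were tuned by hand.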