From owner-freebsd-net@freebsd.org Sun Apr 4 17:27:33 2021 Return-Path: Delivered-To: freebsd-net@mailman.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.nyi.freebsd.org (Postfix) with ESMTP id 5BEF25AD078 for ; Sun, 4 Apr 2021 17:27:33 +0000 (UTC) (envelope-from freebsd-rwg@gndrsh.dnsmgr.net) Received: from gndrsh.dnsmgr.net (br1.CN84in.dnsmgr.net [69.59.192.140]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4FD12D4vMwz3hgM; Sun, 4 Apr 2021 17:27:31 +0000 (UTC) (envelope-from freebsd-rwg@gndrsh.dnsmgr.net) Received: from gndrsh.dnsmgr.net (localhost [127.0.0.1]) by gndrsh.dnsmgr.net (8.13.3/8.13.3) with ESMTP id 134HRTjO097116; Sun, 4 Apr 2021 10:27:29 -0700 (PDT) (envelope-from freebsd-rwg@gndrsh.dnsmgr.net) Received: (from freebsd-rwg@localhost) by gndrsh.dnsmgr.net (8.13.3/8.13.3/Submit) id 134HRTbA097115; Sun, 4 Apr 2021 10:27:29 -0700 (PDT) (envelope-from freebsd-rwg) From: "Rodney W. Grimes" Message-Id: <202104041727.134HRTbA097115@gndrsh.dnsmgr.net> Subject: Re: NFS Mount Hangs In-Reply-To: To: Rick Macklem Date: Sun, 4 Apr 2021 10:27:28 -0700 (PDT) CC: "Scheffenegger, Richard" , "tuexen@freebsd.org" , Youssef GHORBAL , "freebsd-net@freebsd.org" X-Mailer: ELM [version 2.4ME+ PL121h (25)] MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=US-ASCII X-Rspamd-Queue-Id: 4FD12D4vMwz3hgM X-Spamd-Bar: ---- Authentication-Results: mx1.freebsd.org; none X-Spamd-Result: default: False [-4.00 / 15.00]; REPLY(-4.00)[] X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 04 Apr 2021 17:27:33 -0000 And I'll follow the lead to top post, as I have been quietly following this thread, trying to only add when I think I have relevant input, and I think I do on a small point... Rick, Your "unplugging" a cable to simulate network partitioning, in my experience this is a bad way to do that, as the host gets to see the link layer go down and knows it can not send. I am actually puzzled that you see arp packets, but I guess those are getting picked off before the interface layer silently tosses them on the ground. IIRC due to this loss of link layer you may be masking some things that would occur in other situations as often an error is returned to the application layer. IE the ONLY packet your likely to see into an unplugged cable is "arp". I can suggest other means to partition, such as configuring a switch port in and out of the correct LAN/VLAN, a physical switch in the TX pair to open it, but leave RX pair intact so carrier is not lost. Both of these simulate partitioning that is more realistic, AND does not have the side effect of allowing upper layers to eat the packets before bpf can grab them, or be told that partitioning has occured. Another side effect of unplugging a cable is that a host should immediately invalidate all ARP entries on that interface... hence your getting into an arp who has situation that should not even start for 5 minutes in the other failure modes. Regards, Rod > Well, I'm going to cheat and top post, since this is elated info. and > not really part of the discussion... > > I've been testing network partitioning between a Linux client (5.2 kernel) > and a FreeBSD-current NFS server. I have not gotten a solid hang, but > I have had the Linux client doing "battle" with the FreeBSD server for > several minutes after un-partitioning the connection. > > The battle basically consists of the Linux client sending an RST, followed > by a SYN. > The FreeBSD server ignores the RST and just replies with the same old ack. > --> This varies from "just a SYN" that succeeds to 100+ cycles of the above > over several minutes. > > I had thought that an RST was a "pretty heavy hammer", but FreeBSD seems > pretty good at ignoring it. > > A full packet capture of one of these is in /home/rmacklem/linuxtofreenfs.pcap > in case anyone wants to look at it. > > Here's a tcpdump snippet of the interesting part (see the *** comments): > 19:10:09.305775 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 202585:202749, ack 212293, win 29128, options [nop,nop,TS val 2073636037 ecr 2671204825], length 164: NFS reply xid 613153685 reply ok 160 getattr NON 4 ids 0/33554432 sz 0 > 19:10:09.305850 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 202749, win 501, options [nop,nop,TS val 2671204825 ecr 2073636037], length 0 > *** Network is now partitioned... > > 19:10:09.407840 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671204927 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53 > 19:10:09.615779 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205135 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53 > 19:10:09.823780 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 212293:212525, ack 202749, win 501, options [nop,nop,TS val 2671205343 ecr 2073636037], length 232: NFS request xid 629930901 228 getattr fh 0,1/53 > *** Lots of lines snipped. > > > 19:13:41.295783 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28 > 19:13:42.319767 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28 > 19:13:46.351966 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28 > 19:13:47.375790 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28 > 19:13:48.399786 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28 > *** Network is now unpartitioned... > > 19:13:48.399990 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46 > 19:13:48.400002 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671421871 ecr 0,nop,wscale 7], length 0 > 19:13:48.400185 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073855137 ecr 2671204825], length 0 > 19:13:48.400273 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0 > 19:13:49.423833 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671424943 ecr 0,nop,wscale 7], length 0 > 19:13:49.424056 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073856161 ecr 2671204825], length 0 > *** This "battle" goes on for 223sec... > I snipped out 13 cycles of this "Linux sends an RST, followed by SYN" > "FreeBSD replies with same old ACK". In another test run I saw this > cycle continue non-stop for several minutes. This time, the Linux > client paused for a while (see ARPs below). > > 19:13:49.424101 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [R], seq 964161458, win 0, length 0 > 19:13:53.455867 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 416692300, win 64240, options [mss 1460,sackOK,TS val 2671428975 ecr 0,nop,wscale 7], length 0 > 19:13:53.455991 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 212293, win 29127, options [nop,nop,TS val 2073860193 ecr 2671204825], length 0 > *** Snipped a bunch of stuff out, mostly ARPs, plus one more RST. > > 19:16:57.775780 ARP, Request who-has nfsv4-new3.home.rick tell nfsv4-linux.home.rick, length 28 > 19:16:57.775937 ARP, Reply nfsv4-new3.home.rick is-at d4:be:d9:07:81:72 (oui Unknown), length 46 > 19:16:57.980240 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46 > 19:16:58.555663 ARP, Request who-has nfsv4-new3.home.rick tell 192.168.1.254, length 46 > 19:17:00.104701 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074046846 ecr 2671204825], length 0 > 19:17:15.664354 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [F.], seq 202749, ack 212293, win 29128, options [nop,nop,TS val 2074062406 ecr 2671204825], length 0 > 19:17:31.239246 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [R.], seq 202750, ack 212293, win 0, options [nop,nop,TS val 2074077981 ecr 2671204825], length 0 > *** FreeBSD finally acknowledges the RST 38sec after Linux sent the last > of 13 (100+ for another test run). > > 19:17:51.535979 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [S], seq 4247692373, win 64240, options [mss 1460,sackOK,TS val 2671667055 ecr 0,nop,wscale 7], length 0 > 19:17:51.536130 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [S.], seq 661237469, ack 4247692374, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 2074098278 ecr 2671667055], length 0 > *** Now back in business... > > 19:17:51.536218 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [.], ack 1, win 502, options [nop,nop,TS val 2671667055 ecr 2074098278], length 0 > 19:17:51.536295 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 1:233, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 232: NFS request xid 629930901 228 getattr fh 0,1/53 > 19:17:51.536346 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 233:505, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098278], length 272: NFS request xid 697039765 132 getattr fh 0,1/53 > 19:17:51.536515 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [.], ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 0 > 19:17:51.536553 IP nfsv4-linux.home.rick.apex-mesh > nfsv4-new3.home.rick.nfsd: Flags [P.], seq 505:641, ack 1, win 502, options [nop,nop,TS val 2671667056 ecr 2074098279], length 136: NFS request xid 730594197 132 getattr fh 0,1/53 > 19:17:51.536562 IP nfsv4-new3.home.rick.nfsd > nfsv4-linux.home.rick.apex-mesh: Flags [P.], seq 1:49, ack 505, win 29128, options [nop,nop,TS val 2074098279 ecr 2671667056], length 48: NFS reply xid 697039765 reply ok 44 getattr ERROR: unk 10063 > > This error 10063 after the partition heals is also "bad news". It indicates the Session > (which is supposed to maintain "exactly once" RPC semantics is broken). I'll admit I > suspect a Linux client bug, but will be investigating further. > > So, hopefully TCP conversant folk can confirm if the above is correct behaviour > or if the RST should be ack'd sooner? > > I could also see this becoming a "forever" TCP battle for other versions of Linux client. > > rick > > > ________________________________________ > From: Scheffenegger, Richard > Sent: Sunday, April 4, 2021 7:50 AM > To: Rick Macklem; tuexen@freebsd.org > Cc: Youssef GHORBAL; freebsd-net@freebsd.org > Subject: Re: NFS Mount Hangs > > CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to IThelp@uoguelph.ca > > > For what it?s worth, suse found two bugs in the linux nfconntrack (stateful firewall), and pfifo-fast scheduler, which could conspire to make tcp sessions hang forever. > > One is a missed updaten when the c?ient is not using the noresvport moint option, which makes tje firewall think rsts are illegal (and drop them); > > The fast scheduler can run into an issue if only a single packet should be forwarded (note that this is not the default scheduler, but often recommended for perf, as it runs lockless and lower cpu cost that pfq (default). If no other/additional packet pushes out that last packet of a flow, it can become stuck forever... > > I can try getting the relevant bug info next week... > > ________________________________ > Von: owner-freebsd-net@freebsd.org im Auftrag von Rick Macklem > Gesendet: Friday, April 2, 2021 11:31:01 PM > An: tuexen@freebsd.org > Cc: Youssef GHORBAL ; freebsd-net@freebsd.org > Betreff: Re: NFS Mount Hangs > > NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe. > > > > > tuexen@freebsd.org wrote: > >> On 2. Apr 2021, at 02:07, Rick Macklem wrote: > >> > >> I hope you don't mind a top post... > >> I've been testing network partitioning between the only Linux client > >> I have (5.2 kernel) and a FreeBSD server with the xprtdied.patch > >> (does soshutdown(..SHUT_WR) when it knows the socket is broken) > >> applied to it. > >> > >> I'm not enough of a TCP guy to know if this is useful, but here's what > >> I see... > >> > >> While partitioned: > >> On the FreeBSD server end, the socket either goes to CLOSED during > >> the network partition or stays ESTABLISHED. > >If it goes to CLOSED you called shutdown(, SHUT_WR) and the peer also > >sent a FIN, but you never called close() on the socket. > >If the socket stays in ESTABLISHED, there is no communication ongoing, > >I guess, and therefore the server does not even detect that the peer > >is not reachable. > >> On the Linux end, the socket seems to remain ESTABLISHED for a > >> little while, and then disappears. > >So how does Linux detect the peer is not reachable? > Well, here's what I see in a packet capture in the Linux client once > I partition it (just unplug the net cable): > - lots of retransmits of the same segment (with ACK) for 54sec > - then only ARP queries > > Once I plug the net cable back in: > - ARP works > - one more retransmit of the same segement > - receives RST from FreeBSD > ** So, is this now a "new" TCP connection, despite > using the same port#. > --> It matters for NFS, since "new connection" > implies "must retry all outstanding RPCs". > - sends SYN > - receives SYN, ACK from FreeBSD > --> connection starts working again > Always uses same port#. > > On the FreeBSD server end: > - receives the last retransmit of the segment (with ACK) > - sends RST > - receives SYN > - sends SYN, ACK > > I thought that there was no RST in the capture I looked at > yesterday, so I'm not sure if FreeBSD always sends an RST, > but the Linux client behaviour was the same. (Sent a SYN, etc). > The socket disappears from the Linux "netstat -a" and I > suspect that happens after about 54sec, but I am not sure > about the timing. > > >> > >> After unpartitioning: > >> On the FreeBSD server end, you get another socket showing up at > >> the same port# > >> Active Internet connections (including servers) > >> Proto Recv-Q Send-Q Local Address Foreign Address (state) > >> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 ESTABLISHED > >> tcp4 0 0 nfsv4-new3.nfsd nfsv4-linux.678 CLOSED > >> > >> The Linux client shows the same connection ESTABLISHED. > But disappears from "netstat -a" for a while during the partitioning. > > >> (The mount sometimes reports an error. I haven't looked at packet > >> traces to see if it retries RPCs or why the errors occur.) > I have now done so, as above. > > >> --> However I never get hangs. > >> Sometimes it goes to SYN_SENT for a while and the FreeBSD server > >> shows FIN_WAIT_1, but then both ends go to ESTABLISHED and the > >> mount starts working again. > >> > >> The most obvious thing is that the Linux client always keeps using > >> the same port#. (The FreeBSD client will use a different port# when > >> it does a TCP reconnect after no response from the NFS server for > >> a little while.) > >> > >> What do those TCP conversant think? > >I guess you are you are never calling close() on the socket, for with > >the connection state is CLOSED. > Ok, that makes sense. For this case the Linux client has not done a > BindConnectionToSession to re-assign the back channel. > I'll have to bug them about this. However, I'll bet they'll answer > that I have to tell them the back channel needs re-assignment > or something like that. > > I am pretty certain they are broken, in that the client needs to > retry all outstanding RPCs. > > For others, here's the long winded version of this that I just > put on the phabricator review: > In the server side kernel RPC, the socket (struct socket *) is in a > structure called SVCXPRT (normally pointed to by "xprt"). > These structures a ref counted and the soclose() is done > when the ref. cnt goes to zero. My understanding is that > "struct socket *" is free'd by soclose() so this cannot be done > before the xprt ref. cnt goes to zero. > > For NFSv4.1/4.2 there is something called a back channel > which means that a "xprt" is used for server->client RPCs, > although the TCP connection is established by the client > to the server. > --> This back channel holds a ref cnt on "xprt" until the > > client re-assigns it to a different TCP connection > via an operation called BindConnectionToSession > and the Linux client is not doing this soon enough, > it appears. > > So, the soclose() is delayed, which is why I think the > TCP connection gets stuck in CLOSE_WAIT and that is > why I've added the soshutdown(..SHUT_WR) calls, > which can happen before the client gets around to > re-assigning the back channel. > > Thanks for your help with this Michael, rick > > Best regards > Michael > > > > rick > > ps: I can capture packets while doing this, if anyone has a use > > for them. > > > > > > > > > > > > > > ________________________________________ > > From: owner-freebsd-net@freebsd.org on behalf of Youssef GHORBAL > > Sent: Saturday, March 27, 2021 6:57 PM > > To: Jason Breitman > > Cc: Rick Macklem; freebsd-net@freebsd.org > > Subject: Re: NFS Mount Hangs > > > > CAUTION: This email originated from outside of the University of Guelph. Do not click links or open attachments unless you recognize the sender and know the content is safe. If in doubt, forward suspicious emails to IThelp@uoguelph.ca > > > > > > > > > > On 27 Mar 2021, at 13:20, Jason Breitman > wrote: > > > > The issue happened again so we can say that disabling TSO and LRO on the NIC did not resolve this issue. > > # ifconfig lagg0 -rxcsum -rxcsum6 -txcsum -txcsum6 -lro -tso -vlanhwtso > > # ifconfig lagg0 > > lagg0: flags=8943 metric 0 mtu 1500 > > options=8100b8 > > > > We can also say that the sysctl settings did not resolve this issue. > > > > # sysctl net.inet.tcp.fast_finwait2_recycle=1 > > net.inet.tcp.fast_finwait2_recycle: 0 -> 1 > > > > # sysctl net.inet.tcp.finwait2_timeout=1000 > > net.inet.tcp.finwait2_timeout: 60000 -> 1000 > > > > I don?t think those will do anything in your case since the FIN_WAIT2 are on the client side and those sysctls are for BSD. > > By the way it seems that Linux recycles automatically TCP sessions in FIN_WAIT2 after 60 seconds (sysctl net.ipv4.tcp_fin_timeout) > > > > tcp_fin_timeout (integer; default: 60; since Linux 2.2) > > This specifies how many seconds to wait for a final FIN > > packet before the socket is forcibly closed. This is > > strictly a violation of the TCP specification, but > > required to prevent denial-of-service attacks. In Linux > > 2.2, the default value was 180. > > > > So I don?t get why it stucks in the FIN_WAIT2 state anyway. > > > > You really need to have a packet capture during the outage (client and server side) so you?ll get over the wire chat and start speculating from there. > > No need to capture the beginning of the outage for now. All you have to do, is run a tcpdump for 10 minutes or so when you notice a client stuck. > > > > * I have not rebooted the NFS Server nor have I restarted nfsd, but do not believe that is required as these settings are at the TCP level and I would expect new sessions to use the updated settings. > > > > The issue occurred after 5 days following a reboot of the client machines. > > I ran the capture information again to make use of the situation. > > > > #!/bin/sh > > > > while true > > do > > /bin/date >> /tmp/nfs-hang.log > > /bin/ps axHl | grep nfsd | grep -v grep >> /tmp/nfs-hang.log > > /usr/bin/procstat -kk 2947 >> /tmp/nfs-hang.log > > /usr/bin/procstat -kk 2944 >> /tmp/nfs-hang.log > > /bin/sleep 60 > > done > > > > > > On the NFS Server > > Active Internet connections (including servers) > > Proto Recv-Q Send-Q Local Address Foreign Address (state) > > tcp4 0 0 NFS.Server.IP.X.2049 NFS.Client.IP.X.48286 CLOSE_WAIT > > > > On the NFS Client > > tcp 0 0 NFS.Client.IP.X:48286 NFS.Server.IP.X:2049 FIN_WAIT2 > > > > > > > > You had also asked for the output below. > > > > # nfsstat -E -s > > BackChannelCtBindConnToSes > > 0 0 > > > > # sysctl vfs.nfsd.request_space_throttle_count > > vfs.nfsd.request_space_throttle_count: 0 > > > > I see that you are testing a patch and I look forward to seeing the results. > > > > > > Jason Breitman > > > > > > On Mar 21, 2021, at 6:21 PM, Rick Macklem > wrote: > > > > Youssef GHORBAL > wrote: > >> Hi Jason, > >> > >>> On 17 Mar 2021, at 18:17, Jason Breitman > wrote: > >>> > >>> Please review the details below and let me know if there is a setting that I should apply to my FreeBSD NFS Server or if there is a bug fix that I can apply to resolve my issue. > >>> I shared this information with the linux-nfs mailing list and they believe the issue is on the server side. > >>> > >>> Issue > >>> NFSv4 mounts periodically hang on the NFS Client. > >>> > >>> During this time, it is possible to manually mount from another NFS Server on the NFS Client having issues. > >>> Also, other NFS Clients are successfully mounting from the NFS Server in question. > >>> Rebooting the NFS Client appears to be the only solution. > >> > >> I had experienced a similar weird situation with periodically stuck Linux NFS clients >mounting Isilon NFS servers (Isilon is FreeBSD based but they seem to have there >own nfsd) > > Yes, my understanding is that Isilon uses a proprietary user space nfsd and > > not the kernel based RPC and nfsd in FreeBSD. > > > >> We?ve had better luck and we did manage to have packet captures on both sides >during the issue. The gist of it goes like follows: > >> > >> - Data flows correctly between SERVER and the CLIENT > >> - At some point SERVER starts decreasing it's TCP Receive Window until it reachs 0 > >> - The client (eager to send data) can only ack data sent by SERVER. > >> - When SERVER was done sending data, the client starts sending TCP Window >Probes hoping that the TCP Window opens again so he can flush its buffers. > >> - SERVER responds with a TCP Zero Window to those probes. > > Having the window size drop to zero is not necessarily incorrect. > > If the server is overloaded (has a backlog of NFS requests), it can stop doing > > soreceive() on the socket (so the socket rcv buffer can fill up and the TCP window > > closes). This results in "backpressure" to stop the NFS client from flooding the > > NFS server with requests. > > --> However, once the backlog is handled, the nfsd should start to soreceive() > > again and this shouls cause the window to open back up. > > --> Maybe this is broken in the socket/TCP code. I quickly got lost in > > tcp_output() when it decides what to do about the rcvwin. > > > >> - After 6 minutes (the NFS server default Idle timeout) SERVER racefully closes the >TCP connection sending a FIN Packet (and still a TCP Window 0) > > This probably does not happen for Jason's case, since the 6minute timeout > > is disabled when the TCP connection is assigned as a backchannel (most likely > > the case for NFSv4.1). > > > >> - CLIENT ACK that FIN. > >> - SERVER goes in FIN_WAIT_2 state > >> - CLIENT closes its half part part of the socket and goes in LAST_ACK state. > >> - FIN is never sent by the client since there still data in its SendQ and receiver TCP >Window is still 0. At this stage the client starts sending TCP Window Probes again >and again hoping that the server opens its TCP Window so it can flush it's buffers >and terminate its side of the socket. > >> - SERVER keeps responding with a TCP Zero Window to those probes. > >> => The last two steps goes on and on for hours/days freezing the NFS mount bound >to that TCP session. > >> > >> If we had a situation where CLIENT was responsible for closing the TCP Window (and >initiating the TCP FIN first) and server wanting to send data we?ll end up in the same >state as you I think. > >> > >> We?ve never had the root cause of why the SERVER decided to close the TCP >Window and no more acccept data, the fix on the Isilon part was to recycle more >aggressively the FIN_WAIT_2 sockets (net.inet.tcp.fast_finwait2_recycle=1 & >net.inet.tcp.finwait2_timeout=5000). Once the socket recycled and at the next >occurence of CLIENT TCP Window probe, SERVER sends a RST, triggering the >teardown of the session on the client side, a new TCP handchake, etc and traffic >flows again (NFS starts responding) > >> > >> To avoid rebooting the client (and before the aggressive FIN_WAIT_2 was >implemented on the Isilon side) we?ve added a check script on the client that detects >LAST_ACK sockets on the client and through iptables rule enforces a TCP RST, >Something like: -A OUTPUT -p tcp -d $nfs_server_addr --sport $local_port -j REJECT >--reject-with tcp-reset (the script removes this iptables rule as soon as the LAST_ACK >disappears) > >> > >> The bottom line would be to have a packet capture during the outage (client and/or >server side), it will show you at least the shape of the TCP exchange when NFS is >stuck. > > Interesting story and good work w.r.t. sluething, Youssef, thanks. > > > > I looked at Jason's log and it shows everything is ok w.r.t the nfsd threads. > > (They're just waiting for RPC requests.) > > However, I do now think I know why the soclose() does not happen. > > When the TCP connection is assigned as a backchannel, that takes a reference > > cnt on the structure. This refcnt won't be released until the connection is > > replaced by a BindConnectiotoSession operation from the client. But that won't > > happen until the client creates a new TCP connection. > > --> No refcnt release-->no refcnt of 0-->no soclose(). > > > > I've created the attached patch (completely different from the previous one) > > that adds soshutdown(SHUT_WR) calls in the three places where the TCP > > connection is going away. This seems to get it past CLOSE_WAIT without a > > soclose(). > > --> I know you are not comfortable with patching your server, but I do think > > this change will get the socket shutdown to complete. > > > > There are a couple more things you can check on the server... > > # nfsstat -E -s > > --> Look for the count under "BindConnToSes". > > --> If non-zero, backchannels have been assigned > > # sysctl -a | fgrep request_space_throttle_count > > --> If non-zero, the server has been overloaded at some point. > > > > I think the attached patch might work around the problem. > > The code that should open up the receive window needs to be checked. > > I am also looking at enabling the 6minute timeout when a backchannel is > > assigned. > > > > rick > > > > Youssef > > > > _______________________________________________ > > freebsd-net@freebsd.org mailing list > > https://urldefense.com/v3/__https://lists.freebsd.org/mailman/listinfo/freebsd-net__;!!JFdNOqOXpB6UZW0!_c2MFNbir59GXudWPVdE5bNBm-qqjXeBuJ2UEmFv5OZciLj4ObR_drJNv5yryaERfIbhKR2d$ > > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > > > > > > > > > > _______________________________________________ > > freebsd-net@freebsd.org mailing list > > https://lists.freebsd.org/mailman/listinfo/freebsd-net > > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > > _______________________________________________ > > freebsd-net@freebsd.org mailing list > > https://lists.freebsd.org/mailman/listinfo/freebsd-net > > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > > _______________________________________________ > freebsd-net@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > _______________________________________________ > freebsd-net@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org" > > -- Rod Grimes rgrimes@freebsd.org