Date: Thu, 15 May 2014 11:22:50 -0600
From: "Janky Jay, III" <jankyj@unfs.us>
To: Karl Pielorz <kpielorz_lst@tdx.co.uk>, freebsd-infiniband@freebsd.org
Subject: Re: FBSD to FBSD NFS Mounts over IB.
Message-ID: <5374F7EA.6060505@unfs.us>
In-Reply-To: <55BC554716A7EA5C54F5DD02@study64.tdx.co.uk>
References: <5374D431.5020501@unfs.us> <55BC554716A7EA5C54F5DD02@study64.tdx.co.uk>

Hello Karl,

On 05/15/2014 10:50 AM, Karl Pielorz wrote:
>
>
> --On 15 May 2014 08:50:25 -0600 "Janky Jay, III" <jankyj@unfs.us> wrote:
>
>> I have set up one of the FBSD systems to run OpenSM and also be an NFS
>> server which all the systems seem to be able to mount over the IB devices
>> without any issue at all. Small reads and writes to and from the NFS
>> server to all the other nodes also seem to work without any issue.
>> However, if I try to dump large amounts of data using "dd" (in order to
>> test speeds and stability), the FBSD NFS client craps out immediately. I
>> just get the following message(s) over and over:
>>
>> newnfs server 10.11.1.1:/data: not responding
>> newnfs server 10.11.1.1:/data: not responding
>
> Can both sides 'ping' each other when this happens?
>

I just tested this while node2 was hanging with another NFS transfer
(just a "cp /home/file /data/file") and both nodes (1 and 2) can ping
each other without any issues.

> The reason I ask is I've hit a similar issue setting up ZFS over iSCSI
> on IPOIB (I'm not running connected mode).
>
> At my end it looks like an ARP expires or something so the sides 'lose
> sight' of each other. If 'A' can't see 'B' - a ping from 'B' to 'A'
> usually restores the connection.
>

I've actually seen this a lot with OpenVPN as well. For some reason, the
ARP entry does seem to expire or something and I can no longer reach other
systems on the LAN. And, like you, simply pinging the IP/hostname
resolves the issue and I am able to connect to services on the LAN
again... Very strange.

> Maybe make sure they can both still see each other outside of nfs - I
> can temporarily 'fix' the issue here by leaving both sides pinging each
> other - I've not really had a chance to look at it much recently...
>

I'll leave them both pinging each other for a while and see if the
transfer ends up finishing. Thanks for the info!

Also, I found a lot of the below in the /var/log/messages log files on
both FBSD systems over and over but they seem very random (time-wise):

May 15 03:14:26 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"
May 15 03:17:43 node1 kernel: "received MAD: slid:3 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"
May 15 03:17:50 node1 kernel: ib0: timing out; 2 sends not completed
May 15 03:18:22 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 10\n"<7>"received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 14\n"
May 15 03:19:26 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"
May 15 03:26:10 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"

Regards,
Janky Jay, III
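
To check whether the "received MAD" messages above really are random, or whether they cluster around the "ib0: timing out" events, one option is to tally them by source LID and attribute. The following is only a minimal sketch: the function name parse_mad_lines and the regular expressions are my own illustration, written against the exact text of the log lines quoted above, and they assume the fields always appear in that order. It makes no attempt to interpret what the cls, mtd, or atr values mean.

```python
import re
from collections import Counter

# Syslog prefix as it appears in the quoted /var/log/messages lines.
PREFIX_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel:"
)

# Field order copied from the quoted log lines (an assumption about the format).
# \W+ skips the stray quote characters between "sqpn" and "dlid_bits".
MAD_RE = re.compile(
    r"received MAD: slid:(?P<slid>\d+) sqpn:(?P<sqpn>\d+)\W+"
    r"dlid_bits:(?P<dlid_bits>\d+) dqpn:(?P<dqpn>\d+) "
    r"wc_flags:(?P<wc_flags>0x[0-9a-fA-F]+), "
    r"cls (?P<cls>\d+), mtd (?P<mtd>\d+), atr (?P<atr>\d+)"
)


def parse_mad_lines(lines):
    """Return one dict per 'received MAD' record found in the given lines."""
    records = []
    for line in lines:
        prefix = PREFIX_RE.match(line)
        if prefix is None:
            continue  # not a kernel syslog line in the expected shape
        # A single syslog line can carry more than one MAD record.
        for match in MAD_RE.finditer(line):
            record = {"stamp": prefix["stamp"], "host": prefix["host"]}
            for name in ("slid", "sqpn", "dlid_bits", "dqpn", "cls", "mtd", "atr"):
                record[name] = int(match[name])
            record["wc_flags"] = int(match["wc_flags"], 16)
            records.append(record)
    return records


# Three lines taken verbatim from the log excerpt above.
SAMPLE = r'''May 15 03:17:43 node1 kernel: "received MAD: slid:3 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"
May 15 03:17:50 node1 kernel: ib0: timing out; 2 sends not completed
May 15 03:18:22 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 10\n"<7>"received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 14\n"
'''

records = parse_mad_lines(SAMPLE.splitlines())
print("records:", len(records))
print("by slid:", dict(Counter(r["slid"] for r in records)))
print("by atr:", dict(Counter(r["atr"] for r in records)))
```

Feeding it the full /var/log/messages from both nodes (for example via open("/var/log/messages")) and comparing the timestamps against the "timing out" lines would show whether the two are related.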