From owner-freebsd-infiniband@FreeBSD.ORG Thu May 15 17:30:24 2014
Message-ID: <5374F7EA.6060505@unfs.us>
Date: Thu, 15 May 2014 11:22:50 -0600
From: "Janky Jay, III"
To: Karl Pielorz, freebsd-infiniband@freebsd.org
Subject: Re: FBSD to FBSD NFS Mounts over IB.
References: <5374D431.5020501@unfs.us> <55BC554716A7EA5C54F5DD02@study64.tdx.co.uk>
In-Reply-To: <55BC554716A7EA5C54F5DD02@study64.tdx.co.uk>

Hello Karl,

On 05/15/2014 10:50 AM, Karl Pielorz wrote:
>
> --On 15 May 2014 08:50:25 -0600 "Janky Jay, III" wrote:
>
>> I have set up one of the FBSD systems to run OpenSM and also be an NFS
>> server which all the systems seem to be able to mount over the IB devices
>> without any issue at all. Small reads and writes to and from the NFS
>> server to all the other nodes also seem to work without any issue.
>> However, if I try to dump large amounts of data using "dd" (in order to
>> test speeds and stability), the FBSD NFS client craps out immediately. I
>> just get the following message(s) over and over:
>>
>> newnfs server 10.11.1.1:/data: not responding
>> newnfs server 10.11.1.1:/data: not responding
>
> Can both sides 'ping' each other when this happens?
>

I just tested this while node2 was hanging on another NFS transfer (just a "cp /home/file /data/file"), and both nodes (1 and 2) can ping each other without any issues.

> The reason I ask is I've hit a similar issue setting up ZFS over iSCSI
> on IPOIB (I'm not running connected mode).
>
> At my end it looks like an ARP expires or something so the sides 'lose
> sight' of each other. If 'A' can't see 'B' - a ping from 'B' to 'A'
> usually restores the connection.
>

I've actually seen this a lot with OpenVPN as well. For some reason, the ARP entry does seem to expire or something, and I can no longer reach other systems on the LAN.
And, like you, simply pinging the IP/hostname resolves the issue and I am able to connect to services on the LAN again... Very strange.

> Maybe make sure they can both still see each other outside of nfs - I
> can temporarily 'fix' the issue here by leaving both sides pinging each
> other - I've not really had a chance to look at it much recently...
>

I'll leave them both pinging each other for a while and see if the transfer ends up finishing. Thanks for the info!

Also, I found a lot of the messages below in /var/log/messages on both FBSD systems. They repeat over and over, but the timing seems very random:

May 15 03:14:26 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"
May 15 03:17:43 node1 kernel: "received MAD: slid:3 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"
May 15 03:17:50 node1 kernel: ib0: timing out; 2 sends not completed
May 15 03:18:22 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 10\n"<7>"received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 14\n"
May 15 03:19:26 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"
May 15 03:26:10 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"

Regards,
Janky Jay, III
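
For anyone sifting through similar logs, the MAD messages quoted above could be tallied per source LID and attribute, and the "timing out" lines pulled out alongside them, to see whether the send timeouts line up with a particular peer. What follows is only a rough sketch: the regular expressions are inferred from the six log lines in this message and may not match other kernels, and the names `summarize` and `SAMPLE` are made up for illustration.

```python
import re
from collections import Counter

# Patterns inferred from the log excerpt in this message (an assumption);
# other kernel versions may format these lines differently.
MAD_RE = re.compile(r"received MAD: slid:(\d+) sqpn:\d+.*?atr (\d+)")
TIMEOUT_RE = re.compile(r"(\w+): timing out; (\d+) sends not completed")

# The six lines quoted in the message, kept as raw strings so the literal
# backslash-n sequences in the kernel output are preserved.
SAMPLE = [
    r'May 15 03:14:26 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"',
    r'May 15 03:17:43 node1 kernel: "received MAD: slid:3 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"',
    r'May 15 03:17:50 node1 kernel: ib0: timing out; 2 sends not completed',
    r'May 15 03:18:22 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 10\n"<7>"received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 14\n"',
    r'May 15 03:19:26 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"',
    r'May 15 03:26:10 node1 kernel: "received MAD: slid:4 sqpn:1 " "dlid_bits:0 dqpn:1 wc_flags:0x0, cls 7, mtd 3, atr 15\n"',
]


def summarize(lines):
    """Count MAD messages per (slid, atr) and collect send-timeout events."""
    mads = Counter()
    timeouts = []
    for line in lines:
        # One syslog line can carry several MAD messages, so find all of them.
        for slid, atr in MAD_RE.findall(line):
            mads[(int(slid), int(atr))] += 1
        match = TIMEOUT_RE.search(line)
        if match:
            # (interface name, number of sends reported as not completed)
            timeouts.append((match.group(1), int(match.group(2))))
    return mads, timeouts


if __name__ == "__main__":
    mads, timeouts = summarize(SAMPLE)
    for (slid, atr), count in sorted(mads.items()):
        print(f"slid {slid} atr {atr}: {count}")
    for iface, sends in timeouts:
        print(f"{iface}: {sends} sends not completed")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `summarize(open("/var/log/messages"))`. The counts only describe what was logged; they do not by themselves show whether the MAD traffic is related to the NFS hang.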