From owner-freebsd-fs@FreeBSD.ORG  Fri Jul 12 19:38:57 2013
Message-ID: <51E05B48.60607@terranova.net>
Date: Fri, 12 Jul 2013 15:38:48 -0400
From: Travis Mikalson <bofh@terranova.net>
Organization: TerraNovaNet Internet Services
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
To: Bob Healey <healer@rpi.edu>, freebsd-fs@freebsd.org
Subject: Re: Massive Problems with 10G, NFS, ZFS, and iSCSI
References: <51E032B5.9080705@rpi.edu>
In-Reply-To: <51E032B5.9080705@rpi.edu>
X-Enigmail-Version: 0.96.0
OpenPGP: url=http://www.terranova.net/pgp/bofh
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Bob Healey wrote:
> I've been beating my head against a brick wall for a week with this and
> 5 similar systems.
> 
> My current major headache:
> Dell Poweredge R610, dual quad core Xeon E5530 @ 2.4GHz, 24GB RAM 4
> onboard bce NICs, 1 mxge NIC, pair of 10K SAS drives on mpt (Dell MB SAS
> controller), pair of 15 drive 1TB RAID 6 arrays on mfi (PERC 6).
> 
> The machine was originally installed with FreeBSD 7.2 and has been
> upgraded through the years to 9.1.  None of the issues I'm currently
> seeing manifested themselves under 9.0.  When under heavy NFS load, the
> server currently becomes non-responsive on the network, unless the
> packet payload is very small (ICMP ping packets with > 124 bytes payload
> get dropped).
> 
> Current network config:
> bce0: management network, connected to the 37 IPMI controllers in the
> rack, has conserver running SOL connections to each
> bce1: link to outside world, everything in rack trying to reach outside
> is NATed through here
> bce2: used for a direct host to host ISCSI link to another host in the
> rack to provide a hard drive for a virtual machine. This machine is the
> iscsi target, and an 80GB zvol is the backing store.
> mxge0/vlan1: connected to first 25 machines in rack
> mxge0/vlan2: connected to remaining 12 machines in rack, plus a vm on
> host #25 on vlan 1
> 
> This is an HPC cluster, with all nodes running RHEL 5.  The landing pads
> (1 real, 1 virtual) are multihomed to both the internal and external
> networks, so the only traffic that crosses the NAT is software updates
> and job accounting information.
> 
> PF is used for firewalling and NAT.  skip is enabled on all internal
> interfaces.

I have zero experience with mxge NICs, and I expect others will have a
lot more to say, but the first thing I'd try in your shoes is complete
removal of pf from your kernel. Try replacing it with ipfw and see if it
helps any.

pf is generally not recommended above 1 Gbit/s because it still does all
of its packet processing under a single mutex.

I'm linking this for purposes of describing pf's current performance
limitations, not for the rest of the content of the post:
http://forum.pfsense.org/index.php?topic=50812.0;wap2
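
For a quick test, the swap might look roughly like the /etc/rc.conf
fragment below. This is only a sketch under several assumptions: that pf
and ipfw are loaded as modules (a pf compiled into the kernel needs a
kernel rebuild to remove it completely), that bce1 is the NAT interface
as described above, and that a wide-open ruleset is acceptable while
testing.

```shell
# /etc/rc.conf fragment (sketch, not a hardened ruleset)

# Stop loading and starting pf at boot.
pf_enable="NO"

# Start ipfw with the stock "open" ruleset from /etc/rc.firewall.
firewall_enable="YES"
firewall_type="open"

# In-kernel NAT on the outside interface (assumed to be bce1).
firewall_nat_enable="YES"
firewall_nat_interface="bce1"
```

The "open" type allows everything, so it only makes sense for finding out
whether pf is the bottleneck, not as a permanent replacement.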

> Stuff I've tried:  setting vfs.zfs.arc_max="20480M", disabling flow
> control on the 10G NIC, moving the ZIL to some unused space on the boot
> drive (RAID 1, mostly UFS).
> 
> I'm getting lots of Limiting open port RST response from 32325 to 200
> packets/sec in the logs, ISCSI timeouts on the client, and NFS server
> not responding errors.  netstat -i is showing lots of input errors on
> mxge, but i'm not seeing any errors on the switch (Dell Powerconnect
> 6248).  Myricom (nic vendor) is at a loss too.
> 
> Any ideas on what I should try next?  I'm at the point of throwing darts
> blindfolded.
> 
> I've got 5 more similar misbehaving machines, 4 of which behave just
> fine when using igb instead of mxge.

Again, I have no experience with mxge, good or bad, and I wouldn't rule
out the possibility that the mxge driver's performance is either not up
to snuff or requires tuning.

Another thing that comes to mind, which you haven't mentioned: have you
tuned your mbuf clusters upwards from the default?

My /boot/loader.conf, for a loaded box with only gigabit NICs, adjusts
things upwards like so:
kern.ipc.nmbclusters="262144"
kern.ipc.nmbjumbop="262144"
kern.ipc.nmbjumbo16="32000"
kern.ipc.nmbjumbo9="64000"
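
To put those numbers in perspective, here is a rough worst-case memory
calculation. These are limits, not preallocations. The cluster sizes are
the usual amd64 values (2 KiB standard clusters, 4 KiB page-size jumbo,
9 KiB and 16 KiB jumbo), which is an assumption about the platform. The
helper name cluster_mib is made up for this sketch.

```shell
#!/bin/sh
# cluster_mib COUNT SIZE_BYTES
# Prints the ceiling in whole MiB (integer division, so fractions are dropped).
cluster_mib() {
    echo $(( $1 * $2 / 1048576 ))
}

cluster_mib 262144 2048    # nmbclusters: 512 MiB
cluster_mib 262144 4096    # nmbjumbop:   1024 MiB
cluster_mib 32000 16384    # nmbjumbo16:  500 MiB
cluster_mib 64000 9216     # nmbjumbo9:   562 MiB (562.5 truncated)
```

On a 24GB box that is a ceiling of roughly 2.5 GiB in total if every pool
were exhausted at once.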

netstat -m can give you some insight on your mbuf cluster usage, and
would be especially interesting to see during one of these fits you've
described.
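
Since the box goes unresponsive on the network during these fits, it may
be easier to have it log snapshots locally and read them afterwards. A
minimal sketch follows. The function name log_snapshots and the log path
in the usage comment are made up for illustration.

```shell
#!/bin/sh
# log_snapshots COUNT INTERVAL COMMAND [ARGS...]
# Runs COMMAND COUNT times, INTERVAL seconds apart, printing a timestamp
# header before each run so the snapshots can be lined up with the logs.
log_snapshots() {
    count=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        "$@"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Usage on the server (60 snapshots, 5 seconds apart):
#   log_snapshots 60 5 netstat -m >> /var/tmp/mbuf.log
```

In the resulting log, the lines reporting denied or delayed mbuf requests
are the ones to look at first. If those counters climb during a fit, the
cluster limits are a likely suspect.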

-- 
TerraNovaNet Internet Services - Key Largo, FL
Voice: (305)453-4011 x101   Fax: (305)451-5991
http://www.terranova.net/   PGP: 50091B3D
----------------------------------------------
Life's not fair, but the root password helps.