From: Bob Healey
Date: Fri, 12 Jul 2013 12:45:41 -0400
To: freebsd-fs@freebsd.org
Subject: Massive Problems with 10G, NFS, ZFS, and iSCSI
Message-ID: <51E032B5.9080705@rpi.edu>

I've been beating my head against a brick wall for a week with this and 5 similar systems.

My current major headache: a Dell PowerEdge R610 with dual quad-core Xeon E5530s @ 2.4GHz, 24GB RAM, 4 onboard bce NICs, 1 mxge NIC, a pair of 10K SAS drives on mpt (Dell MB SAS controller), and a pair of 15-drive 1TB RAID 6 arrays on mfi (PERC 6). The machine was originally installed with FreeBSD 7.2 and has been upgraded through the years to 9.1. None of the issues I'm currently seeing manifested themselves under 9.0.

Under heavy NFS load, the server becomes non-responsive on the network unless the packet payload is very small (ICMP ping packets with > 124 bytes of payload get dropped).

Current network config:

bce0: management network, connected to the 37 IPMI controllers in the rack; has conserver running SOL connections to each
bce1: link to the outside world; everything in the rack trying to reach outside is NATed through here
bce2: used for a direct host-to-host iSCSI link to another host in the rack, to provide a hard drive for a virtual machine. This machine is the iSCSI target, and an 80GB zvol is the backing store.
mxge0/vlan1: connected to the first 25 machines in the rack
mxge0/vlan2: connected to the remaining 12 machines in the rack, plus a VM on host #25 on vlan 1

This is an HPC cluster, with all nodes running RHEL 5. The landing pads (1 real, 1 virtual) are multihomed to both the internal and external networks, so the only traffic that crosses the NAT is software updates and job accounting information. PF is used for firewalling and NAT. skip is enabled on all internal interfaces.

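The payload-size cutoff described above (pings with more than 124 bytes of payload being dropped) could be pinned down and re-checked after each change with something like the following. This is only a minimal sketch: the function names, the 1472-byte upper bound (a standard 1500-byte MTU minus IP and ICMP headers), and the assumption that there is one clean cutoff are all illustrative and not taken from the setup described here. Under load the drops may well be intermittent, in which case the result is only a rough indicator.

```python
import subprocess
import sys


def ping_ok(host, payload, count=3, timeout=15):
    """Return True if ping reports at least one reply for this payload size.

    Uses only -c (count) and -s (payload bytes), which both FreeBSD and
    Linux ping accept. An exit status of 0 is treated as "got a reply".
    """
    cmd = ["ping", "-c", str(count), "-s", str(payload), host]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def largest_passing_payload(probe, lo=0, hi=1472):
    """Binary-search the largest size in [lo, hi] for which probe(size) is True.

    Assumes (hypothetically) a single cutoff: every size up to it passes
    and every larger size fails. Returns None if even `lo` fails.
    """
    if not probe(lo):
        return None
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


if __name__ == "__main__":
    # The target host is supplied on the command line; nothing is hard-coded.
    target = sys.argv[1]
    cutoff = largest_passing_payload(lambda size: ping_ok(target, size))
    print("largest payload that got a reply:", cutoff)
```

Run from one of the client nodes while the NFS load is active, with the server's address as the only argument. Comparing the reported cutoff between the mxge and igb machines would show whether the size threshold follows the NIC.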
Stuff I've tried: setting vfs.zfs.arc_max="20480M", disabling flow control on the 10G NIC, and moving the ZIL to some unused space on the boot drive (RAID 1, mostly UFS).

I'm getting lots of "Limiting open port RST response from 32325 to 200 packets/sec" messages in the logs, iSCSI timeouts on the client, and "NFS server not responding" errors. netstat -i is showing lots of input errors on mxge, but I'm not seeing any errors on the switch (Dell PowerConnect 6248). Myricom (the NIC vendor) is at a loss too.

Any ideas on what I should try next? I'm at the point of throwing darts blindfolded. I've got 5 more similar misbehaving machines, 4 of which behave just fine when using igb instead of mxge.

-- 
Bob Healey
Systems Administrator
Biocomputation and Bioinformatics
Constellation and Molecularium
healer@rpi.edu
(518) 276-4407