Date: Fri, 12 Jul 2013 12:45:41 -0400
From: Bob Healey <healer@rpi.edu>
To: freebsd-fs@freebsd.org
Subject: Massive Problems with 10G, NFS, ZFS, and iSCSI
Message-ID: <51E032B5.9080705@rpi.edu>

I've been beating my head against a brick wall for a week with this and 5 similar systems.

My current major headache: Dell PowerEdge R610, dual quad-core Xeon E5530 @ 2.4GHz, 24GB RAM, 4 onboard bce NICs, 1 mxge NIC, a pair of 10K SAS drives on mpt (Dell MB SAS controller), and a pair of 15-drive 1TB RAID 6 arrays on mfi (PERC 6). The machine was originally installed with FreeBSD 7.2 and has been upgraded through the years to 9.1. None of the issues I'm currently seeing manifested themselves under 9.0.

When under heavy NFS load, the server currently becomes non-responsive on the network unless the packet payload is very small (ICMP ping packets with > 124 bytes payload get dropped).

Current network config:

  bce0: management network, connected to the 37 IPMI controllers in the rack; has conserver running SOL connections to each
  bce1: link to the outside world; everything in the rack trying to reach outside is NATed through here
  bce2: used for a direct host-to-host iSCSI link to another host in the rack, to provide a hard drive for a virtual machine. This machine is the iSCSI target, and an 80GB zvol is the backing store.
  mxge0/vlan1: connected to the first 25 machines in the rack
  mxge0/vlan2: connected to the remaining 12 machines in the rack, plus a VM on host #25 on vlan 1

This is an HPC cluster, with all nodes running RHEL 5. The landing pads (1 real, 1 virtual) are multihomed to both the internal and external networks, so the only traffic that crosses the NAT is software updates and job accounting information. PF is used for firewalling and NAT. skip is enabled on all internal interfaces.

Stuff I've tried: setting vfs.zfs.arc_max="20480M", disabling flow control on the 10G NIC, and moving the ZIL to some unused space on the boot drive (RAID 1, mostly UFS).

I'm getting lots of "Limiting open port RST response from 32325 to 200 packets/sec" in the logs, iSCSI timeouts on the client, and "NFS server not responding" errors. netstat -i is showing lots of input errors on mxge, but I'm not seeing any errors on the switch (Dell PowerConnect 6248). Myricom (NIC vendor) is at a loss too.

Any ideas on what I should try next? I'm at the point of throwing darts blindfolded. I've got 5 more similar misbehaving machines, 4 of which behave just fine when using igb instead of mxge.

-- 
Bob Healey
Systems Administrator
Biocomputation and Bioinformatics
Constellation and Molecularium
healer@rpi.edu
(518) 276-4407