Date:      Fri, 12 Jul 2013 12:45:41 -0400
From:      Bob Healey <healer@rpi.edu>
To:        freebsd-fs@freebsd.org
Subject:   Massive Problems with 10G, NFS, ZFS, and iSCSI
Message-ID:  <51E032B5.9080705@rpi.edu>

I've been beating my head against a brick wall for a week with this and 
5 similar systems.

My current major headache:
Dell PowerEdge R610, dual quad-core Xeon E5530 @ 2.4GHz, 24GB RAM, 4 
onboard bce NICs, 1 mxge NIC, a pair of 10K SAS drives on mpt (Dell MB 
SAS controller), and a pair of 15-drive 1TB RAID 6 arrays on mfi (PERC 6).

The machine was originally installed with FreeBSD 7.2 and has been 
upgraded through the years to 9.1.  None of the issues I'm currently 
seeing manifested themselves under 9.0.  When under heavy NFS load, the 
server currently becomes non-responsive on the network, unless the 
packet payload is very small (ICMP ping packets with > 124 bytes payload 
get dropped).
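
One way to pin down the payload-size cutoff more precisely is to bisect 
it rather than guess at sizes by hand.  The following is only a sketch: 
the helper names are made up for illustration, it assumes the system 
ping accepts -c (count) and -s (payload size) and exits 0 when at least 
one reply arrives, it assumes a 1500-byte MTU (so 1472 bytes is the 
largest unfragmented ICMP payload), and it assumes loss is monotonic in 
size, which may not hold on a flaky link.

```python
import subprocess


def ping_ok(host, payload, count=3, timeout=15):
    """Return True if ping reports at least one reply for this payload size.

    Assumption: `ping -c COUNT -s SIZE HOST` exits 0 when a reply was heard.
    """
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), "-s", str(payload), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat a ping that never finishes as a failed probe.
        return False
    return result.returncode == 0


def largest_passing_payload(probe, low=0, high=1472):
    """Binary-search the largest size in [low, high] for which probe(size) is True.

    Assumes every size up to some cutoff passes and every larger size fails.
    Returns None if even `low` fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Example use against a hypothetical server address (not from this post):
#   cutoff = largest_passing_payload(lambda n: ping_ok("192.0.2.10", n))
```

Running it while the server is under NFS load and again when idle would 
show whether the cutoff moves with load or stays fixed, which might help 
separate a buffer-exhaustion problem from a framing or MTU problem.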

Current network config:
bce0: management network, connected to the 37 IPMI controllers in the 
rack, has conserver running SOL connections to each
bce1: link to outside world, everything in rack trying to reach outside 
is NATed through here
bce2: used for a direct host-to-host iSCSI link to another host in the 
rack to provide a hard drive for a virtual machine. This machine is the 
iSCSI target, and an 80GB zvol is the backing store.
mxge0/vlan1: connected to first 25 machines in rack
mxge0/vlan2: connected to remaining 12 machines in rack, plus a VM on 
host #25 on vlan 1

This is an HPC cluster, with all nodes running RHEL 5.  The landing pads 
(1 real, 1 virtual) are multihomed to both the internal and external 
networks, so the only traffic that crosses the NAT is software updates 
and job accounting information.

PF is used for firewalling and NAT.  skip is enabled on all internal 
interfaces.

Stuff I've tried:  setting vfs.zfs.arc_max="20480M", disabling flow 
control on the 10G NIC, moving the ZIL to some unused space on the boot 
drive (RAID 1, mostly UFS).

I'm getting lots of "Limiting open port RST response from 32325 to 200 
packets/sec" messages in the logs, iSCSI timeouts on the client, and 
"NFS server not responding" errors.  netstat -i is showing lots of input 
errors on mxge, but I'm not seeing any errors on the switch (Dell 
PowerConnect 6248).  Myricom (the NIC vendor) is at a loss too.
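
To see whether the mxge input errors actually climb during the NFS 
stalls, two netstat -i snapshots can be diffed.  This is a rough sketch, 
not a tested tool: the function names are made up, it assumes the header 
row contains an "Address" column followed by counter columns that 
include "Ierrs", and the sample text in the comments is invented for 
illustration rather than taken from the machine described here.

```python
def parse_netstat_i(text):
    """Map interface name -> {counter: value} from `netstat -i` output.

    Counter columns are aligned from the right, because the Network and
    Address columns can be empty on some rows.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    # Assumption: everything after "Address" is a numeric counter column.
    counters = header[header.index("Address") + 1:]
    stats = {}
    for line in lines[1:]:
        fields = line.split()
        name = fields[0]
        if name in stats:
            continue  # keep the first (link-level) row for each interface
        try:
            values = [int(v) for v in fields[-len(counters):]]
        except ValueError:
            continue  # rows with "-" placeholders carry no error counts
        stats[name] = dict(zip(counters, values))
    return stats


def ierrs_delta(before, after):
    """Growth in input errors per interface between two parsed snapshots."""
    return {
        name: after[name]["Ierrs"] - before[name]["Ierrs"]
        for name in after
        if name in before
    }


# Invented sample data, only to show the expected shape of the input.
SAMPLE_BEFORE = """\
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
mxge0  1500 <Link#5>      00:60:dd:00:00:01     1000    10     0      900     0     0
bce1   1500 <Link#2>      00:11:22:33:44:55      500     0     0      400     0     0
bce1   1500 192.0.2.0/24  192.0.2.1              480     -     -      390     -     -
lo0   16384 <Link#7>                              20     0     0       20     0     0
"""

SAMPLE_AFTER = """\
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
mxge0  1500 <Link#5>      00:60:dd:00:00:01     2000   250     0     1800     0     0
bce1   1500 <Link#2>      00:11:22:33:44:55      600     0     0      450     0     0
lo0   16384 <Link#7>                              25     0     0       25     0     0
"""
```

Feeding it real snapshots taken a few seconds apart, once under load and 
once idle, would show whether Ierrs on mxge0 grows only when the server 
stops responding.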

Any ideas on what I should try next?  I'm at the point of throwing darts 
blindfolded.

I've got 5 more similar misbehaving machines, 4 of which behave just 
fine when using igb instead of mxge.

-- 
Bob Healey
Systems Administrator
Biocomputation and Bioinformatics Constellation
and Molecularium
healer@rpi.edu
(518) 276-4407