From owner-freebsd-net@FreeBSD.ORG Thu Feb 27 17:02:38 2014
Date: Thu, 27 Feb 2014 09:02:36 -0800
Subject: Re: Network loss
From: Jack Vogel
To: Markus Gebert
Cc: Johan Kooijman, FreeBSD Net, Rick Macklem, John Baldwin
List-Id: Networking and TCP/IP with FreeBSD

I would make SURE that you have enough mbuf resources in whatever size
pool you are using (2K, 4K, 9K), and I would try the code in HEAD if you
have not.

Jack


On Thu, Feb 27, 2014 at 8:05 AM, Markus Gebert wrote:

>
> On 27.02.2014, at 02:00, Rick Macklem wrote:
>
> > John Baldwin wrote:
> >> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
> >>> Hi all,
> >>>
> >>> I have a weird situation here that I can't get my head around.
> >>>
> >>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients. Once in
> >>> a while the Linux clients lose their NFS connection:
> >>>
> >>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not responding,
> >>> timed out
> >>>
> >>> Not all boxes, just one out of the cluster. The weird part is that
> >>> when I try to ping a Linux client from the FreeBSD box, I see between
> >>> 10 and 30% packet loss - all day long, no specific timeframe. If I
> >>> ping the Linux clients - no loss. If I ping back from the Linux
> >>> clients to the FreeBSD box - no loss.
> >>>
> >>> The error I get when pinging a Linux client is this one:
> >>> ping: sendto: File too large
>
> We were facing similar problems when upgrading to 9.2 and have stayed
> with 9.1 on the affected systems for now. We've seen this on HP G8
> blades with 82599EB controllers:
>
> ix0@pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01
> hdr=0x00
>     vendor   = 'Intel Corporation'
>     device   = '82599EB 10 Gigabit Dual Port Backplane Connection'
>     class    = network
>     subclass = ethernet
>
> We didn't find a way to trigger the problem reliably. But when it
> occurs, it usually affects only one interface.
> Symptoms include:
>
> - socket functions return the 'File too large' error mentioned by Johan
> - socket functions return 'No buffer space available'
> - heavy to full packet loss on the affected interface
> - "stuck" TCP connections, i.e. ESTABLISHED TCP connections that should
>   have timed out stick around forever (the socket on the other side may
>   have been closed hours ago)
> - userland programs using the corresponding sockets usually get stuck
>   too (can't find kernel traces right now, but always in network-related
>   syscalls)
>
> The network is only lightly loaded on the affected systems (usually 5-20
> Mbit, capped at 200 Mbit, per server), and netstat never showed any
> indication of resource shortage (like mbufs).
>
> What made the problem go away temporarily was to ifconfig down/up the
> affected interface.
>
> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not really
> stable. Also, we tested a few revisions between 9.1 and 9.2 to find out
> when the problem started. Unfortunately, the ixgbe driver turned out to
> be mostly unstable on our systems between these releases, worse than on
> 9.2. The instability was introduced shortly after 9.1 and fixed only
> very shortly before the 9.2 release. So no luck there. We ended up using
> 9.1 with backports of the 9.2 features we really need.
>
> What we can't tell is whether it's the 9.2 kernel or the 9.2 ixgbe
> driver or a combination of both that causes these problems.
> Unfortunately, we ran out of time (and ideas).
>
>
> >> EFBIG is sometimes used by drivers when a packet takes too many
> >> scatter/gather entries. Since you mentioned NFS, one thing you can
> >> try is to disable TSO on the interface you are using for NFS to see
> >> if that "fixes" it.
> >>
> > And please email if you try it and let us know if it helps.
> >
> > I think I've figured out how 64K NFS read replies can do this,
> > but I'll admit "ping" is a mystery?
> > (Doesn't it just send a single packet that would be in a single mbuf?)
> >
> > I think the EFBIG is returned by bus_dmamap_load_mbuf_sg(), but I
> > don't know if it can happen for an mbuf chain with < 32 entries?
>
> We don't use the nfs server on our systems, but they're (new)nfsclients.
> So I don't think our problem is nfs-related, unless the default
> rsize/wsize for client mounts is not 8K, which I thought it was. Can you
> confirm this, Rick?
>
> IIRC, disabling TSO did not make any difference in our case.
>
>
> Markus
>
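
[Editor's note on the EFBIG discussion above: the usual FreeBSD driver
transmit-path pattern is sketched below from memory; this is not the
actual ixgbe code, and the variable names (txr, txtag, m_head) are
illustrative assumptions. The point is that if bus_dmamap_load_mbuf_sg()
reports too many scatter/gather segments even after an m_defrag() retry,
the error propagates up the stack, where userland can see it as
"File too large".]

```c
/* Hedged sketch of a common FreeBSD NIC driver pattern (not the actual
 * ixgbe code): try to DMA-map the mbuf chain; if it spans too many
 * scatter/gather segments (EFBIG), defragment into fewer, larger
 * mbufs and retry once. */
error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
    segs, &nsegs, BUS_DMA_NOWAIT);
if (error == EFBIG) {
	/* Collapse the chain; M_NOWAIT because we are in the tx path. */
	struct mbuf *m = m_defrag(m_head, M_NOWAIT);
	if (m == NULL) {
		m_freem(m_head);
		return (ENOBUFS);
	}
	m_head = m;
	error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
	    segs, &nsegs, BUS_DMA_NOWAIT);
}
if (error != 0) {
	/* Mapping still failed: the error (possibly EFBIG) is passed
	 * back up, eventually reaching the sending socket. */
	m_freem(m_head);
	return (error);
}
```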
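
[Editor's note: for readers landing on this thread later, the checks and
workarounds suggested above can be run roughly as follows. This is a
sketch; the interface name ix0 is an assumption, so substitute your own.]

```shell
# Inspect the mbuf and cluster pools (2K/4K/9K) that Jack mentions;
# look for "denied" or "delayed" counts indicating exhaustion.
netstat -m

# Per-UMA-zone view of the same pools.
vmstat -z | egrep 'mbuf|cluster'

# jhb's suggestion: disable TSO on the interface carrying NFS traffic
# (assumed here to be ix0); re-enable later with "ifconfig ix0 tso".
ifconfig ix0 -tso

# Markus's temporary workaround: bounce the affected interface.
ifconfig ix0 down && ifconfig ix0 up
```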