Date:      Sun, 14 Jun 2015 15:26:16 +0200
From:      InterNetX - Juergen Gotteswinter <juergen.gotteswinter@internetx.com>
To:        Karli Sjöberg <karli.sjoberg@slu.se>,  Andreas Nilsson <andrnils@gmail.com>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: [Fwd: Strange networking behaviour in storage server]
Message-ID:  <557D80F8.9000505@internetx.com>
In-Reply-To: <557D8092.7050301@internetx.com>
References:  <1433146506.14998.177.camel@data-b104.adm.slu.se> <CAPS9+Sturmr32jN3d1sfCsQUnyFneSMofT+ajwqCP=LPg_nseA@mail.gmail.com> <1433149349.14998.181.camel@data-b104.adm.slu.se> <20150613093117.GB37870@brick.home> <557D8092.7050301@internetx.com>



On 14.06.2015 at 15:24, InterNetX - Juergen Gotteswinter wrote:
> 
> 
>> On 13.06.2015 at 11:31, Edward Tomasz Napierała wrote:
>> On 0601T0902, Karli Sjöberg wrote:
>>> Mon 2015-06-01 at 10:33 +0200, Andreas Nilsson wrote:
>>>>
>>>>
>>>> On Mon, Jun 1, 2015 at 10:14 AM, Karli Sjöberg <karli.sjoberg@slu.se>
>>>> wrote:
>>>>         -------- Forwarded message --------
>>>>         > From: Karli Sjöberg <karli.sjoberg@slu.se>
>>>>         > To: freebsd-fs@freebsd.org <freebsd-fs@freebsd.org>
>>>>         > Subject: Strange networking behaviour in storage server
>>>>         > Date: Mon, 1 Jun 2015 07:49:56 +0000
>>>>         >
>>>>         > Hey!
>>>>         >
>>>>         > So we have this ZFS storage server upgraded from
>>>>         > 9.3-RELEASE to 10.1-STABLE to overcome not being able to
>>>>         > 1) use SSD drives as L2ARC[1] and 2) hotswap SATA drives[2].
>>>>         >
>>>>         > After the upgrade we've noticed a very odd networking
>>>>         > behaviour: it sends/receives at full speed for a while, then
>>>>         > there are a couple of minutes of complete silence where even
>>>>         > terminal commands like an "ls" just wait until they are
>>>>         > executed, and then it starts sending at full speed again.
>>>>         > I've linked to a screenshot showing this send-and-pause
>>>>         > behaviour. The blue line is the total, green is SMB and
>>>>         > turquoise is NFS over jumbo frames. It behaves this way
>>>>         > regardless of the protocol.
>>>>         >
>>>>         > http://oi62.tinypic.com/33xvjb6.jpg
>>>>         >
>>>>         > The problem is that these pauses can sometimes be so long
>>>>         > that connections drop. Say someone is copying files over SMB
>>>>         > or iSCSI and suddenly they get an error message saying that
>>>>         > the transfer failed and they have to start over with the
>>>>         > file(s). That's horrible!
>>>>         >
>>>>         > So far NFS has proven to be the most resilient; its
>>>>         > stupid-simple nature just waits and resumes the transfer
>>>>         > when the pause is over. Kudos for that.
>>>>         >
>>>>         > The server is driven by a Supermicro X9SRL-F, a Xeon 1620v2
>>>>         > and 64GB ECC RAM. The hardware has been ruled out; we
>>>>         > happened to have an identical MB and CPU lying around and
>>>>         > that didn't improve things. We have also installed an Intel
>>>>         > PRO 100/1000 quad-port ethernet adapter to test if that
>>>>         > would change things, but it hasn't, it still behaves this
>>>>         > way.
>>>>         >
>>>>         > The two built-in NICs are Intel 82574L and the quad-port
>>>>         > NICs are Intel 82571EB, so both are em(4) driven. I happen
>>>>         > to know that the em driver has been updated between 9.3 and
>>>>         > 10.1. Perhaps that is to blame, but I have no idea.
>>>>         >
>>>>         > Is there anyone that can make sense of this?
>>>>         >
>>>>         > [1]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197164
>>>>         >
>>>>         > [2]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191348
>>>>         >
>>>>         > /K
>>>>         >
>>>>         >
>>>>         
>>>>         
>>>>         Another observation I've made is that during these pauses the
>>>>         entire system is put on hold; even a ZFS scrub stops and then
>>>>         resumes after a while. Looking in top, the system is
>>>>         completely idle.
>>>>         
>>>>         Normally during a scrub the kernel eats 20-30% CPU, but during
>>>>         a pause even the [kernel] goes down to 0.00%. Makes me think
>>>>         the networking has nothing to do with it.
>>>>         
>>>>         What's then to blame? ZFS?
>>>>         
>>>>         /K
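A quick way to narrow that down while a pause is in progress, using only
base-system tools (the wait-channel names mentioned in the comments are just
examples of what to look for):

    # Is the box out of CPU, or is everything just blocked?
    # -S shows system processes, -H shows individual threads.
    top -SH

    # Dump kernel stacks of all processes; many threads parked in the
    # same wait channel (e.g. a ZFS zio/txg sync wait) would point at
    # storage rather than the network.
    procstat -kk -a > /tmp/stacks-during-pause.txt
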
>>>>
>>>>
>>>> Hello,
>>>>
>>>>
>>>> does this happen when clients are only reading from the server?
>>>
>>> Yes, it happens when clients are only reading from the server.
>>>
>>>> Otherwise I would suspect that it could be caused by ZFS writing out a
>>>> large chunk of data sitting in its caches, and until that is complete
>>>> I/O is stalled.
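If that is what is happening, the stalls should line up with transaction-group
syncs, and something like the following would show it (the pool name "tank" is
a placeholder, and the sysctl names are the 10.x ones; they may differ on other
versions):

    # Watch for the pool going quiet and then writing in large bursts.
    zpool iostat -v tank 1

    # Write-throttle / transaction-group knobs worth inspecting.
    sysctl vfs.zfs.txg.timeout vfs.zfs.dirty_data_max
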
>>>
>>> That's what's so strange: we have three more systems set up at about the
>>> same size, and none of the others is acting this way.
>>>
>>> The only thing I can think of that differs, and that we haven't tested
>>> ruling out yet, is ctld; the other systems are still running istgt as
>>> their iSCSI daemon.
>>
>> So, were you able to rule out ctld?
>>
>> Do you have local, or terminal, access to the machine?  When the problem
>> manifests, do local commands work?  In other words, is the whole machine
>> wedged, or just the network?  If it's just the network, it might be
>> caused by ctld consuming all available mbufs.  You could run "netstat -m"
>> before and after to check that.
>>
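For reference, the interesting part of that output is the denied/delayed
counters; a quick way to check for mbuf exhaustion with plain base-system
tools:

    # Any non-zero "denied" counters here mean mbuf/cluster exhaustion.
    netstat -m | grep -E 'denied|delayed'

    # Current usage and limits from the zone allocator.
    vmstat -z | grep -i -E 'mbuf|cluster'
    sysctl kern.ipc.nmbclusters
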
> 
> You already checked (double-checked) the HBA firmware etc.? Cabling is fine?
> 
> I expect you already disabled tso, lro, rxcsum and txcsum on your NIC(s).
> I had similar effects with all those fancy uberfeatures enabled.
> 
> Give it a try... ifconfig foo0 -rxcsum -txcsum -tso -lro
> 
> Capturing a few MB of traffic before/after could also be very helpful to
> see if...
> 
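To make those settings stick across reboots, and to get a capture worth
comparing, something along these lines works (em0, the address and the file
name are only placeholders):

    # /etc/rc.conf -- keep the offloads disabled persistently.
    ifconfig_em0="inet 192.0.2.10/24 -rxcsum -txcsum -tso -lro"

    # Grab full-size frames around a stall for later comparison.
    tcpdump -i em0 -s 0 -c 20000 -w /tmp/during-stall.pcap
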

errm, sorry, forgot something... what does your network setup look like?
Link aggregation? Which switches, which line speed, stacked or not?

Any drops / errors on your interfaces?
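
Checking that is cheap; with em(4) the driver also exports its own counters
under dev.em.N (the unit number 0 below is just an example):

    # Per-interface error and drop counters.
    netstat -idn

    # em(4) MAC statistics (missed packets, receive errors, ...).
    sysctl dev.em.0 | grep -E -i 'drop|err|missed'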





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?557D80F8.9000505>