Date:      Sat, 13 Jun 2015 11:31:17 +0200
From:      Edward Tomasz Napierała <trasz@FreeBSD.org>
To:        Karli Sjöberg <karli.sjoberg@slu.se>
Cc:        Andreas Nilsson <andrnils@gmail.com>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject:   Re: [Fwd: Strange networking behaviour in storage server]
Message-ID:  <20150613093117.GB37870@brick.home>
In-Reply-To: <1433149349.14998.181.camel@data-b104.adm.slu.se>
References:  <1433146506.14998.177.camel@data-b104.adm.slu.se> <CAPS9%2BSturmr32jN3d1sfCsQUnyFneSMofT%2BajwqCP=LPg_nseA@mail.gmail.com> <1433149349.14998.181.camel@data-b104.adm.slu.se>

On 0601T0902, Karli Sjöberg wrote:
> Mon 2015-06-01 at 10:33 +0200, Andreas Nilsson wrote:
> > 
> > 
> > On Mon, Jun 1, 2015 at 10:14 AM, Karli Sjöberg <karli.sjoberg@slu.se>
> > wrote:
> >         -------- Forwarded message --------
> >         > From: Karli Sjöberg <karli.sjoberg@slu.se>
> >         > To: freebsd-fs@freebsd.org <freebsd-fs@freebsd.org>
> >         > Subject: Strange networking behaviour in storage server
> >         > Date: Mon, 1 Jun 2015 07:49:56 +0000
> >         >
> >         > Hey!
> >         >
> >         > So we have this ZFS storage server, upgraded from 9.3-RELEASE to
> >         > 10.1-STABLE to overcome 1) not being able to use SSD drives as
> >         > L2ARC[1] and 2) not being able to hotswap SATA drives[2].
> >         >
> >         > After the upgrade we've noticed very odd networking behaviour: it
> >         > sends/receives at full speed for a while, then there are a couple of
> >         > minutes of complete silence, where even terminal commands like "ls"
> >         > just wait before they are executed, and then it starts sending at
> >         > full speed again. I've linked to a screenshot showing this
> >         > send-and-pause behaviour. The blue line is the total, green is SMB
> >         > and turquoise is NFS over jumbo frames. It behaves this way
> >         > regardless of the protocol.
> >         >
> >         > http://oi62.tinypic.com/33xvjb6.jpg
> >         >
> >         > The problem is that these pauses can sometimes be so long that
> >         > connections drop. For example, someone is copying files over SMB or
> >         > iSCSI and suddenly gets an error message saying that the transfer
> >         > failed, and has to start over with the file(s). That's horrible!
> >         >
> >         > So far NFS has proven to be the most resilient; its stupid-simple
> >         > nature just waits and resumes the transfer when the pause is over.
> >         > Kudos for that.
> >         >
> >         > The server is driven by a Supermicro X9SRL-F, a Xeon 1620v2 and 64GB
> >         > ECC RAM. The hardware has been ruled out: we happened to have an
> >         > identical MB and CPU lying around, and swapping them in didn't
> >         > improve things. We have also installed an Intel PRO 100/1000
> >         > Quad-port ethernet adapter to test if that would change things, but
> >         > it hasn't; it still behaves this way.
> >         >
> >         > The two built-in NICs are Intel 82574L and the Quad-port NICs are
> >         > Intel 82571EB, so both are em(4) driven. I happen to know that the em
> >         > driver was updated between 9.3 and 10.1. Perhaps that is to blame,
> >         > but I have no idea.
> >         >
> >         > Is there anyone that can make sense of this?
> >         >
> >         > [1]:
> >         > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=197164
> >         >
> >         > [2]:
> >         > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=191348
> >         >
> >         > /K
> >         >
> >         >
> >         
> >         
> >         Another observation I've made is that during these pauses, the entire
> >         system is put on hold; even a ZFS scrub stops and then resumes after a
> >         while. Looking in top, the system is completely idle.
> >         
> >         Normally during scrub, the kernel eats 20-30% CPU, but during a pause,
> >         even the [kernel] goes down to 0.00%. Makes me think the networking
> >         has nothing to do with it.
> >         
> >         What's then to blame? ZFS?
> >         
> >         /K
> > 
> > 
> > Hello,
> > 
> > 
> > does this happen when clients are only reading from the server?
> 
> Yes it happens when clients are only reading from the server.
> 
> > Otherwise I would suspect that it could be caused by ZFS writing out a
> > large chunk of data sitting in its caches, with I/O stalled until that
> > is complete.
> 
> That's what's so strange: we have three more systems set up at about the
> same size and none of the others are acting this way.
> 
> The only difference I can think of that we haven't ruled out yet is
> ctld; the other systems are still running istgt as their iSCSI daemon.

So, were you able to rule out ctld?

Do you have local, or terminal, access to the machine?  When the problem
manifests, do local commands work?  In other words, is the whole machine
wedged, or just the network?  If it's just the network, it might be
caused by ctld consuming all available mbufs.  You could run "netstat -m"
before and after to check that.
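
To make the before/after comparison less error-prone, something along
these lines could be used.  This is only a rough sketch, not a tested
tool: it assumes "netstat -m" prints counter lines of the form
"1024/3431/4455 mbufs in use (current/cache/total)", and the helper
names (parse_netstat_m, diff_counters, snapshot) are made up for
illustration.

```python
import re
import subprocess

# Assumed line shape: "1024/3431/4455 mbufs in use (current/cache/total)".
# Lines with size suffixes (e.g. "6511K/...") or without a trailing
# parenthesised field list are deliberately skipped.
LINE_RE = re.compile(r"^(\d+(?:/\d+)*)\s+(.+?)\s+\(([^)]*)\)\s*$")


def parse_netstat_m(text):
    """Return {"<label>.<field>": int} for each slash-separated counter line."""
    counters = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        values = m.group(1).split("/")
        label = m.group(2)
        fields = m.group(3).split("/")
        if len(values) != len(fields):
            continue  # unexpected layout; ignore rather than guess
        for field, value in zip(fields, values):
            counters[f"{label}.{field}"] = int(value)
    return counters


def diff_counters(before, after):
    """Return only the counters that changed, as after-minus-before deltas."""
    return {
        key: after[key] - before[key]
        for key in after
        if key in before and after[key] != before[key]
    }


def snapshot():
    """Run "netstat -m" on the local machine and parse its output."""
    result = subprocess.run(
        ["netstat", "-m"], capture_output=True, text=True, check=True
    )
    return parse_netstat_m(result.stdout)
```

Take one snapshot() while the machine behaves and another during or
right after a stall, then pass both to diff_counters().  If the
"requests for mbufs denied" counters grow, or "mbufs in use" climbs
towards its limit, that would point towards mbuf exhaustion.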