From: "lokadamus@gmx.de"
Date: Fri, 23 Sep 2016 22:55:55 +0200
Subject: Re: Server gets a high load, but no CPU use, and then later stops respond on the network
To: Ståle Kristoffersen, Anton Yuzhaninov
Cc: "freebsd-questions@freebsd.org"
Message-ID: <94870139-26b7-ef0f-dbd9-df599642bac3@gmx.de>
In-Reply-To: <20160921093809.GB13386@putsch.kolbu.ws>
References: <20160913232351.GA36091@putsch.kolbu.ws> <68f553b9-8546-7707-df86-88851b3283f8@citrin.ru> <20160921093809.GB13386@putsch.kolbu.ws>

On 09/21/16 11:38, Ståle Kristoffersen wrote:
> On 2016-09-20 at 16:57, Anton Yuzhaninov wrote:
>> On 2016-09-13 19:23, Ståle Bordal Kristoffersen wrote:
>>> about once a day, but not in any pattern, it starts getting a load of 5-10
>>> and usually stops responding over the network before I notice it.
>>
>> Does it stop responding completely (including ping), or is it only some
>> services and ssh that don't respond?
>
> It just starts getting more and more lagged. It usually responds to ping,
> but ssh can start to time out. Already opened ssh sessions can live quite
> long, but running stuff can be a problem after a while.
>
>>
>>> From googling a bit, I have tried to disable msix on the igb network
>>> interface, and increased the nmbclusters with no apparent change in behaviour.
>>> (kern.ipc.nmbclusters="1000000" and hw.igb.enable_msix=0 in loader.conf)
>>
>> kern.ipc.nmbclusters is autotuned to a very large value on modern FreeBSD
>> versions, and increasing it manually is rarely needed.
>>
>> Disabling msix on igb is also unlikely to be needed.
>
> This was more of a "grasping at straws" move, and I only included it for
> completeness.
>
>>> All I see is that the igb0 taskq pid is almost always in the RUN state when
>>> the machine is having trouble.
>>
>> There is no igb0 taskq in the top output below.
>>
>> To see how the top output looks when the machine stops responding, it
>> is useful to run top from cron and log the output.
>>
>> Example script for top logging:
>> https://bitbucket.org/snippets/citrin/BpeXb
>>
>> In the top output you should look at WCPU and STATE for kernel threads and
>> for unresponsive network daemons.
>
> I've now configured that script to run, and I'll share the results the next
> time the server has issues.
>
>> Also, do you have a network load graph (bytes and packets per second) for
>> this host (I saw munin in the process list)? Maybe the load is too high at
>> the moments when the host is not responding.
>
> When this happens network traffic crawls to a stop. I've also checked that
> there isn't any other traffic on the network port causing problems. I also
> tried doing 'ifconfig igb0 down' on the interface just to see if the server
> would unclog itself.
>
>> Do you use firewalls or netgraph?
>
> No, nothing configured.
>
>> What is the primary function of this server?
>
> It's a fileserver, sharing files via samba and FTP.
>

I have no idea. Can you tell me what dmesg tells you?
It looks like there is a system overrun, but it is difficult to understand why.

Greetings
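
In case it helps with the "run top from cron and log output" suggestion quoted above, here is a minimal sketch of such a logger. It is not the script behind the bitbucket link. The function name append_snapshot, the file name toplog.py and the log path are made up for illustration, and it assumes Python 3 has been installed from packages or ports. It runs one command, with a timeout so that a lagging host cannot pile up stuck jobs, and appends the timestamped output to a log file.

```python
#!/usr/bin/env python3
"""Append timestamped output of a command (e.g. top or dmesg) to a log file.

Hypothetical helper for illustration; not the script linked in the thread.
"""
import subprocess
import sys
from datetime import datetime


def append_snapshot(log_path, command, timeout=30):
    """Run `command` once and append its output to `log_path`.

    Returns the command's exit status, or None if the command timed out
    or could not be started.
    """
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep error messages in the same log
            universal_newlines=True,
            timeout=timeout,           # give up if the host is too lagged
        )
        output, status = result.stdout, result.returncode
    except (subprocess.TimeoutExpired, OSError) as exc:
        output, status = "snapshot failed: {}\n".format(exc), None
    with open(log_path, "a") as log:
        # One header line per snapshot so entries are easy to find later.
        log.write("=== {} === {}\n".format(stamp, " ".join(command)))
        log.write(output)
        if not output.endswith("\n"):
            log.write("\n")
    return status


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Usage: toplog.py LOGFILE COMMAND [ARGS...]
    append_snapshot(sys.argv[1], sys.argv[2:])
```

Assuming the script is saved as /root/toplog.py and python3 lives in /usr/local/bin (both hypothetical paths), a crontab line such as

    * * * * * /usr/local/bin/python3 /root/toplog.py /var/log/toplog.log top -b -S -H -d 1

should record one batch-mode display per minute, including system processes and threads. The flags are as I understand them from FreeBSD's top(1), so please check the man page on your release. The same helper can capture dmesg by passing dmesg as the command instead of top.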