From: Cy Schubert <cy.schubert@komquats.com>
Reply-To: Cy Schubert
To: "O. Hartmann"
Cc: Cy Schubert, Michael Butler, "K. Macy", FreeBSD CURRENT
Subject: Re: CURRENT slow and shaky network stability
Date: Mon, 04 Apr 2016 23:46:08 -0700
Message-Id: <201604050646.u356k850078565@slippy.cwsent.com>
In-Reply-To: <20160405082047.670d7241@freyja.zeit4.iv.bundesimmobilien.de>

In message <20160405082047.670d7241@freyja.zeit4.iv.bundesimmobilien.de>,
"O. Hartmann" writes:
> On Sat, 02 Apr 2016 16:14:57 -0700
> Cy Schubert wrote:
>
> > In message <20160402231955.41b05526.ohartman@zedat.fu-berlin.de>,
> > "O. Hartmann" writes:
> > > On Sat, 2 Apr 2016 11:39:10 +0200
> > > "O. Hartmann" wrote:
> > >
> > > > On Sat, 2 Apr 2016 10:55:03 +0200
> > > > "O. Hartmann" wrote:
> > > >
> > > > > On Sat, 02 Apr 2016 01:07:55 -0700
> > > > > Cy Schubert wrote:
> > > > >
> > > > > > In message <56F6C6B0.6010103@protected-networks.net>,
> > > > > > Michael Butler writes:
> > > > > > > -current is not great for interactive use at all. The strategy
> > > > > > > of pre-emptively dropping idle processes to swap is hurting ..
> > > > > > > big time.
> > > > > >
> > > > > > FreeBSD doesn't "preemptively" or arbitrarily push pages out to
> > > > > > disk. LRU doesn't do this.
> > > > > >
> > > > > > > Compare inactive memory to swap in this example ..
> > > > > > >
> > > > > > > 110 processes: 1 running, 108 sleeping, 1 zombie
> > > > > > > CPU:  1.2% user,  0.0% nice,  4.3% system,  0.0% interrupt, 94.5% idle
> > > > > > > Mem: 474M Active, 1609M Inact, 764M Wired, 281M Buf, 119M Free
> > > > > > > Swap: 4096M Total, 917M Used, 3178M Free, 22% Inuse
> > > > > >
> > > > > > To analyze this you need to capture vmstat output. You'll see the
> > > > > > free pool dip below a threshold and pages go out to disk in
> > > > > > response. If you have daemons with small working sets, pages that
> > > > > > are not part of the working sets of those daemons or applications
> > > > > > will eventually be paged out. This is not a bad thing. In your
> > > > > > example above, the 281 MB of UFS buffers are more active than the
> > > > > > 917 MB paged out. If a page is paged out and never used again, it
> > > > > > doesn't hurt, whereas the 281 MB of buffers saves you I/O. The
> > > > > > inactive pages are part of your free pool; they were active at one
> > > > > > time but now are not. They may be reclaimed, and if they are,
> > > > > > you've just saved more I/O.
> > > > > >
> > > > > > Top is a poor tool to analyze memory use; vmstat is the better
> > > > > > tool to help understand it. Inactive memory isn't a bad thing per
> > > > > > se. Monitor page outs, the scan rate and page reclaims.
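If you want to watch that happen, something along these lines should do
(a sketch from memory; see vmstat(8) and systat(1) for the details):

    # Sample the VM system every 5 seconds; pi/po are pages paged in/out,
    # fr is pages freed and sr is the page daemon's scan rate.
    vmstat 5

    # Cumulative paging counters since boot, page reclaims included:
    vmstat -s | egrep -i 'page|swap'

    # The same numbers as a continuously updating display:
    systat -vmstat 5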
> > > > > I give up! I tried to check via ssh/vmstat what is going on. These
> > > > > are the last lines before the pipe broke:
> > > > >
> > > > > [...]
> > > > > procs    memory       page                     disks     faults        cpu
> > > > >  r b  w   avm   fre    flt re  pi po     fr    sr ad0 ad1  in     sy     cs us sy id
> > > > > 22 0 22  5.8G  1.0G  46319  0   0  0  55721  1297   0   4 219  23907   5400 95  5  0
> > > > > 22 0 22  5.4G  1.3G  51733  0   0  0  72436  1162   0   0 108  40869   3459 93  7  0
> > > > > 15 0 22   12G  1.2G  54400  0  27  0  52188  1160   0  42 148  52192   4366 91  9  0
> > > > > 14 0 22   12G  1.0G  44954  0  37  0  37550  1179   0  39 141  86209   4368 88 12  0
> > > > > 26 0 22   12G  1.1G  60258  0  81  0  69459  1119   0  27 123 779569 704359 87 13  0
> > > > > 29 3 22   13G  774M  50576  0  68  0  32204  1304   0   2 102 507337 484861 93  7  0
> > > > > 27 0 22   13G  937M  47477  0  48  0  59458  1264   3   2 112  68131  44407 95  5  0
> > > > > 36 0 22   13G  829M  83164  0   2  0  82575  1225   1   0 126  99366  38060 89 11  0
> > > > > 35 0 22  6.2G  1.1G  98803  0  13  0 121375  1217   2   8 112  99371   4999 85 15  0
> > > > > 34 0 22   13G  723M  54436  0  20  0  36952  1276   0  17 153  29142   4431 95  5  0
> > > > >
> > > > > Fssh_packet_write_wait: Connection to 192.168.0.1 port 22: Broken pipe
> > > > >
> > > > > This makes this crap system completely unusable. The server in
> > > > > question (FreeBSD 11.0-CURRENT #20 r297503: Sat Apr 2 09:02:41 CEST
> > > > > 2016, amd64) was doing a poudriere bulk job. I can not even
> > > > > determine which terminal goes down first - another one, idle for
> > > > > much longer than the one showing the "vmstat 5" output, is still
> > > > > alive!
> > > > >
> > > > > I consider this a serious bug, and nothing that has happened since
> > > > > this "fancy" update has been a benefit. :-(
> > > >
> > > > By the way - this might be of interest and provide some hint. One of
> > > > my boxes acts as server and gateway; it uses NAT and IPFW. When it is
> > > > under high load, as it was today, passing traffic from the ISP to the
> > > > clients on the internal network is sometimes extremely slow. I do not
> > > > consider this the reason for the collapsing ssh sessions, since that
> > > > incident also happens under no load, but in the overall view of the
> > > > problem it could be a hint - I hope.
> > >
> > > I just checked on one box that "broke the pipe" very quickly after I
> > > started poudriere, having behaved well for a couple of hours before the
> > > pipe broke. It seems to be load dependent when the ssh session gets
> > > wrecked. More importantly: after the long-haul poudriere run I rebooted
> > > the box and tried again, with the mentioned broken pipe a couple of
> > > minutes after poudriere started. Then I left the box alone for several
> > > hours, logged in again and checked the swap. Although there had been no
> > > load or other pressure for hours, 31% of swap was still in use (the box
> > > has 16 GB of RAM and is propelled by a XEON E3-1245 V2).
> >
> > 31%! Is it *actively* paging, or was the 31% paged out earlier with no
> > paging *currently* being experienced? And 31% of how much swap space in
> > total?
> >
> > Also, what does ps aumx or ps aumxww say? Pipe it to head -40 or similar.
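To be concrete, this is roughly what I'm after - a sketch, assuming only the
stock base-system tools (see swapinfo(8), ps(1) and vmstat(8)):

    # How much swap is configured and how much of it is in use right now:
    swapinfo -h

    # The 40 processes using the most memory (the m flag sorts by memory):
    ps aumxww | head -40

    # Whether anything is being paged out right now; run it twice a few
    # minutes apart and compare the "pages paged out" counters:
    vmstat -s | grep -i 'paged out'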
>
> On FreeBSD 11.0-CURRENT #4 r297573: Tue Apr 5 07:01:19 CEST 2016, amd64,
> local network, no NAT: a stuck ssh session in the middle of administering,
> after leaving the console/ssh session alone for a couple of minutes:
>
> root      2064  0.0  0.1 91416 8492  -  Is  07:18  0:00.03 sshd: hartmann [priv] (sshd)
> hartmann  2108  0.0  0.1 91416 8664  -  I   07:18  0:07.33 sshd: hartmann@pts/0 (sshd)
> root     72961  0.0  0.1 91416 8496  -  Is  08:11  0:00.03 sshd: hartmann [priv] (sshd)
> hartmann 72970  0.0  0.1 91416 8564  -  S   08:11  0:00.02 sshd: hartmann@pts/1 (sshd)
>
> The situation is getting worse, and I consider this a serious bug.

There's not a lot to go on here. Do you have physical access to the machine
to pop into DDB and take a look?

You did say you're using a lot of swap - IIRC 30% - but you didn't answer
how much that 30% is of. Without more data I can't help you; at best I can
take wild guesses, and that won't help you either. Try to answer the
questions I asked last week and we can go further. Until then all we can do
is guess wildly.


-- 
Cheers,
Cy Schubert
FreeBSD UNIX:    Web:  http://www.FreeBSD.org

	The need of the many outweighs the greed of the few.