Date:      Tue, 29 Mar 2016 08:08:40 +0200
From:      "O. Hartmann" <ohartman@zedat.fu-berlin.de>
To:        Don Lewis <truckman@FreeBSD.org>
Cc:        imb@protected-networks.net, kmacy@freebsd.org, freebsd-current@freebsd.org
Subject:   Re: CURRENT slow and shaky network stability
Message-ID:  <20160329080840.3da929de@freyja.zeit4.iv.bundesimmobilien.de>
In-Reply-To: <201603282152.u2SLq9HN086958@gw.catspoiler.org>
References:  <20160328084440.501ef862.ohartman@zedat.fu-berlin.de> <201603282152.u2SLq9HN086958@gw.catspoiler.org>

On Mon, 28 Mar 2016 14:52:09 -0700 (PDT)
Don Lewis <truckman@FreeBSD.org> wrote:

> On 28 Mar, O. Hartmann wrote:
> > On Sat, 26 Mar 2016 14:26:45 -0700 (PDT)
> > Don Lewis <truckman@FreeBSD.org> wrote:
> >   
> >> On 26 Mar, Michael Butler wrote:  
> >> > -current is not great for interactive use at all. The strategy of
> >> > pre-emptively dropping idle processes to swap is hurting .. big time.
> >> > 
> >> > Compare inactive memory to swap in this example ..
> >> > 
> >> > 110 processes: 1 running, 108 sleeping, 1 zombie
> >> > CPU:  1.2% user,  0.0% nice,  4.3% system,  0.0% interrupt, 94.5% idle
> >> > Mem: 474M Active, 1609M Inact, 764M Wired, 281M Buf, 119M Free
> >> > Swap: 4096M Total, 917M Used, 3178M Free, 22% Inuse
> >> > 
> >> >   PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
> >> >  1819 imb              1  28    0   213M 11284K select  1 147:44   5.97% gkrellm
> >> > 59238 imb             43  20    0   980M   424M select  0  10:07   1.92% firefox
> >> > 
> >> >  .. it shouldn't start randomly swapping out processes because they're
> >> > used infrequently when there's more than enough RAM to spare ..    
> >> 
> >> I don't know what changed, and probably something can use some tweaking,
> >> but paging out idle processes isn't always the wrong thing to do.  For
> >> instance if I'm using poudriere to build a bunch of packages and its
> >> heavy use of tmpfs is pushing the machine into many GB of swap usage, I
> >> don't want interactive use like:
> >> 	vi foo.c
> >> 	cc foo.c
> >> 	vi foo.c
> >> to suffer because vi and cc have to be read in from a busy hard drive
> >> each time while unused console getty and idle sshd processes in a bunch
> >> of jails are still hanging on to memory even though they haven't
> >> executed any instructions since shortly after the machine was booted
> >> weeks ago.
> >>   
> >> > It also shows up when trying to reboot .. on all of my gear, 90 seconds
> >> > of "fail-safe" time-out is no longer enough when a good proportion of
> >> > daemons have been dropped onto swap and must be brought back in to flush
> >> > their data segments :-(    
> >> 
> >> That's a different and known problem.  See:
> >> <https://svnweb.freebsd.org/base/releng/10.3/bin/csh/config_p.h?revision=297204&view=markup>
> > 
> > CURRENT has become unusable and faulty. Updating ports for poudriere ends
> > up in this error/broken pipe on the remote console:
> > 
> >  [~] poudriere ports -u -p head
> > [00:00:00] ====>> Updating portstree "head"
> > [00:00:00] ====>> Updating the ports tree... done
> > root@gate [~] Fssh_packet_write_wait: Connection to 192.168.250.111 port 22: Broken pipe
> > 
> > 
> > Although the machine is not under load, several processes get idled/paged
> > out over time - and they never recover. The connection is then sabotaged
> > and the whole thing is unusable :-(
> 
> I'm definitely not seeing that here.  This is getting close to the end
> of a big poudriere run:
> 
> last pid: 82549;  load averages: 20.05, 20.72, 23.51    up 5+12:34:14  12:51:55
> 144 processes: 20 running, 109 sleeping, 15 stopped
> CPU: 85.3% user,  0.0% nice, 14.7% system,  0.0% interrupt,  0.0% idle
> Mem: 1082M Active, 19G Inact, 9718M Wired, 249M Buf, 1095M Free
> ARC: 3841M Total, 2039M MFU, 642M MRU, 3395K Anon, 111M Header, 1044M Other
> Swap: 40G Total, 9691M Used, 31G Free, 23% Inuse, 196K In
> 
> At the moment, openoffice-4, openoffice-devel, libreoffice, and chromium
> are all being built and are using tmpfs for "wrkdir data localbase", so
> there are many GB of data in tmpfs, which is the reason for the high
> inact and swap usage.  I just hit the return key in an idle (for a
> couple of hours) terminal window containing an ssh login session to the
> same machine.  I got a fresh command prompt essentially instantaneously.
> It couldn't have taken more than a couple hundred milliseconds to wake
> up and page in the idle sshd and shell processes on the build server.
> 
> [a couple hours later, after poudriere is done and all tmpfs is gone]
> 
> last pid: 66089;  load averages:  0.13,  1.59,  4.61    up 5+14:14:33  14:32:14
> 71 processes:  1 running, 55 sleeping, 15 stopped
> CPU:  3.1% user,  0.0% nice,  0.0% system,  0.0% interrupt, 96.9% idle
> Mem: 58M Active, 85M Inact, 12G Wired, 249M Buf, 19G Free
> ARC: 6249M Total, 2792M MFU, 2246M MRU, 16K Anon, 133M Header, 1078M Other
> Swap: 40G Total, 81M Used, 40G Free
> 
> [after tracking down and exiting all of those stopped processes]
> 
> last pid: 66103;  load averages:  0.20,  0.99,  3.80    up 5+14:17:18  14:34:59
> 56 processes:  1 running, 55 sleeping
> CPU:  0.0% user,  0.0% nice,  0.1% system,  0.1% interrupt, 99.9% idle
> Mem: 57M Active, 88M Inact, 12G Wired, 249M Buf, 19G Free
> ARC: 6251M Total, 2793M MFU, 2247M MRU, 16K Anon, 133M Header, 1078M Other
> Swap: 40G Total, 63M Used, 40G Free
> 
> The biggest chunk of the 63 MB of swap appears to be nginx.  Its
> process size is 29 MB, but it has zero resident.  It hasn't executed any
> code since it was first started when I booted the system several days
> ago.  Other consumers appear to be getty and sshd and syslogd in various
> untouched jails.
> 
> 
> I've seen reports that r296137 and r297267 show the ssh problem, but
> this machine is in the middle with r297204 and I don't see it.
> 
> As mentioned previously, I'm not running Xorg and a bunch of bloated
> X11 clients on this machine.  Those make fat targets for having RAM
> taken from them, which would probably make my interactive experience
> less pleasant, but that should still not affect ssh.
> 
> On my FreeBSD 10 machine, which has only 8 GB of RAM, my experience is
> that firefox gets pretty bloated after a while.  It's currently at 2.6
> GB (with 2.8 GB of swap currently in use - I've got some other RAM hogs
> running as well) and I'm not seeing any problems, but when it gets up in
> the 4-5 GB range, things can start to get pretty laggy, but I don't see
> problems with ssh.  The biggest problem with firefox seems to be
> javascript, which seems to leak memory like a sieve.  Making heavy use
> of the noscript plugin is the only way to keep Firefox usable.
> 
> The only thing I can think of is that this is triggered by something in
> the machine configuration or the specific hardware.  I'm running a
> GENERIC kernel and the only non-standard modification to /usr/src is the
> dummynet AQM patchset.  The latter should have no effect since I'm not
> using ipfw on this machine.
> 
> If I get a chance, I'll try booting my FreeBSD 11 machine with less RAM to
> see if that is a trigger.

Several of my boxes do not run X11 or "... a bunch of bloated X11 clients".
They have 8 GB, 16 GB or 32 GB of RAM (only the 32 GB one runs X11). On all
remote systems running the most recent CURRENT (we are talking about r297237
through r297369 right now) I definitely do not get a fresh prompt
"immediately". It takes up to 60 seconds (and more) to recover, even when the
box is completely idle. In a rapidly growing number of cases I now get broken
pipes. This also happens in sessions running "poudriere options" on larger
installations, and it is completely unacceptable.
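
To make the "compare inactive memory to swap" numbers from earlier in this
thread easier to compare across machines, here is a minimal sketch in Python
that pulls the figures out of top(1) "Mem:" and "Swap:" summary lines. It is
only an illustration: the function names parse_top_line and swap_vs_inactive
are hypothetical, and it assumes the summary-line layout shown in the top
output quoted in this thread (sizes with K/M/G/T suffixes followed by a label).

```python
import re

# Unit suffixes as printed in the top(1) output quoted in this thread;
# every value is converted to MiB. (Assumed layout, not a top(1) API.)
_UNITS = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def parse_top_line(line):
    """Parse a top(1) 'Mem:' or 'Swap:' summary line into {label: MiB}.

    Percentage fields such as '22% Inuse' carry no size suffix and are skipped.
    """
    _, _, fields = line.partition(":")
    result = {}
    for size, unit, label in re.findall(r"(\d+(?:\.\d+)?)([KMGT])\s+(\w+)", fields):
        result[label] = float(size) * _UNITS[unit]
    return result


def swap_vs_inactive(mem_line, swap_line):
    """Return (swap used, inactive memory), both in MiB."""
    mem = parse_top_line(mem_line)
    swap = parse_top_line(swap_line)
    return swap.get("Used", 0.0), mem.get("Inact", 0.0)


if __name__ == "__main__":
    # Summary lines taken from Michael Butler's top output quoted above.
    used, inact = swap_vs_inactive(
        "Mem: 474M Active, 1609M Inact, 764M Wired, 281M Buf, 119M Free",
        "Swap: 4096M Total, 917M Used, 3178M Free, 22% Inuse",
    )
    print(f"swap used: {used:.0f} MiB, inactive: {inact:.0f} MiB")
```

Feeding it the same two lines from an affected and an unaffected box should
show at a glance whether swap is filling up while plenty of inactive memory is
still available.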


