From owner-freebsd-current@freebsd.org  Wed Oct 26 16:45:21 2016
Return-Path: <owner-freebsd-current@freebsd.org>
Delivered-To: freebsd-current@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 15B12C22A5B
 for <freebsd-current@mailman.ysv.freebsd.org>;
 Wed, 26 Oct 2016 16:45:21 +0000 (UTC) (envelope-from uqs@FreeBSD.org)
Received: from acme.spoerlein.net (acme.spoerlein.net
 [IPv6:2a02:2528:fa:1000::1])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "*.spoerlein.net", Issuer "CAcert Class 3 Root" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id B64E4196;
 Wed, 26 Oct 2016 16:45:20 +0000 (UTC) (envelope-from uqs@FreeBSD.org)
Received: from localhost (acme.spoerlein.net [IPv6:2a02:2528:fa:1000:0:0:0:1])
 by acme.spoerlein.net (8.15.2/8.15.2) with ESMTPS id u9QGjI3t077669
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256
 verify=NO); Wed, 26 Oct 2016 18:45:18 +0200 (CEST)
 (envelope-from uqs@FreeBSD.org)
Date: Wed, 26 Oct 2016 18:45:18 +0200
From: Ulrich =?utf-8?B?U3DDtnJsZWlu?= <uqs@FreeBSD.org>
To: Kevin Oberman <rkoberman@gmail.com>, kib@FreeBSD.org,
 Hans Petter Selasky <hps@selasky.org>,
 FreeBSD Current <freebsd-current@freebsd.org>
Subject: Re: 11.x deadlocking during pfault (was Re: FreeBSD 11.x grinds to a
 halt after about 48h of uptime)
Message-ID: <20161026164518.GD2734@acme.spoerlein.net>
Mail-Followup-To: Kevin Oberman <rkoberman@gmail.com>, kib@FreeBSD.org,
 Hans Petter Selasky <hps@selasky.org>,
 FreeBSD Current <freebsd-current@freebsd.org>
References: <20161015161848.GD2532@acme.spoerlein.net>
 <6926bd72-35c9-cb21-4785-b50a05e581be@selasky.org>
 <CAN6yY1toUPRt3zBBJtgDVoFJCmZdgoGaUXznS9BbooN9BbfeEQ@mail.gmail.com>
 <20161024174327.GB2734@acme.spoerlein.net>
 <20161026164343.GC2734@acme.spoerlein.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20161026164343.GC2734@acme.spoerlein.net>
User-Agent: Mutt/1.7.0 (2016-08-17)
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
 <freebsd-current.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current/>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Oct 2016 16:45:21 -0000

On Wed, 2016-10-26 at 18:43:43 +0200, Ulrich Spörlein wrote:
> On Mon, 2016-10-24 at 19:43:27 +0200, Ulrich Spörlein wrote:
> > On Sat, 2016-10-15 at 09:36:27 -0700, Kevin Oberman wrote:
> > > On Sat, Oct 15, 2016 at 9:26 AM, Hans Petter Selasky <hps@selasky.org>
> > > wrote:
> > > 
> > > > On 10/15/16 18:18, Ulrich Spörlein wrote:
> > > >
> > > >> Hey all, while 11.x is -STABLE now, this happens to my machine ever
> > > >> since I upgraded it to 11-CURRENT years ago. I have no idea when this
> > > >> started, actually, but what always happens is this:
> > > >>
> > > >> - System and X11 is up and running, I keep it running over night as I'm
> > > >> too lazy to reboot and restart everthing.
> > > >> - There's a bunch of xterms, Chrome, Clementine-Player and some other
> > > >> programs running
> > > >> - Coming back to the machine the next day (or the day after) it will
> > > >> exit the screensaver just fine and then either I can use it for a couple
> > > >> of seconds before it freezes, or it's pretty much dead already. The
> > > >> mouse cursor still moves for a bit, but the also freezes (so it this a
> > > >> GPU problem??)
> > > >>
> > > >> Now what I currently see on the screen is a clock widget stuck at 18:04
> > > >> but conky itself has last updated at 18:00:18 ...
> > > >>
> > > >> This time I had some SSH sessions from another machine to see some more
> > > >> useful things. There was nothing in various logs under /var/log (I also
> > > >> can't run dmesg anymore ...)
> > > >> I had top(1) running in a loop, this is the last output:
> > > >>
> > > >> last pid: 25633;  load averages:  0.27,  0.39,  0.36  up 1+23:03:28
> > > >> 18:00:12
> > > >> 202 processes: 2 running, 188 sleeping, 11 zombie, 1 waiting
> > > >>
> > > >> Mem: 8873M Active, 1783M Inact, 5072M Wired, 567M Buf, 132M Free
> > > >> ARC: 1844M Total, 469M MFU, 268M MRU, 16K Anon, 96M Header, 1012M Other
> > > >> Swap: 4096M Total, 2395M Used, 1701M Free, 58% Inuse
> > > >>
> > > >>
> > > >>   PID USERNAME      THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU
> > > >> COMMAND
> > > >>    11 root            8 155 ki31     0K   128K CPU0    0 364.6H 772.95%
> > > >> idle
> > > >>              3122 uqs            15  28    0  7113M  5861M uwait   0
> > > >> 94:44  13.96% chrome
> > > >>                            2887 uqs            28  22    0  1394M   237M
> > > >> select  2 172:53   6.98% chrome
> > > >>                                        2890 uqs            11  21    0
> > > >> 1034M   178M select  5 231:21   1.95% chrome
> > > >>                                                    1062 root            9
> > > >> 21    0   440M 47220K select  0  67:09   0.98% Xorg
> > > >>                                                              3002 uqs
> > > >>       15  25    5  1159M   172M uwait   2  19:09   0.00% chrome
> > > >>  3139 uqs            17  25    5  1163M   156M uwait   2  16:15   0.00%
> > > >> chrome
> > > >>  3001 uqs            18  25    5  1639M   575M uwait   0  16:05   0.00%
> > > >> chrome
> > > >>    12 root           24 -64    -     0K   384K WAIT   -1  10:53   0.00%
> > > >> intr
> > > >>  3129 uqs            12  20    0  2820M  1746M uwait   6   8:36   0.00%
> > > >> chrome
> > > >>  2822 uqs             9  20    0   217M 81300K select  0   5:10   0.00%
> > > >> conky
> > > >>  3174 root            1  20    0 21532K  3188K select  0   4:20   0.00%
> > > >> systat
> > > >>  3130 uqs            16  20    0  1058M   131M uwait   4   3:03   0.00%
> > > >> chrome
> > > >>  2998 uqs            16  20    0  1110M   123M uwait   2   2:53   0.00%
> > > >> chrome
> > > >>  3165 uqs            10  20    0  1209M   215M uwait   6   2:52   0.00%
> > > >> chrome
> > > >>  3142 uqs            11  25    5  1344M   195M uwait   2   2:46   0.00%
> > > >> chrome
> > > >>  2876 uqs            19  20    0   580M 37164K select  3   2:42   0.00%
> > > >> clementine-player
> > > >>    20 root            2 -16    -     0K    32K psleep  6   2:25   0.00%
> > > >> pagedaemon
> > > >>
> > > >> I also had systat -vm running and it continued to update its screen ...
> > > >> for a short while, this is the last update before SSH died:
> > > >>
> > > >>
> > > >>    Mem usage:  0k%Phy  5%Kmem
> > > >> Mem: KB    REAL            VIRTUAL                      VN PAGER   SWAP
> > > >> PAGER
> > > >>         Tot   Share      Tot    Share    Free           in   out     in
> > > >>  out
> > > >> Act  11051k   67868 71051992   255448   61840  count
> > > >> All  11051k   67924 71058776   262100          pages
> > > >> Proc:
> > > >> Interrupts
> > > >>   r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        ioflt   224
> > > >> total
> > > >>      25     730  11   724  109  404  101   13             cow       2
> > > >> ehci0 16
> > > >>                                                           zfod      3
> > > >> ehci1 23
> > > >>  0.0%Sys   0.1%Intr  0.0%User  0.0%Nice 99.9%Idle         ozfod    16
> > > >> cpu0:timer
> > > >> |    |    |    |    |    |    |    |    |    |           %ozfod
> > > >>  xhci0 264
> > > >>                                                           daefr     3 em0
> > > >> 265
> > > >>                                         50 dtbuf          prcfr    94
> > > >> hdac1 266
> > > >> Namei     Name-cache   Dir-cache    349167 desvn          totfr
> > > >>  ahci0 270
> > > >>    Calls    hits   %    hits   %    349155 numvn          react     5
> > > >> cpu1:timer
> > > >>      121     121 100                253501 frevn          pdwak     1
> > > >> cpu2:timer
> > > >>                                                           pdpgs    29
> > > >> cpu7:timer
> > > >> Disks   md0  ada0  ada1 pass0 pass1 pass2                 intrn    12
> > > >> cpu3:timer
> > > >> KB/t   0.00  0.00  0.00  0.00  0.00  0.00         5318892 wire     41
> > > >> cpu6:timer
> > > >> tps       0     0     0     0     0     0         9261404 act      12
> > > >> cpu5:timer
> > > >> MB/s   0.00  0.00  0.00  0.00  0.00  0.00         1598184 inact     6
> > > >> cpu4:timer
> > > >> %busy     0     0     0     0     0     0                 cache
> > > >>  vgapci0
> > > >>                                                     61840 free
> > > >>                                                    712304 buf
> > > >>
> > > >>
> > > >> Why do I have a Chrome tab using about 6G? What other sort of debugging
> > > >> output can be helpful to get to the bottom of this? The machine still
> > > >> responds to pings just fine, TCP connections get set up but the SSH
> > > >> handshake never completes.
> > > >>
> > > >> This always happens between 30-50h and is super annoying and has been
> > > >> going on for >1year. Help?
> > > >>
> > > >> Note, I cut the power to the monitor overnight to save electricity, can
> > > >> this mess up something in the Radeon card or X server? What combinations
> > > >> would be most useful to try next?
> > > >>
> > > >>
> > > > Hi,
> > > >
> > > > Sounds like a memory leak. Can you track the memory use over time?
> > > >
> > > > Did you look at the output from:
> > > >
> > > > vmstat -m ?
> > > >
> > > > --HPS
> > > 
> > > 
> > > I have noted significant  memory leakage in chromium for some time. If I
> > > leave it running overnight, my system is essentially frozen. If I terminate
> > > the chromium process, it slowly comes back to life. I always keep a gkrellm
> > > session on-screen where the memory and swap utilization is continuously
> > > displayed and that clearly shows resources declining.
> > 
> > That is not what is happening to my system though, it actually
> > deadlocks. There's no way to recover from it, it seems.
> > 
> > So I killed Chromium overnight each day, and I'm at this:
> > 
> > % top -Sbores
> > last pid: 44526;  load averages:  0.10,  0.11,  0.56  up 7+09:53:30    19:33:25
> > 156 processes: 2 running, 153 sleeping, 1 waiting
> > 
> > Mem: 315M Active, 550M Inact, 5671M Wired, 515M Buf, 9324M Free
> > ARC: 1852M Total, 541M MFU, 196M MRU, 16K Anon, 93M Header, 1022M Other
> > Swap: 4096M Total, 2186M Used, 1910M Free, 53% Inuse
> > 
> > 
> >   PID USERNAME      THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
> >  2755 uqs            10  20    0  1697M   311M select  1  47:23   0.00% conky
> >  2736 uqs            32  20    0   699M   116M select  7  94:29   0.00% clementine-player
> >  3000 uqs            12  20    0  1126M 69380K select  5   9:48   0.00% digikam
> >   960 root            9  20    0   448M 59076K select  0 250:22   0.00% Xorg
> > 72608 uqs             8  20    0   939M 55432K uwait   5   0:01   0.00% chrome
> > 72599 uqs             9  52    0   929M 55116K uwait   0   0:00   0.00% chrome
> >  2567 root            1  20    0 89948K 42964K select  1   1:51   0.00% bsnmpd
> > 70476 uqs             1  20    0 93656K 25712K select  2   0:05   0.00% xterm
> >  2730 uqs             5  20    0   208M 14988K select  1   0:22   0.00% clock-applet
> >   880 root            1  20    0 22628K 12500K select  3   0:20   0.00% ntpd
> >  2726 uqs             4  20    0   206M 12456K select  6   0:09   0.00% mateweather-applet
> > 44352 uqs             1  20    0 75224K 12348K select  4   0:00   0.00% xterm
> > 43049 uqs             1  20    0 75224K 11792K select  5   0:00   0.00% xterm
> >  3074 uqs             2  20    0   308M  9692K select  1   0:02   0.00% kdeinit4
> >  2671 uqs             1  20    0   144M  9488K select  1   0:13   0.00% openbox
> >  3072 uqs             1  20    0   210M  8284K select  3   0:00   0.00% kdeinit4
> >  2724 uqs             4  20    0   154M  8256K select  2   0:19   0.00% wnck-applet
> >  2701 uqs             5  20    0   177M  8144K select  2   0:01   0.00% mate-panel
> > 
> > 
> > 7d running, pretty good. But look closer, the system is doing pretty
> > much nothing but did swap out 2G. What?
> > 
> > > Try closing your chromium at night and see if that fixes the problem.
> > 
> > It's better, but I'm not sure it's a real fix. I've now turned off
> > "hardware acceleration" in Chromium, though chrome://gpu didn't real
> > inspire confidence that it was actually using any h/w accel at all.
> > 
> > 
> > > If you have never tried gkrellm (sysutils/gkrellm2), it is a the best
> > > system monitor I have found. though pulls in a lot of dependencies. It also
> > > can run as a server with remote systems displaying the data. Handy to
> > > monitor servers.
> > 
> > I had a cacti-setup that would also monitor my workstation (through a
> > OpenVPN tunnel), but that has bit-rotted and Apache only gives me 500s
> > on that cacti URL and nothing in the logs, oh well ...)
> > 
> > Hooking up a serial console and testing whether DDB works is probably
> > the next best step to take ...
> 
> Sigh, I forgot to shut down Chrome overnight, and my system is
> deadlocked again. I can still switch virtual desktops in X11 and a
> running xterm accepts keyboard input, but it is 18:35 now and I see that
> my X11 clock has last updated at 17:18 and conky is stuck at 17:14:11
> 
> my top-in-a-loop is stuck here and no longer loops:
> 
> last pid: 73731;  load averages:  0.23,  0.24,  0.23  up 9+07:34:15    17:14:10
> 160 processes: 3 running, 146 sleeping, 11 zombie
> 
> Mem: 9302M Active, 763M Inact, 5682M Wired, 752M Buf, 113M Free
> ARC: 1731M Total, 549M MFU, 129M MRU, 16K Anon, 91M Header, 963M Other
> Swap:
> 
> 
>   PID USERNAME      THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
>   960 root            9  21    0   451M 55796K CPU7    7 299:15   2.98% Xorg
> 53884 uqs            27  22    0   904M   250M CPU4    4 105:05   2.98% chrome
> 54081 uqs            15  21    0  3601M  2527M uwait   6  38:11   0.98% chrome
>  2736 uqs            33  20    0   697M 84620K select  5  95:38   0.00% clementine-player
>  2755 uqs            10  20    0  2111M   721M select  0  60:47   0.00% conky
> 78943 uqs             1  20    0 21532K  3328K select  7  13:12   0.00% systat
> 54048 uqs            15  25    5  1093M   159M uwait   4   9:51   0.00% chrome
>  3000 uqs            12  20    0  1126M 53248K select  7   9:50   0.00% digikam
> 53999 uqs            19  25    5  1525M   514M uwait   5   6:28   0.00% chrome
>  2703 uqs             1  20    0   169M  5948K select  0   5:40   0.00% mate-volume-control
>  2707 uqs             1  20    0 60240K  2516K select  7   5:21   0.00% autocutsel
> 54077 uqs            15  20    0  1803M   821M uwait   6   3:51   0.00% chrome
> 78869 uqs             1  20    0 93480K  6636K select  7   3:38   0.00% sshd
> 38396 uqs             1  20    0 75224K  3896K select  0   3:18   0.00% xterm
>  2567 root            1  20    0 89948K 43892K select  6   2:25   0.00% bsnmpd
>   876 root            9  52    0 30120K  3468K uwait   7   2:20   0.00% nscd
>   883 root            1  20    0 10444K  1708K select  0   2:16   0.00% powerd
>  2586 haldaemon       2  20    0 56620K  3920K select  1   2:04   0.00% hald
> 
> 
> systat -vm:
> 
>    10 users    Load  0.21  0.23  0.23                  Oct 26 17:14
>    Mem usage:  0k%Phy  5%Kmem
> Mem: KB    REAL            VIRTUAL                      VN PAGER   SWAP PAGER
>         Tot   Share      Tot    Share    Free           in   out     in   out
> Act  10545k   51036 67451248   249344   61912  count   120     3
> All  10550k   55504 67480212   275720          pages   807     3
> Proc:                                                            Interrupts
>   r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt    149 ioflt   845 total
>       6     781  12  9327 6923  24k  318   18 6649     20 cow       2 ehci0 16
>                                                      5064 zfod      3 ehci1 23
>  4.7%Sys   0.0%Intr  1.7%User  0.0%Nice 93.6%Idle      16 ozfod   106 cpu0:timer
> |    |    |    |    |    |    |    |    |    |           %ozfod       xhci0 264
> ==>                                                   927 daefr    45 em0 265
>                                         51 dtbuf      545 prcfr    94 hdac1 266
> Namei     Name-cache   Dir-cache    349167 desvn     1787 totfr   175 ahci0 270
>    Calls    hits   %    hits   %    337762 numvn          react    50 cpu1:timer
>      935     901  96                170078 frevn          pdwak    48 cpu3:timer
>                                                   1435985 pdpgs    66 cpu2:timer
> Disks   md0  ada0  ada1 pass0 pass1 pass2                 intrn    77 cpu7:timer
> KB/t   0.00 22.59  0.00  0.00  0.00  0.00         5819100 wire     75 cpu4:timer
> tps      17   168     0     0     0     0         9723008 act      51 cpu6:timer
> MB/s   0.00  3.70  0.00  0.00  0.00  0.00          636300 inact    53 cpu5:timer
> %busy     0     3     0     0     0     0                 cache       vgapci0
>                                                     61912 free
>                                                    770732 buf
> 
> 
> I'm unable to scroll back the vmstat -m output in my tmux pane (running on a
> different system, this is super strange), so all I can show is this:
> 
> 
>      inodedep    16 12293K       -  2770872  512
>     bmsafemap     9    49K       -  2136243  256,8192
>        newblk    28 24582K       -  4111401  256
>      indirdep     4     1K       -   865983  128,16384
>      freefrag     6     1K       -   682027  128
>      freeblks     3     1K       -  1318985  256
>      freefile     3     1K       -  1272832  64
>        diradd     2     1K       -  1306741  128
>         mkdir     0     0K       -     5120  128
>        dirrem     0     0K       -  1305358  128
>     newdirblk     0     0K       -     2610  64
>      freework     9     1K       -  1566489  64,128
>         sbdep     0     0K       -    45661  64
>      savedino     0     0K       -   186280  256
>       softdep     6     3K       -        6  512
>   ufs_dirhash  1533   767K       -   109513  16,32,64,128,256,512,1024,2048,4096
>     ufs_quota     1  2048K       -        1
>     ufs_mount    18    55K       -       18  512,2048,4096,8192
>     vm_pgdata     1  2048K       -        2  128
>       UMAHash    23  5385K       -      108  512,1024,2048,4096,8192,16384,32768,65536
>       memdesc     1     4K       -        1  4096
>      atkbddev     2     1K       -        2  64
>       entropy     1     1K       -   894489  32,4096
>        apmdev     1     1K       -        1  128
>    madt_table     0     0K       -        1  4096
>      SCSI ENC    25   100K       -   120744  16,64,256,2048,32768
>       io_apic     1     2K       -        1  2048
>           MCA    18     3K       -       18  64,128
>           msi    16     2K       -       16  128
>      nexusdev     5     1K       -        5  16
>          hdaa     5    54K       -        5  2048,16384,32768
>          hdac     1     1K       -        1  1024
>         hdacc     1     1K       -        1  32
>         linux    29     2K       -       29  64
>       solaris 295750 228696K       - 140520323  16,32,64,128,256,512,1024,2048,4096,8192,32768
>    kstat_data     6     1K       -        6  64
>      eli data    22     4K       -  6218901  64,256,512,1024,2048,4096,8192,16384,32768,65536
>          ksem     1     8K       -        1  8192
>   nullfs_hash     1  2048K       -        1
>   nullfs_node     9     1K       -       41  64
>  nullfs_mount     9     1K       -        9  32
>   fdesc_mount     1     1K       -        1  16
>      gem_name    46    14K       -      122  32,4096
>    drm_global     2     1K       -        2  128,256
>       drm_dma     1     1K       -        1  32
>     drm_sarea     1     1K       -        1  16
>    drm_driver    91  2278K       -      125  16,32,64,128,256,512,1024,2048,4096,8192,32768
>     drm_minor     2     1K       -        2  128
>     drm_files     1     1K       -        3  512
> drm_ctxbitmap     1     4K       -        1  4096
>      drm_sman    41     6K       -       42  128
>   drm_hashtab     3  4129K       -        4  128,32768
>       drm_kms    69    19K       -      163  16,32,64,128,256,512,2048,4096,8192
>    drm_vblank     7     1K       -        7  32,256
>        ttm_pd    16    17K       -       18  16,128,2048,65536
>      ttm_rman     2     1K       -        2  256
>      ttm_zone     2     1K       -        2  64
>   ttm_poolmgr     1     1K       -        1  512
> 
> 
> Now what?
> 
> The xterm I have running locally with a stuck top is showing the top 3 chrome
> processes in pfault state and it has "Swap: 11M In" in the header, so clearly
> 11.x is prone to deadlock during page faults and or swapping. It has last
> updated on 17:14:13 (compare to the other top at 17:14:10 not showing pfault
> state yet).

Addendum: I still see the HDD indication light flickering every couple
of seconds, so something is still doing I/O. My SSH sessions into the
machine haven't time out, and the screensaver (just DPMS blank) is also
still kicking in correctly.