From: Fabian Freyer <fabian.freyer@physik.tu-berlin.de>
Subject: Re: kernel process [nfscl] high cpu
To: freebsd-stable@freebsd.org
Cc: rmacklem@uoguelph.ca, lists@searchy.net, zeha@physik.tu-berlin.de,
    steff@physik.tu-berlin.de
Date: Sun, 15 Oct 2017 22:45:00 +0200
Message-ID: <39012cc6-a527-eb8e-3a6f-7e24cd9d98d8@physik.tu-berlin.de>
In-Reply-To: <924567664.10016007.1443129423733.JavaMail.zimbra@uoguelph.ca>

Hi,

(I'm not on this list; please CC me on future replies. My apologies to
those who get this message twice: I had a typo in the To: header and
had to re-send it to the list.)

Sorry for reviving such an old thread, but I've recently run into this
problem as well, on an 11.1-RELEASE-p1 jail host mounting NFSv4.0
shares into jails.

On 24.09.2015 23:17, Rick Macklem wrote:
> Frank de Bot wrote:
>> Rick Macklem wrote:
>>> Frank de Bot wrote:
>>>> Rick Macklem wrote:
>>>>> Frank de Bot wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On a 10.1-RELEASE-p9 server I have several NFS mounts used for a
>>>>>> jail. Because it's a test-only server, the load is low, but the
>>>>>> [nfscl] process starts hogging a CPU after a while. This happens
>>>>>> pretty fast, within 1 or 2 days; I notice the high CPU usage
>>>>>> when I come back to run some tests after those 1 or 2 days.
Here's my ps ax | grep nfscl:

# ps ax | grep nfscl
11111  -  DL    932:08.74 [nfscl]
11572  -  DL    442:27.42 [nfscl]
30396  -  DL    933:44.13 [nfscl]
35902  -  DL    442:08.70 [nfscl]
40881  -  DL    938:56.04 [nfscl]
43276  -  DL    932:38.88 [nfscl]
49178  -  DL    934:24.77 [nfscl]
56314  -  DL    935:21.55 [nfscl]
60085  -  DL    936:37.11 [nfscl]
71788  -  DL    933:10.96 [nfscl]
82001  -  DL    934:45.76 [nfscl]
86222  -  DL    931:42.94 [nfscl]
92353  -  DL   1186:53.38 [nfscl]
21105 20  S+      0:00.00 grep nfscl

And this is on a 12-core machine with Hyperthreading:

# uptime
 7:28PM  up 11 days,  4:50, 4 users, load averages: 25.49, 21.91, 20.25

Most of this load is being generated by the nfscl threads.

>>>>>> My jail.conf looks like:
>>>>>>
>>>>>> exec.start = "/bin/sh /etc/rc";
>>>>>> exec.stop = "/bin/sh /etc/rc.shutdown";
>>>>>> exec.clean;
>>>>>> mount.devfs;
>>>>>> exec.consolelog = "/var/log/jail.$name.log";
>>>>>> #mount.fstab = "/usr/local/etc/jail.fstab.$name";
>>>>>>
>>>>>> test01 {
>>>>>>     host.hostname = "test01_hosting";
>>>>>>     ip4.addr = somepublicaddress;
>>>>>>     ip4.addr += someprivateaddress;
>>>>>>
>>>>>>     mount = "10.13.37.2:/tank/hostingbase /opt/jails/test01 nfs nfsv4,minorversion=1,pnfs,ro,noatime 0 0";
>>>>>>     mount += "10.13.37.2:/tank/hosting/test /opt/jails/test01/opt nfs nfsv4,minorversion=1,pnfs,noatime 0 0";
>>>>>>
>>>>>>     path = "/opt/jails/test01";
>>>>>> }
>>>>>>
>>>>>> The last test was with NFSv4.1; I also tried NFSv4.0, with the
>>>>>> same result. In the read-only NFS share there are symbolic links
>>>>>> pointing to the read-write share for logging, storing .run
>>>>>> files, etc. When I monitor my network interface with tcpdump,
>>>>>> there is little NFS traffic; there is only activity when I
>>>>>> actually access the shares.
>>>>>>
>>>>>> What is causing nfscl to run around in circles, hogging the CPU
>>>>>> (it makes the system slow to respond, too), and how can I find
>>>>>> out the cause?
>>>>>>
>>>>> Well, the nfscl does server->client RPCs referred to as
>>>>> callbacks. I have no idea what the implications of running it in
>>>>> a jail are, but I'd guess that these server->client RPCs get
>>>>> blocked somehow, etc... (The NFSv4.0 mechanism requires a
>>>>> separate IP address on the client that the server can connect to.
>>>>> For NFSv4.1, it should use the same TCP connection as is used for
>>>>> the client->server RPCs. The latter seems like it should work,
>>>>> but there is probably some glitch.)
>>>>>
>>>>> ** Just run without the nfscl daemon (it is only needed for
>>>>> delegations or pNFS).
>>>>
>>>> How can I disable the nfscl daemon?
>>>>
>>> Well, the daemon for the callbacks is called nfscbd.
>>> You should check via "ps ax" to see if you have it running.
>>> (For NFSv4.0 you probably don't want it running, but for NFSv4.1
>>> you do need it. pNFS won't work at all without it, but unless you
>>> have a server that supports pNFS, it won't work anyhow. Unless your
>>> server is a clustered NetApp filer, you should probably not use the
>>> "pnfs" option.)
>>>
>>> Setting:
>>>     nfscbd_enable="TRUE"
>>> in your /etc/rc.conf will start it on boot.
>>> Alternately, just run "nfscbd" as root.
>>>
>>> The "nfscl" thread is always started when an NFSv4 mount is done.
>>> It does an assortment of housekeeping things, including a Renew op
>>> to make sure the lease doesn't expire. If for some reason the jail
>>> blocks these Renew RPCs, it will try to do them over and over,
>>> because having the lease expire is bad news for NFSv4. How could
>>> you tell? Well, capturing packets between the client and server and
>>> then looking at them in wireshark is probably the only way. (Or
>>> maybe a large count for Renew in the output of "nfsstat -e".)
>>>
>>> "nfscbd" is optional for NFSv4.0. Without it, you simply don't do
>>> callbacks/delegations. For NFSv4.1 it is pretty much required, but
>>> it doesn't need a separate server->client TCP connection.
>>> --> I'd enable it for NFSv4.1, but disable it for NFSv4.0, at least
>>> as a starting point.
>>>
>>> And as I said before, none of this is tested within jails, so I
>>> have no idea what effect the jails have. Someone who understands
>>> jails might have some insight w.r.t. this?
>>>
>>> rick
>>>
>> Since last time I haven't tried to use pNFS and have just stuck with
>> NFSv4.0. nfscbd is not running. The server is now running 10.2. The
>> number of Renews is not very high (56k; Getattr, for example, is at
>> 283M). Viewed in wireshark, the Renew calls look good and the NFS
>> status is OK.
>>
>> Is there a way to know what [nfscl] is busy with?
>>
> Not that I can think of. When I do "ps axHl" I see it in DL state and
> not doing much of anything. (You could try setting "sysctl
> vfs.nfs.debuglevel=4", but I don't think you'll see anything
> syslog'd that is useful?)
> This is what I'd expect for an NFSv4.0 mount without the nfscbd
> running.

nfscbd is not running here either. I've done a bit of digging with
kgdb, attaching to a thread of one of the nfscl's, and grabbed a
backtrace:

--8<---- snip
(kgdb) info threads
[...]
  678 Thread 101151 (PID=82001: nfscl)  0xffffffff80a9780a in sched_switch ()
[...]
(kgdb) thread 678
[Switching to thread 678 (Thread 101151)]#0  0xffffffff80a9780a in sched_switch ()
(kgdb) bt
#0  0xffffffff80a9780a in sched_switch ()
#1  0xffffffff80a75a35 in mi_switch ()
#2  0xffffffff80abaa82 in sleepq_timedwait ()
#3  0xffffffff80a75435 in _sleep ()
#4  0xffffffff8095eb7d in nfscl_renewthread ()
#5  0xffffffff80985204 in start_nfscl ()
#6  0xffffffff80a2f815 in fork_exit ()
#7  0xffffffff80ec3b7e in fork_trampoline ()
#8  0x0000000000000000 in ?? ()
--8<---- snap

So, as you said, the nfscl thread is mostly hanging in a sleep.
However, roughly every 30-50 seconds (this corresponds quite well with
the timing of the Renew calls on the wire) the CPU usage spikes to
100% on all 24 threads.
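Putting Rick's description (quoted further down) together with that
backtrace, my mental model of the renew loop is roughly the sketch
below. To be clear, this is only a sketch, not the actual
sys/fs/nfsclient code: the sketch_* helpers and the stub struct are
made up, and I am guessing that the dead-process housekeeping runs on
the same cadence as the Renews, since the spikes line up with them:

--8<---- snip
#include <sys/param.h>
#include <sys/systm.h>		/* tsleep() */
#include <sys/kernel.h>		/* hz */
#include <sys/proc.h>		/* PVFS */
#include <sys/time.h>		/* time_uptime */

/* Stub stand-in for the real per-mount nfscl client state. */
struct sketch_clclient {
	time_t	renew_interval;	/* some fraction of the server's lease */
};

static void	sketch_do_renew_rpc(struct sketch_clclient *);
static void	sketch_reap_dead_owners(struct sketch_clclient *);

static void
sketch_renewthread(struct sketch_clclient *clp)
{
	time_t lastrenew = time_uptime;

	for (;;) {
		/* Wake up roughly once per second. */
		(void)tsleep(clp, PVFS, "nfscl", hz);

		if (time_uptime - lastrenew < clp->renew_interval)
			continue;
		lastrenew = time_uptime;

		/* The Renew call/reply pair visible in tcpdump. */
		sketch_do_renew_rpc(clp);

		/*
		 * Housekeeping: check every open/lock owner for a dead
		 * process and release its state. This is where
		 * nfscl_cleanup_common() and nfscl_procdoesntexist()
		 * come in, as the dtrace numbers below show.
		 */
		sketch_reap_dead_owners(clp);
	}
}
--8<---- snap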
dtrace(1)'ing during those spikes shows a huge number of calls (over
12000 per 5-second interval) to nfscl_procdoesntexist() and
nfscl_cleanup_common():

Calls (count per 5-second interval):
  nfscl_postop_attr                12
  nfscl_loadattrcache              17
  nfscl_request                    22
  nfscl_deleggetmodtime            26
  nfscl_reqstart                   26
  nfscl_mustflush                  30
  nfscl_nodeleg                    34
  nfscl_cleanup_common          12005
  nfscl_procdoesntexist         12315

Times (total ns per 5-second interval):
  nfscl_loadattrcache           41949
  nfscl_deleggetmodtime         63014
  nfscl_postop_attr             65046
  nfscl_reqstart                77851
  nfscl_procdoesntexist      30187728
  nfscl_request             423855128
  nfscl_nodeleg             841033772
  nfscl_mustflush          4838119753
  nfscl_cleanup_common     6436207841

For reference, here's the dtrace(1) script I'm using:

--8<---- snip
#!/usr/sbin/dtrace -Cs

#pragma D option quiet

/* Count calls and record a per-thread entry timestamp. */
fbt::nfscl_*:entry
{
	@entries[probefunc] = count();
	self->ts[probefunc] = timestamp;
}

/* Sum up the time (ns) spent in each function. */
fbt::nfscl_*:return
/self->ts[probefunc]/
{
	@times[probefunc] = sum(timestamp - self->ts[probefunc]);
	self->ts[probefunc] = 0;
}

/* Print and reset both aggregations every 5 seconds. */
tick-5sec
{
	printa(@times);
	clear(@times);
	printa(@entries);
	clear(@entries);
}
--8<---- snap

For completeness, I guess I *should* synchronize this with the for(;;)
loop in nfscl_renewthread() and count the actual number of process
terminations. On the other hand, I am quite certain (just from
watching the number of running processes and PIDs) that the real
number of terminations is nowhere near those counts.

Digging a little deeper, I see that pfind(9), or more precisely
pfind_locked(), is used in nfscl_procdoesntexist(). What I do not
understand yet is how pfind(9) interacts with jails. Maybe someone(TM)
could shed some light on this?

> Basically, when the nfscbd isn't running, the server shouldn't issue
> any delegations, because it shouldn't see a callback path (a
> server->client TCP connection). Also, if you are using a FreeBSD NFS
> server, it won't issue delegations unless you've enabled that, which
> isn't the default.
>
> Check that Delegs in "nfsstat -e" is 0.

It is.

> If it is, then all the nfscl should be doing is waking up once per
> second and doing very little except a Renew RPC once every 30-60
> seconds (a fraction of the server's lease duration).

A tcpdump shows at least one Renew call/reply pair per minute.

> The only thing I can think of that might cause it to run a lot would
> be some weirdness related to the TOD clock. It msleep()s for hz and
> also checks for time_uptime (which should have a resolution of
> seconds) != the previous time. (If the msleep()s were waking up too
> frequently, then it would loop around doing not much of anything,
> over and over and over again...)
>
>> I do understand nfs + jails could have issues, but I'd like to
>> understand them.

Fabian
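P.S. Here is how I currently picture the per-owner cleanup that the
dtrace numbers point at. This is a sketch of the pattern only, not the
actual nfscl code, and the sketch_* names are made up. As far as I can
tell, PIDs in FreeBSD are global across jails (jails restrict the
visibility of processes, they don't get their own PID namespace), so
pfind(9) should find jailed processes just fine; the cost is one
process-table lookup per open/lock owner on every pass, which would
explain why the spikes grow with the number of owners the jails
accumulate:

--8<---- snip
#include <sys/param.h>
#include <sys/proc.h>		/* struct proc, pfind(), PROC_UNLOCK() */
#include <sys/queue.h>

/* Stub stand-in for an NFSv4 open/lock owner. */
struct sketch_owner {
	LIST_ENTRY(sketch_owner) link;
	pid_t	pid;		/* process that created this owner */
};
LIST_HEAD(sketch_owner_list, sketch_owner);

static void	sketch_release_owner(struct sketch_owner *);

/*
 * Once per renew pass, walk all owners and drop state belonging to
 * processes that no longer exist. With many owners (e.g. lots of
 * short-lived processes in jails running on NFS mounts), this walk
 * is what I suspect burns the CPU during the spikes.
 */
static void
sketch_reap_dead_owners(struct sketch_owner_list *owners)
{
	struct sketch_owner *op, *nop;
	struct proc *p;

	LIST_FOREACH_SAFE(op, owners, link, nop) {
		p = pfind(op->pid);	/* global pidhash lookup */
		if (p != NULL) {
			PROC_UNLOCK(p);	/* pfind() returns it locked */
			continue;
		}
		/* The owner's process has exited; release its state. */
		LIST_REMOVE(op, link);
		sketch_release_owner(op);
	}
}
--8<---- snap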