From: Niels Kobschätzki
To: Rick Macklem, "freebsd-net@freebsd.org"
Subject: Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release
Date: Sat, 14 Apr 2018 07:14:17 +0200
List-Id: Networking and TCP/IP with FreeBSD

On 04/14/2018 03:49 AM, Rick Macklem wrote:
> Niels Kobschätzki wrote:
>> sorry for the cross-posting but so far I had no real luck on the forum
>> or on question, thus I want to try my luck here as well.
> I read email lists but don't do the other stuff, so I just saw this yesterday.
> Short answer, I haven't a clue why cache hit rate would have changed.
>
> The code that decides if there is a hit/miss for the attribute cache is in
> ncl_getattrcache() and the code hasn't changed between 10.3->11.1,
> except the old code did a mtx_lock(&Giant), but I can't imagine how that
> would affect the code.
>
> You might want to:
> # sysctl -a | fgrep vfs.nfs
> for both the 10.3 and 11.1 systems, to check if any defaults have somehow
> been changed. (I don't recall any being changed, but??)

I did that, and nothing had changed.

> If you go into ncl_getattrcache() {it's in sys/fs/nfsclient/nfs_clsubs.c}
> and add a printf() for "time_second" and "np->n_mtime.tv_sec" near the
> top, where it calculates "timeo" from it.
> Running this hacked kernel might show you if either of these fields is bogus.
> (You could then printf() "timeo" and "np->n_attrtimeo" just before the "if"
> clause that increments "attrcache_misses", which is where the cache misses
> happen, to see why it is missing the cache.)
> If you could do this for the 10.3 kernel as well, this might indicate why the
> miss rate has increased?

I will do this next week. On Monday we are switching to other NFS servers
for unrelated reasons, and once we see that they run stably, I will do
this next.

Btw., I have now calculated the percentages. The old servers had an attr
miss rate of something like 0.004%, while the upgraded one has more like
2.7%. This is still low from what I've read (I remember that you should
start adjusting acreg* when you hit more than 40% misses) but far higher
than before.

nfsstat -c for one of the working servers looks like this (I did a -cz
before to reset it and ran this a couple of seconds later):

Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
 10085375       255   9163995       577       540         0         0         0
BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses Accs Hits    Misses
     1380         0         0         0         0         0   9169427       277

and for the non-working server:

Attr Hits    Misses Lkup Hits    Misses BioR Hits    Misses BioW Hits    Misses
  1606365     20647   1418205       239       581         0         0         0
BioRLHits    Misses BioD Hits    Misses DirE Hits    Misses Accs Hits    Misses
      895         0         0         0         0         0   1439080       337

>> I upgraded a machine from 10.3-Prerelease (custom kernel with
>> tcp_fastopen added) to 11.1-Release (standard kernel) with
>> freebsd-update. I have two other machines that are still on
>> 10.3-Prerelease. Those machines mount an NFS-export from a
>> Linux-NFS-server and use NFSv3. The machine that got upgraded shows now
>> far more cache misses for getattr than on the 10.3-machines (we talk a
>> factor of 100) in munin. munin also shows a lot more cache-misses for
>> other metrics like biow, biorl, biod (where can I find what those
>> metrics mean…currently I have not even an understanding what these are)
>> etc.
>>
>> Can anybody help me debug this problem, or does anybody have an idea
>> what could cause it? The result of this behavior is that this
>> machine shows a lower performance than the others and I cannot upgrade
>> other machines until I have fixed this bug.
> I haven't run a 10.x system in quite a while. When I get home in a few days,
> I might be able to reproduce this. If I can, I can poke at it, but it would
> be at least a week before I might have an answer and I may not figure it out
> for a long time.

Ok, thanks a lot. That would be great.

Niels
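
P.S. For anyone who wants to redo the percentage calculation: a minimal
sketch in Python, assuming the miss rate is defined as
misses / (hits + misses). The function name and the labels are made up
for illustration; the counters are the Attr Hits/Misses values from the
two nfsstat -c samples above. For these two short samples it gives
roughly 0.0025% and 1.27%, so the exact figures depend on the sampling
window.

```python
def miss_rate_percent(hits: int, misses: int) -> float:
    """Return misses as a percentage of all lookups (hits + misses).

    Assumption: "miss rate" means misses / (hits + misses); nfsstat
    itself only prints the raw counters.
    """
    total = hits + misses
    if total == 0:
        return 0.0  # no lookups counted yet, e.g. right after a reset
    return 100.0 * misses / total


# Attr Hits / Misses copied from the two nfsstat -c samples in this mail.
# The dictionary keys are illustrative labels, not real host names.
samples = {
    "working client (10.3)": (10085375, 255),
    "non-working client (11.1)": (1606365, 20647),
}

for name, (hits, misses) in samples.items():
    rate = miss_rate_percent(hits, misses)
    print(f"{name}: {rate:.4f}% attr cache misses")
```

The same function can be applied to the Lkup and Accs columns to compare
the other caches between the two machines.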