From: Niels Kobschaetzki
To: Rick Macklem
Cc: "freebsd-net@freebsd.org"
Subject: Re: High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release
Date: Sun, 15 Apr 2018 13:10:24 +0200
Message-Id: <36907CE0-EAD3-4E11-8023-5BCEA1239813@kobschaetzki.net>
List-Id: Networking and TCP/IP with FreeBSD

> On 15. Apr 2018, at 01:18, Rick Macklem wrote:
>
> Niels Kobschätzki wrote:
>>> On 04/14/2018 03:49 AM, Rick Macklem wrote:
>>> Niels Kobschätzki wrote:
>>>> sorry for the cross-posting but so far I had no real luck on the forum
>>>> or on question, thus I want to try my luck here as well.
>>> I read email lists but don't do the other stuff, so I just saw this yesterday.
>>> Short answer, I haven't a clue why cache hits rate would have changed.
>>>
>>> The code that decides if there is a hit/miss for the attribute cache is in
>>> ncl_getattrcache() and the code hasn't changed between 10.3->11.1,
>>> except the old code did a mtx_lock(&Giant), but I can't imagine how that
>>> would affect the code.
>>>
>>> You might want to:
>>> # sysctl -a | fgrep vfs.nfs
>>> for both the 10.3 and 11.1 systems, to check if any defaults have somehow
>>> been changed. (I don't recall any being changed, but??)
>>
>> I did that and nothing had changed.
>>
>>> If you go into ncl_getattrcache() {it's in sys/fs/nfsclient/nfs_clsubs.c}
>>> and add a printf() for "time_second" and "np->n_mtime.tv_sec" near the
>>> top, where it calculates "timeo" from it.
>>> Running this hacked kernel might show you if either of these fields is bogus.
>>> (You could then printf() "timeo" and "np->n_attrtimeo" just before the "if"
>>> clause that increments "attrcache_misses", which is where the cache misses
>>> happen to see why it is missing the cache.)
>>> If you could do this for the 10.3 kernel as well, this might indicate why the
>>> miss rate has increased?
>>
>> I will do this next week. On Monday we switch to other NFS servers for
>> unrelated reasons, and once we see that they run stable, I will do this next.
> With a miss rate of 2.7%, I doubt printing the above will help. I thought
> you were seeing a high miss rate.

It is low, but it increased by nearly a factor of 1000 compared to before. I hope the printf output will help. It will just take a lot of grepping through wherever I can get this data.

>> Btw., I have now calculated the percentages. The old servers had an attr miss
>> rate of something like 0.004%, while the upgraded one has more like
>> 2.7%. This is still low from what I've read (I remember that you should
>> start adjusting acreg* when you hit more than 40% misses) but far higher
>> than before.
> You could try increasing acregmin, acregmax and see if the misses are reduced.
> (The only risk with increasing the cache timeout is that, if another client changes
> the attributes, then the client will use stale ones for longer. Usually, this doesn't
> cause serious problems.)

I tried that and it had no effect at all.

> To be honest, a Getattr RPC is pretty low overhead, so I doubt the increase
> to 2.7% will affect your application's performance, but it is interesting that
> it increased.

It is a website with quite some traffic, handled by three webservers behind a pair of load balancers.
We see a 20% loss in speed (TTFB is about 100 ms worse; that does not sound like a lot, but Google et al. don't like it at all) after upgrading to 11.1 combined with an upgrade to php7.1. On another server without NFS, the same upgrade improved performance considerably (I was told ca. 30% by the front-end dev).

> You might also try increasing acdirmin, acdirmax in case it is the directory
> attributes that are having cache misses.

I did that, too.

> Oh, and check that your time of day clocks are in sync with the server,
> since the caches are time based, since there is no cache coherency protocol
> in NFS.

I checked that. All three frontends use the same NTP server.

Thanks so far,

Niels
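
For reference, the miss-rate comparison discussed in this thread can be reproduced with a few lines of Python. This is only a sketch: the function names and the hit/miss counter values in the usage note are hypothetical, and only the two percentages (roughly 0.004% before, roughly 2.7% after) come from the thread itself.

```python
def miss_rate_percent(hits: int, misses: int) -> float:
    """Return the attribute-cache miss rate as a percentage of all lookups."""
    total = hits + misses
    if total == 0:
        # No lookups recorded yet, so there is nothing to report.
        return 0.0
    return 100.0 * misses / total


def increase_factor(old_percent: float, new_percent: float) -> float:
    """Return how many times larger the new miss rate is than the old one."""
    return new_percent / old_percent


# Approximate percentages quoted in the thread (not raw counters).
old_rate = 0.004  # percent, servers before the upgrade
new_rate = 2.7    # percent, upgraded server

factor = increase_factor(old_rate, new_rate)
print(f"miss rate grew by a factor of about {factor:.0f}")
```

With the rounded figures from the thread, the factor comes out at several hundred rather than a full 1000. Given raw hit and miss counters from each client (hypothetical values here), `miss_rate_percent(973, 27)` would give about 2.7.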