From owner-freebsd-current@freebsd.org Thu Nov 24 18:42:43 2016 Return-Path: Delivered-To: freebsd-current@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 88828C53DCA for ; Thu, 24 Nov 2016 18:42:43 +0000 (UTC) (envelope-from asomers@gmail.com) Received: from mail-qk0-x236.google.com (mail-qk0-x236.google.com [IPv6:2607:f8b0:400d:c09::236]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4809B686 for ; Thu, 24 Nov 2016 18:42:43 +0000 (UTC) (envelope-from asomers@gmail.com) Received: by mail-qk0-x236.google.com with SMTP id q130so57339163qke.1 for ; Thu, 24 Nov 2016 10:42:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=e/xIEb9MjL1zBdYzzxNGxnFR+JKV6JwtId2PDZWYnQk=; b=uzHGX8rAMWsO4c7SRdhP0QmcjLMVVQTGz8+u3Azsl433r/iwbjOOoxT78QmTAGys8D j6jWU2xQRlBWAUNBdqn4uwoArxYn2U7VOMEcPXaFo3Jf/GJZu9X1mBQ3ocHdJ65W5YiT kh4X4Vt4huabOxnPe2aW+MgQZaebCCgZetmL07/vd6gpEHdK9wZPLMWRiAiBiaE44iX3 Cw0ofEn1yFfZ1SHA7+4jVj4DyXAYj8Si8fBlmYnZ3bRPClPAUvdH8iS7lTXsTJHziRWV xrb+a9n9ktFZDdmmGzpCRTVHEquRJ+uQts1u5mSMOLALC+ZP3SAK/s+7/zQ5dS0iD4fp KsGg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=e/xIEb9MjL1zBdYzzxNGxnFR+JKV6JwtId2PDZWYnQk=; b=NEwjZUjpdZuyKMe/AGmHD2guhzkgUFlpu/bVUaT7Uw5FoYIWn2JHu4D81fxiGqi7Kq jkp3Sa9C/aX2690+b9o679XD/PswGj8zOJid0LvJUrDOVtYPLCIoM4xLcZ9KuYUIYdtr E9vYmduvQ56HTdnVEx2B6uDNEMLGr1n42VozM+mpA8dUsEEN5/f9+vBogbT4MguZxUqH y44NgwSk1V1nWiomkgBdamG3r68yiLKkB2prAEScLbIjyd36V+HKxnlCzbDgqGMxrQ+J v0xyIKP4WxuX/Eu0JWC1Pgv1tIBmu2Nw2WCm18vs7r4nzdqVzFxsBUeUegrZy4cpEabk AZ7Q== X-Gm-Message-State: AKaTC02xNi2wQs6nnHlu5te4atsxfM/c115tkv+u6hDjwkftO1QLkLy6/hQhcHsMzM2uYpMnzds33AYMKeuTFw== X-Received: by 10.55.104.208 with SMTP id d199mr3214037qkc.222.1480012962295; Thu, 24 Nov 2016 10:42:42 -0800 (PST) MIME-Version: 1.0 Sender: asomers@gmail.com Received: by 10.12.166.129 with HTTP; Thu, 24 Nov 2016 10:42:41 -0800 (PST) In-Reply-To: References: <20161124090811.GO54029@kib.kiev.ua> From: Alan Somers Date: Thu, 24 Nov 2016 11:42:41 -0700 X-Google-Sender-Auth: fki14GctE6bn_JVtiMd53awAgRY Message-ID: Subject: Re: NFSv4 performance degradation with 12.0-CURRENT client To: Rick Macklem Cc: Konstantin Belousov , FreeBSD CURRENT Content-Type: text/plain; charset=UTF-8 X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 24 Nov 2016 18:42:43 -0000 On Thu, Nov 24, 2016 at 5:53 AM, Rick Macklem wrote: > > On Wed, Nov 23, 2016 at 10:17:25PM -0700, Alan Somers wrote: >> I have a FreeBSD 10.3-RELEASE-p12 server exporting its home >> directories over both NFSv3 and NFSv4. I have a TrueOS client (based >> on 12.0-CURRENT on the drm-next-4.7 branch, built on 28-October) >> mounting the home directories over NFSv4. At first, everything is >> fine and performance is good. But if the client does a buildworld >> using sources on NFS and locally stored objects, performance slowly >> degrades. The degradation is most noticeable with metadata-heavy >> operations. For example, "ls -l" in a directory with 153 files takes >> less than 0.1 seconds right after booting. But the longer the >> buildworld goes on, the slower it gets. Eventually that same "ls -l" >> takes 19 seconds. When the home directories are mounted over NFSv3 >> instead, I see no degradation. >> >> top shows negligible CPU consumption on the server, and very high >> consumption on the client when using NFSv4 (nearly 100%). The >> NFS-using process is spending almost all of its time in system mode, >> and dtrace shows that almost all of its time is spent in >> ncl_getpages(). >> > A couple of things you could do when it slow (as well as what Kostik suggested): > - nfsstat -c -e on client and nfsstat -e -s on server, to see what RPCs are being done > and how quickly. (nfsstat -s -e will also show you how big the DRC is, although a > large DRC should show up as increased CPU consumption on the server) > - capture packets with tcpdump -s 0 -w test.pcap host > - then you can email me test.pcap as an attachment. I can look at it in wireshark > and see if there seem to protocol and/or TCP issues. (You can look at in wireshark > yourself, the look for NFS4ERR_xxx, TCP segment retransmits...) > > If you are using either "intr" or "soft" on the mounts, try without those mount options. > (The Bugs section of mount_nfs recommends against using them. If an RPC fails due to > these options, something called a seqid# can be "out of sync" between client/server and > that causes serious problems.) > --> These seqid#s are not used by NFSv4.1, so you could try that by adding > "minorversion=1" to your mount options. > > Good luck with it, rick I've reproduced the issue on stock FreeBSD 12, and I've also learned that nullfs is a required factor. Doing the buildworld directly on the NFS mount doesn't cause any slowdown, but doing a buildworld on the nullfs copy of the NFS mount does. The slowdown affects the base NFS mount as well as the nullfs copy. Here is the nfsstat output for both server and client duing "ls -al" on the client: nfsstat -e -s -z Server Info: Getattr Setattr Lookup Readlink Read Write Create Remove 800 0 121 0 0 2 0 0 Rename Link Symlink Mkdir Rmdir Readdir RdirPlus Access 0 0 0 0 0 0 0 8 Mknod Fsstat Fsinfo PathConf Commit LookupP SetClId SetClIdCf 0 0 0 0 1 3 0 0 Open OpenAttr OpenDwnGr OpenCfrm DelePurge DeleRet GetFH Lock 0 0 0 0 0 0 123 0 LockT LockU Close Verify NVerify PutFH PutPubFH PutRootFH 0 0 0 0 0 674 0 0 Renew RestoreFH SaveFH Secinfo RelLckOwn V4Create 0 0 0 0 0 0 Server: Retfailed Faults Clients 0 0 0 OpenOwner Opens LockOwner Locks Delegs 0 0 0 0 0 Server Cache Stats: Inprog Idem Non-idem Misses CacheSize TCPPeak 0 0 0 674 16738 16738 nfsstat -e -c -z Client Info: Rpc Counts: Getattr Setattr Lookup Readlink Read Write Create Remove 60 0 119 0 0 0 0 0 Rename Link Symlink Mkdir Rmdir Readdir RdirPlus Access 0 0 0 0 0 0 0 3 Mknod Fsstat Fsinfo PathConf Commit SetClId SetClIdCf Lock 0 0 0 0 0 0 0 0 LockT LockU Open OpenCfr 0 0 0 0 OpenOwner Opens LockOwner Locks Delegs LocalOwn LocalOpen LocalLOwn 5638 141453 0 0 0 0 0 0 LocalLock 0 Rpc Info: TimedOut Invalid X Replies Retries Requests 0 0 0 0 662 Cache Info: Attr Hits Misses Lkup Hits Misses BioR Hits Misses BioW Hits Misses 1275 58 837 121 0 0 0 0 BioRLHits Misses BioD Hits Misses DirE Hits Misses 1 0 6 0 1 0 And here are the most popular stack traces of "ls -al", as observed by dtrace. The number beneath each stack is the number of times dtrace observed that exact stack: kernel`bcmp+0x21 kernel`vinactive+0xc6 kernel`vputx+0x30e kernel`kern_statat+0x165 kernel`sys_fstatat+0x2c kernel`amd64_syscall+0x314 kernel`vputx+0x30e kernel`NDFREE+0xaa kernel`sys___acl_get_link+0x82 kernel`amd64_syscall+0x314 kernel`0xffffffff80eb95fb 96 kernel`nfscl_doclose+0x383 kernel`vinactive+0xc6 kernel`vputx+0x30e kernel`NDFREE+0xaa kernel`sys___acl_get_link+0x82 kernel`amd64_syscall+0x314 kernel`0xffffffff80eb95fb 183 kernel`nfscl_doclose+0x383 kernel`vinactive+0xc6 kernel`vputx+0x30e kernel`kern_statat+0x165 kernel`sys_fstatat+0x2c kernel`amd64_syscall+0x314 kernel`0xffffffff80eb95fb 189 kernel`lock_delay+0x52 kernel`nfs_lookup+0x337 kernel`VOP_LOOKUP_APV+0xda kernel`lookup+0x6a2 kernel`namei+0x57e kernel`sys___acl_get_link+0x55 kernel`amd64_syscall+0x314 kernel`0xffffffff80eb95fb 194 kernel`lock_delay+0x52 kernel`ncl_getattrcache+0x28 kernel`nfs_getattr+0x92 kernel`VOP_GETATTR_APV+0xda kernel`vn_stat+0xa3 kernel`kern_statat+0xde kernel`sys_fstatat+0x2c kernel`amd64_syscall+0x314 kernel`0xffffffff80eb95fb 196 What role could nullfs be playing?