From: Nikolay Denev <ndenev@gmail.com>
Date: Sat, 6 Oct 2012 14:20:11 +0300
To: Rick Macklem
Cc: freebsd-fs@freebsd.org, rmacklem@freebsd.org, hackers@freebsd.org, Garrett Wollman
Subject: Re: NFS server bottlenecks

On Oct 4, 2012, at 12:36 AM, Rick Macklem wrote:

> Garrett Wollman wrote:
>> <> said:
>>
>>>> Simple: just use a separate mutex for each list that a cache entry
>>>> is on, rather than a global lock for everything. This would reduce
>>>> the mutex contention, but I'm not sure how significantly since I
>>>> don't have the means to measure it yet.
>>>>
>>> Well, since the cache trimming is removing entries from the lists, I
>>> don't see how that can be done with a global lock for list updates?
>>
>> Well, the global lock is what we have now, but the cache trimming
>> process only looks at one list at a time, so not locking the list that
>> isn't being iterated over probably wouldn't hurt, unless there's some
>> mechanism (that I didn't see) for entries to move from one list to
>> another. Note that I'm considering each hash bucket a separate
>> "list". (One issue to worry about in that case would be cache-line
>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>> ought to be increased to reduce that.)
>>
> Yea, a separate mutex for each hash list might help. There is also the
> LRU list that all entries end up on, that gets used by the trimming code.
> (I think? I wrote this stuff about 8 years ago, so I haven't looked at
> it in a while.)
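To make sure I follow the proposal, it amounts to something like the
sketch below, where each hash chain carries its own mutex and the shared
LRU list keeps a separate one? (All names here are made up for
illustration, not the actual nfs_nfsdcache.c structures.)

    /*
     * Hypothetical sketch only -- invented names, not the real DRC code.
     * One mutex per hash bucket, plus one for the global LRU list that
     * the trimming code walks.
     */
    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/queue.h>

    #define DRC_HASHSIZE    500              /* cf. NFSRVCACHE_HASHSIZE */

    struct drc_entry {
            LIST_ENTRY(drc_entry)   de_hash; /* per-bucket chain */
            TAILQ_ENTRY(drc_entry)  de_lru;  /* global LRU list */
            /* ... xid, procnum, cached reply, timestamps ... */
    };

    struct drc_bucket {
            struct mtx              b_mtx;   /* protects b_head only */
            LIST_HEAD(, drc_entry)  b_head;
    };

    static struct drc_bucket        drc_hash[DRC_HASHSIZE];
    static struct mtx               drc_lru_mtx; /* protects drc_lru */
    static TAILQ_HEAD(, drc_entry)  drc_lru;

Since every entry sits on both a bucket chain and the LRU list, an
eviction needs two locks taken in a consistent order to stay
deadlock-free, which is presumably why a single global lock was the
simple choice originally.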
> Also, increasing the hash table size is probably a good idea, especially
> if you reduce how aggressively the cache is trimmed.
>
>>> Only doing it once/sec would result in a very large cache when
>>> bursts of traffic arrive.
>>
>> My servers have 96 GB of memory so that's not a big deal for me.
>>
> This code was originally "production tested" on a server with 1Gbyte,
> so times have changed a bit;-)
>
>>> I'm not sure I see why doing it as a separate thread will improve
>>> things. There are N nfsd threads already (N can be bumped up to 256
>>> if you wish) and having a bunch more "cache trimming threads" would
>>> just increase contention, wouldn't it?
>>
>> Only one cache-trimming thread. The cache trim holds the (global)
>> mutex for much longer than any individual nfsd service thread has any
>> need to, and having N threads doing that in parallel is why it's so
>> heavily contended. If there's only one thread doing the trim, then
>> the nfsd service threads aren't spending time contending on the
>> mutex (it will be held less frequently and for shorter periods).
>>
> I think the little drc2.patch, which will keep the nfsd threads from
> acquiring the mutex and doing the trimming most of the time, might be
> sufficient. I still don't see why a separate trimming thread would be
> an advantage. I'd also be worried that the one cache trimming thread
> won't get the job done soon enough.
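If I read the drc2 idea correctly (I'm guessing at the internals, and
all names below are invented), each nfsd thread would do a cheap
unlocked check first and only take the global mutex when the cache has
actually grown past a high-water mark:

    /*
     * Hypothetical sketch only -- not drc2.patch itself.  The common
     * case is a single unlocked comparison; the mutex is taken only
     * when there is real trimming work to do.
     */
    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx       drc_global_mtx;       /* the contended lock */
    static volatile u_int   drc_cachesize;        /* unlocked estimate */
    static u_int            drc_highwater = 4096; /* trim trigger */

    /* Assumed helper: evicts one LRU entry, decrements drc_cachesize. */
    static void             drc_trim_one(void);

    static void
    drc_maybe_trim(void)
    {
            /* Common case: cache is small enough, touch no locks. */
            if (drc_cachesize <= drc_highwater)
                    return;
            mtx_lock(&drc_global_mtx);
            while (drc_cachesize > drc_highwater)
                    drc_trim_one();
            mtx_unlock(&drc_global_mtx);
    }

The unlocked read of drc_cachesize is racy, but a stale value only
delays or repeats the check; correctness still comes from the mutex
around the actual list surgery.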
> When I did production testing on a 1Gbyte server that saw a peak
> load of about 100 RPCs/sec, it was necessary to trim aggressively.
> (Although I'd be tempted to say that a server with 1Gbyte is no
> longer relevant, I recently recall someone trying to run FreeBSD
> on an i486, although I doubt they wanted to run the nfsd on it.)
>
>>> The only negative effect I can think of w.r.t. having the nfsd
>>> threads doing it would be a (I believe negligible) increase in RPC
>>> response times (the time the nfsd thread spends trimming the cache).
>>> As noted, I think this time would be negligible compared to disk I/O
>>> and network transit times in the total RPC response time?
>>
>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>> network connectivity, spinning on a contended mutex takes a
>> significant amount of CPU time. (For the current design of the NFS
>> server, it may actually be a win to turn off adaptive mutexes -- I
>> should give that a try once I'm able to do more testing.)
>>
> Have fun with it. Let me know when you have what you think is a good patch.
>
> rick
>
>> -GAWollman

I was doing some NFS testing with a RELENG_9 machine and a Linux RHEL
machine over a 10G network, and noticed the same nfsd threads issue.
Previously I would read a 32G file locally on the FreeBSD ZFS/NFS server
with "dd if=/tank/32G.bin of=/dev/null bs=1M" to cache it completely in
ARC (the machine has 196G RAM); if I then do this again locally, I get
close to 4GB/sec reads, entirely from the cache. But if I try to read
the file over NFS from the Linux machine, I only get about 100MB/sec,
sometimes a bit more, and all of the nfsd threads are clearly visible in
top. pmcstat also showed the same mutex contention as in the original
post.

I've now applied the drc2 patch, and rerunning the same test yields
about 960MB/s transfer over NFS… quite an improvement!