Date:      Tue, 2 Oct 2012 08:28:29 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Garrett Wollman <wollman@freebsd.org>
Cc:        rmacklem@freebsd.org, hackers@freebsd.org, jkoshy@freebsd.org
Subject:   Re: NFS server bottlenecks
Message-ID:  <499414315.1544891.1349180909058.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20586.27582.478147.838896@hergotha.csail.mit.edu>

Garrett Wollman wrote:
> I had an email conversation with Rick Macklem about six months ago
> about NFS server bottlenecks. I'm now in a position to observe my
> large-scale NFS server under an actual production load, so I thought I
> would update folks on what it looks like. This is a 9.1 prerelease
> kernel (I hope 9.1 will be released soon as I have four more of these
> servers to deploy!). When under nearly 100% load on an 8-core
> (16-thread) Quanta QSSC-S99Q storage server, with a 10G network
> interface, pmcstat tells me this:
> 
> PMC: [INST_RETIRED.ANY_P] Samples: 2727105 (100.0%) , 27 unresolved
> Key: q => exiting...
> %SAMP IMAGE FUNCTION CALLERS
> 29.3 kernel _mtx_lock_sleep nfsrvd_updatecache:10.0
> nfsrvd_getcache:7.4 ...
> 9.5 kernel cpu_search_highest cpu_search_highest:8.1 sched_idletd:1.4
> 7.4 zfs.ko lzjb_decompress zio_decompress
> 4.3 kernel _mtx_lock_spin turnstile_trywait:2.2 pmclog_reserve:1.0 ...
> 4.0 zfs.ko fletcher_4_native zio_checksum_error:3.1
> zio_checksum_compute:0.8
> 3.6 kernel cpu_search_lowest cpu_search_lowest
> 3.3 kernel nfsrc_trimcache nfsrvd_getcache:1.6 nfsrvd_updatecache:1.6
> 2.3 kernel ipfw_chk ipfw_check_hook
> 2.1 pmcstat _init
> 1.1 kernel _sx_xunlock
> 0.9 kernel _sx_xlock
> 0.9 kernel spinlock_exit
> 
> This does seem to confirm my original impression that the NFS replay
> cache is quite expensive. Running a gprof(1) analysis on the same PMC
> data reveals a bit more detail (I've removed some uninteresting parts
> of the call graph):
> 
I can't remember (I am early retired now;-) if I mentioned this patch before:
  http://people.freebsd.org/~rmacklem/drc.patch
It adds tunables vfs.nfsd.tcphighwater and vfs.nfsd.udphighwater that can
be twiddled so that the DRC (duplicate request cache) is trimmed less
frequently. If you make these values larger, the trim only happens once/sec
until the high water mark is reached, instead of on every RPC. The tradeoff
is that the DRC will grow larger, but given memory sizes these days, that
may be fine for you.
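
In rough terms, the mechanism looks something like the sketch below. The
identifier names are just mine for illustration and aren't necessarily what
the patch actually uses:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>
#include <sys/systm.h>
#include <sys/time.h>

SYSCTL_DECL(_vfs_nfsd);

static u_int nfsrc_tcphighwater = 0;
SYSCTL_UINT(_vfs_nfsd, OID_AUTO, tcphighwater, CTLFLAG_RW,
    &nfsrc_tcphighwater, 0,
    "TCP DRC entries allowed before trimming reverts to every RPC");

static u_int nfsrc_tcpcachesize;	/* current count of TCP DRC entries */

/*
 * Decide whether this call should trim the cache.  While the cache is
 * below the high water mark, say yes at most once per second; once it
 * grows past the mark, go back to trimming on every RPC as before.
 */
static int
nfsrc_needtrim(void)
{
	static time_t lasttrim;

	if (nfsrc_tcpcachesize < nfsrc_tcphighwater &&
	    lasttrim == time_uptime)
		return (0);
	lasttrim = time_uptime;
	return (1);
}

You would then bump vfs.nfsd.tcphighwater (and udphighwater) with sysctl(8)
to something large enough that the once/sec path is the common case.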

jwd@ was going to test it, but he moved to a different job away from NFS, so
the patch has just been collecting dust.

If you could test it, that would be nice, rick
ps: Also, the current patch still locks before checking if it needs to do the
    trim. I think that could safely be changed so that it doesn't lock/unlock
    when it isn't doing the trim, if that makes a significant difference.
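
    The change I have in mind is roughly the shape below, continuing from the
    sketch above (again placeholder names; NFSLOCKCACHE()/NFSUNLOCKCACHE()
    just stand in for the DRC mutex):

    static void
    nfsrc_trimcache(/* ... */)
    {
    	if (!nfsrc_needtrim())
    		return;		/* common case: no lock/unlock at all */

    	NFSLOCKCACHE();
    	/* ... free stale and excess cache entries under the lock ... */
    	NFSUNLOCKCACHE();
    }

    An unlocked, possibly stale read of the cache size in nfsrc_needtrim()
    only shifts a trim by one call, which should be harmless.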

> 
> called/total parents
> index %time self descendents called+self name index
> called/total children
> 4881.00 2004642.70 932627/932627 svc_run_internal [2]
> [4] 45.1 4881.00 2004642.70 932627 nfssvc_program [4]
> 13199.00 504436.33 584319/584319 nfsrvd_updatecache [9]
> 23075.00 403396.18 468009/468009 nfsrvd_getcache [14]
> 1032.25 416249.44 2239/2284 svc_sendreply_mbuf [15]
> 6168.00 381770.44 11618/11618 nfsrvd_dorpc [24]
> 3526.87 86869.88 112478/112514 nfsrvd_sentcache [74]
> 890.00 50540.89 4252/4252 svc_getcred [101]
> 14876.60 32394.26 4177/24500 crfree <cycle 3> [263]
> 11550.11 25150.73 3243/24500 free <cycle 3> [102]
> 1348.88 15451.66 2716/16831 m_freem [59]
> 4066.61 216.81 1434/1456 svc_freereq [321]
> 2342.15 677.40 557/1459 malloc_type_freed [265]
> 59.14 1916.84 134/2941 crget [113]
> 1602.25 0.00 322/9682 bzero [105]
> 690.93 0.00 43/44 getmicrotime [571]
> 287.22 7.33 138/1205 prison_free [384]
> 233.61 0.00 60/798 PHYS_TO_VM_PAGE [358]
> 203.12 0.00 94/230 nfsrv_mallocmget_limit [632]
> 151.76 0.00 51/1723 pmap_kextract [309]
> 0.78 70.28 9/3281 _mtx_unlock_sleep [154]
> 19.22 16.88 38/400403 nfsrc_trimcache [26]
> 11.05 21.74 7/197 crsetgroups [532]
> 30.37 0.00 11/6592 critical_enter [190]
> 25.50 0.00 9/36 turnstile_chain_unlock [844]
> 24.86 0.00 3/7 nfsd_errmap [913]
> 12.36 8.57 8/2145 in_cksum_skip [298]
> 9.10 3.59 5/12455 mb_free_ext [140]
> 1.84 4.85 2/2202 VOP_UNLOCK_APV [269]
> 
> -----------------------------------------------
> 
> 0.49 0.15 1/1129009 uhub_explore [1581]
> 0.49 0.15 1/1129009 tcp_output [10]
> 0.49 0.15 1/1129009 pmap_remove_all [1141]
> 0.49 0.15 1/1129009 vm_map_insert [236]
> 0.49 0.15 1/1129009 vnode_create_vobject [281]
> 0.49 0.15 1/1129009 biodone [351]
> 0.49 0.15 1/1129009 vm_object_madvise [670]
> 0.49 0.15 1/1129009 xpt_done [483]
> 0.49 0.15 1/1129009 vputx [80]
> 0.49 0.15 1/1129009 vm_map_delete <cycle 3> [49]
> 0.49 0.15 1/1129009 vm_object_deallocate <cycle 3> [356]
> 0.49 0.15 1/1129009 vm_page_unwire [338]
> 0.49 0.15 1/1129009 pmap_change_wiring [318]
> 0.98 0.31 2/1129009 getnewvnode [227]
> 0.98 0.31 2/1129009 pmap_clear_reference [1004]
> 0.98 0.31 2/1129009 usbd_do_request_flags [1282]
> 0.98 0.31 2/1129009 vm_object_collapse <cycle 3> [587]
> 0.98 0.31 2/1129009 vm_object_page_remove [122]
> 1.48 0.46 3/1129009 mpt_pci_intr [487]
> 1.48 0.46 3/1129009 pmap_extract [355]
> 1.48 0.46 3/1129009 vm_fault_unwire [171]
> 1.97 0.62 4/1129009 vgonel [270]
> 1.97 0.62 4/1129009 vm_object_shadow [926]
> 1.97 0.62 4/1129009 zone_alloc_item [434]
> 2.46 0.77 5/1129009 vnlru_free [235]
> 2.46 0.77 5/1129009 insmntque1 [737]
> 2.95 0.93 6/1129009 zone_free_item [409]
> 3.94 1.24 8/1129009 pmap_enter_object [805]
> 4.92 1.55 10/1129009 vfs_ref [722]
> 4.92 1.55 10/1129009 pmap_clear_modify [814]
> 5.41 1.70 11/1129009 vn_isdisk [391]
> 5.90 1.86 12/1129009 rtalloc1_fib [158]
> 5.90 1.86 12/1129009 soreceive_generic [99]
> 6.40 2.01 13/1129009 pmap_remove_pages [403]
> 6.89 2.16 14/1129009 _mtx_lock_flags [565]
> 6.89 2.16 14/1129009 random_yarrow_read [542]
> 6.89 2.16 14/1129009 nfsrv_checkgetattr [1133]
> 7.38 2.32 15/1129009 svc_vc_recv [55]
> 7.38 2.32 15/1129009 ixgbe_rxeof [12]
> 8.85 2.78 18/1129009 _cv_timedwait_sig [859]
> 9.35 2.94 19/1129009 vref [242]
> 10.82 3.40 22/1129009 pmap_enter_quick [690]
> 11.31 3.56 23/1129009 tcp_do_segment [35]
> 13.28 4.18 27/1129009 sosend_generic [25]
> 15.74 4.95 32/1129009 g_io_schedule_down [76]
> 19.19 6.03 39/1129009 vn_start_write [400]
> 20.66 6.49 42/1129009 ip_output [32]
> 23.12 7.27 47/1129009 pmap_copy [467]
> 23.12 7.27 47/1129009 __mnt_vnode_next_all [520]
> 29.52 9.28 60/1129009 vm_page_free_toq [149]
> 30.50 9.59 62/1129009 vm_fault_hold [268]
> 36.90 11.60 75/1129009 g_io_request [172]
> 38.37 12.06 78/1129009 prison_hold [330]
> 38.86 12.22 79/1129009 taskqueue_member [423]
> 39.36 12.37 80/1129009 vm_page_deactivate [812]
> 39.85 12.53 81/1129009 vm_page_activate [790]
> 40.83 12.83 83/1129009 in_matroute [197]
> 45.75 14.38 93/1129009 vm_page_alloc [128]
> 48.70 15.31 99/1129009 prison_free [384]
> 66.90 21.03 136/1129009 vholdl [275]
> 73.30 23.04 149/1129009 vfs_unbusy [470]
> 82.65 25.98 168/1129009 key_addref [233]
> 85.60 26.91 174/1129009 vn_finished_write [407]
> 91.50 28.76 186/1129009 _key_freesp [286]
> 92.48 29.07 188/1129009 vfs_busy [288]
> 102.32 32.16 208/1129009 _vm_object_allocate [738]
> 107.24 33.71 218/1129009 taskqueue_enqueue [68]
> 112.16 35.26 228/1129009 xprt_active [267]
> 125.94 39.59 256/1129009 vm_pageq_remove [650]
> 128.89 40.51 262/1129009 svc_freereq [321]
> 128.89 40.51 262/1129009 pmap_remove [106]
> 153.49 48.25 312/1129009 vdropl [251]
> 154.96 48.71 315/1129009 vfs_busyfs [239]
> 180.05 56.60 366/1129009 pmap_enter [245]
> 182.02 57.22 370/1129009 svc_run_internal [2]
> 213.50 67.11 434/1129009 ixgbe_msix_que [11]
> 248.43 78.09 505/1129009 key_allocsp [221]
> 255.32 80.26 519/1129009 camisr [107]
> 373.87 117.52 760/1129009 ipmi_poll [635]
> 404.87 127.27 823/1129009 g_io_deliver [162]
> 440.78 138.56 896/1129009 g_io_schedule_up [137]
> 462.91 145.51 941/1129009 taskqueue_run_locked [43]
> 744.30 233.97 1513/1129009 _sleep [150]
> 906.15 284.84 1842/1129009 dastrategy [116]
> 1170.32 367.88 2379/1129009 mps_intr_msi [194]
> 1329.22 417.83 2702/1129009 ixgbe_handle_que [56]
> 15078.43 4739.79 30651/1129009 nfsrc_unlock [174]
> 25869.62 8131.91 52587/1129009 _vm_map_lock [129]
> 44082.69 13857.05 89610/1129009 nfsrvd_sentcache [74]
> 113077.19 35544.94 229860/1129009 nfsrc_trimcache [26]
> 149533.34 47004.65 303967/1129009 nfsrvd_getcache [14]
> 198698.26 62459.27 403908/1129009 nfsrvd_updatecache [9]
> [5] 16.4 555404.00 174586.97 1129009 _mtx_lock_sleep [5]
> 9074.71 139536.06 87523/87603 turnstile_trywait [45]
> 3085.07 11049.12 3870/3964 turnstile_wait [207]
> 166.20 10020.76 4390/4728 lockstat_nsecs [229]
> 505.86 276.39 322/324 turnstile_cancel [552]
> 516.23 123.24 384/341743 spinlock_exit <cycle 1> [136]
> 161.51 0.00 61/9744 spinlock_enter [156]
> 44.18 0.00 16/6592 critical_enter [190]
> 23.20 0.00 4/35 turnstile_setowner [767]
> 1.35 3.10 1/1531 propagate_priority [280]
> 
> -----------------------------------------------
> 
> 13199.00 504436.33 584319/584319 nfssvc_program [4]
> [9] 11.6 13199.00 504436.33 584319 nfsrvd_updatecache [9]
> 198698.26 62459.27 403908/1129009 _mtx_lock_sleep [5]
> 107453.19 94387.86 212423/400403 nfsrc_trimcache [26]
> 1974.56 16386.18 2411/3646 nfsrc_freecache [148]
> 120.15 10877.79 1393/3281 _mtx_unlock_sleep [154]
> 139.67 10006.86 2/12 lapic_handle_intr [93]
> 257.17 946.79 431/6821 m_copym [186]
> 539.54 0.00 124/26922 bcopy [53]
> 2.07 31.89 20/87603 turnstile_trywait [45]
> 10.68 23.27 3/24500 free <cycle 3> [102]
> 25.54 6.10 19/341743 spinlock_exit <cycle 1> [136]
> 28.33 0.00 10/36 turnstile_chain_unlock [844]
> 13.58 10.56 8/30514 uma_zalloc_arg [70]
> 18.35 0.00 7/37 nfsrc_wanted [851]
> 10.90 0.83 5/284 turnstile_lookup [582]
> 0.11 6.85 3/4728 lockstat_nsecs [229]
> 
> -----------------------------------------------
> 
> 23075.00 403396.18 468009/468009 nfssvc_program [4]
> [14] 9.6 23075.00 403396.18 468009 nfsrvd_getcache [14]
> 149533.34 47004.65 303967/1129009 _mtx_lock_sleep [5]
> 95069.59 83510.00 187942/400403 nfsrc_trimcache [26]
> 1104.58 13395.38 993/8846 malloc [47]
> 117.22 10612.28 1359/3281 _mtx_unlock_sleep [154]
> 1570.18 1088.69 1016/2145 in_cksum_skip [298]
> 134.35 0.00 27/9682 bzero [105]
> 127.38 0.00 46/1196 in_cksumdata [347]
> 28.23 6.74 21/341743 spinlock_exit <cycle 1> [136]
> 1.87 28.70 18/87603 turnstile_trywait [45]
> 25.50 0.00 9/36 turnstile_chain_unlock [844]
> 11.88 9.24 7/30514 uma_zalloc_arg [70]
> 13.08 0.99 6/284 turnstile_lookup [582]
> 0.04 2.28 1/4728 lockstat_nsecs [229]
> 
> -----------------------------------------------
> 
> 19.22 16.88 38/400403 nfssvc_program [4]
> 95069.59 83510.00 187942/400403 nfsrvd_getcache [14]
> 107453.19 94387.86 212423/400403 nfsrvd_updatecache [9]
> [26] 8.5 202542.00 177914.74 400403 nfsrc_trimcache [26]
> 113077.19 35544.94 229860/1129009 _mtx_lock_sleep [5]
> 657.00 19128.82 5754/5754 nfsrv_checksockseqnum [183]
> 1011.44 8393.58 1235/3646 nfsrc_freecache [148]
> 14.25 31.02 4/24500 free <cycle 3> [102]
> 2.48 28.45 5/16831 m_freem [59]
> 0.93 14.35 9/87603 turnstile_trywait [45]
> 7.86 0.00 3/37 nfsrc_wanted [851]
> 1.57 0.86 1/324 turnstile_cancel [552]
> 
> -----------------------------------------------
> 
> The full gprof output for the kernel is at
> <http://people.csail.mit.edu/wollman/nfs-server.unhalted-core-cycles.txt>.
> An amazing visualization of this (thanks to gprof2dot.py and dot) is
> at
> <http://people.csail.mit.edu/wollman/nfs-server.unhalted-core-cycles.png>.
> There are some functions missing from the visualization because the
> gprof buckets overflowed.
> 
> Is there a documented way to merge the kernel and ZFS gprof data
> output by pmcstat(1) so that those functions will be properly
> accounted for?
> 
> -GAWollman


