Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 2 Oct 2012 00:21:18 -0400
From:      Garrett Wollman <wollman@freebsd.org>
To:        hackers@freebsd.org
Cc:        rmacklem@freebsd.org, jkoshy@freebsd.org
Subject:   NFS server bottlenecks
Message-ID:  <20586.27582.478147.838896@hergotha.csail.mit.edu>

next in thread | raw e-mail | index | archive | help
I had an email conversation with Rick Macklem about six months ago
about NFS server bottlenecks.  I'm now in a position to observe my
large-scale NFS server under an actual production load, so I thought I
would update folks on what it looks like.  This is a 9.1 prerelease
kernel (I hope 9.1 will be released soon as I have four moe of these
servers to deploy!).  When under nearly 100% load on an 8-core
(16-thread) Quanta QSSC-S99Q storage server, with a 10G network
interface, pmcstat tells me this:

PMC: [INST_RETIRED.ANY_P] Samples: 2727105 (100.0%) , 27 unresolved
Key: q => exiting...
%SAMP IMAGE      FUNCTION             CALLERS
 29.3 kernel     _mtx_lock_sleep      nfsrvd_updatecache:10.0 nfsrvd_getcache:7.4 ...
  9.5 kernel     cpu_search_highest   cpu_search_highest:8.1 sched_idletd:1.4
  7.4 zfs.ko     lzjb_decompress      zio_decompress
  4.3 kernel     _mtx_lock_spin       turnstile_trywait:2.2 pmclog_reserve:1.0 ...
  4.0 zfs.ko     fletcher_4_native    zio_checksum_error:3.1 zio_checksum_compute:0.8
  3.6 kernel     cpu_search_lowest    cpu_search_lowest
  3.3 kernel     nfsrc_trimcache      nfsrvd_getcache:1.6 nfsrvd_updatecache:1.6
  2.3 kernel     ipfw_chk             ipfw_check_hook
  2.1 pmcstat    _init
  1.1 kernel     _sx_xunlock
  0.9 kernel     _sx_xlock
  0.9 kernel     spinlock_exit

This does seem to confirm my original impression that the NFS replay
cache is quite expensive.  Running a gprof(1) analysis on the same PMC
data reveals a bit more detail (I've removed some uninteresting parts
of the call graph):


                                  called/total       parents 
index  %time    self descendents  called+self    name           index
                                  called/total       children
             4881.00  2004642.70  932627/932627      svc_run_internal [2]
[4]     45.1 4881.00  2004642.70  932627         nfssvc_program [4]
             13199.00   504436.33  584319/584319      nfsrvd_updatecache [9]
             23075.00   403396.18  468009/468009      nfsrvd_getcache [14]
             1032.25   416249.44    2239/2284        svc_sendreply_mbuf [15]
             6168.00   381770.44   11618/11618       nfsrvd_dorpc [24]
             3526.87    86869.88  112478/112514      nfsrvd_sentcache [74]
              890.00    50540.89    4252/4252        svc_getcred [101]
             14876.60    32394.26    4177/24500       crfree <cycle 3> [263]
             11550.11    25150.73    3243/24500       free <cycle 3> [102]
             1348.88    15451.66    2716/16831       m_freem [59]
             4066.61      216.81    1434/1456        svc_freereq [321]
             2342.15      677.40     557/1459        malloc_type_freed [265]
               59.14     1916.84     134/2941        crget [113]
             1602.25        0.00     322/9682        bzero [105]
              690.93        0.00      43/44          getmicrotime [571]
              287.22        7.33     138/1205        prison_free [384]
              233.61        0.00      60/798         PHYS_TO_VM_PAGE [358]
              203.12        0.00      94/230         nfsrv_mallocmget_limit [632]
              151.76        0.00      51/1723        pmap_kextract [309]
                0.78       70.28       9/3281        _mtx_unlock_sleep [154]
               19.22       16.88      38/400403      nfsrc_trimcache [26]
               11.05       21.74       7/197         crsetgroups [532]
               30.37        0.00      11/6592        critical_enter [190]
               25.50        0.00       9/36          turnstile_chain_unlock [844]
               24.86        0.00       3/7           nfsd_errmap [913]
               12.36        8.57       8/2145        in_cksum_skip [298]
                9.10        3.59       5/12455       mb_free_ext [140]
                1.84        4.85       2/2202        VOP_UNLOCK_APV [269]

-----------------------------------------------

                0.49        0.15       1/1129009     uhub_explore [1581]
                0.49        0.15       1/1129009     tcp_output [10]
                0.49        0.15       1/1129009     pmap_remove_all [1141]
                0.49        0.15       1/1129009     vm_map_insert [236]
                0.49        0.15       1/1129009     vnode_create_vobject [281]
                0.49        0.15       1/1129009     biodone [351]
                0.49        0.15       1/1129009     vm_object_madvise [670]
                0.49        0.15       1/1129009     xpt_done [483]
                0.49        0.15       1/1129009     vputx [80]
                0.49        0.15       1/1129009     vm_map_delete <cycle 3> [49]
                0.49        0.15       1/1129009     vm_object_deallocate <cycle 3> [356]
                0.49        0.15       1/1129009     vm_page_unwire [338]
                0.49        0.15       1/1129009     pmap_change_wiring [318]
                0.98        0.31       2/1129009     getnewvnode [227]
                0.98        0.31       2/1129009     pmap_clear_reference [1004]
                0.98        0.31       2/1129009     usbd_do_request_flags [1282]
                0.98        0.31       2/1129009     vm_object_collapse <cycle 3> [587]
                0.98        0.31       2/1129009     vm_object_page_remove [122]
                1.48        0.46       3/1129009     mpt_pci_intr [487]
                1.48        0.46       3/1129009     pmap_extract [355]
                1.48        0.46       3/1129009     vm_fault_unwire [171]
                1.97        0.62       4/1129009     vgonel [270]
                1.97        0.62       4/1129009     vm_object_shadow [926]
                1.97        0.62       4/1129009     zone_alloc_item [434]
                2.46        0.77       5/1129009     vnlru_free [235]
                2.46        0.77       5/1129009     insmntque1 [737]
                2.95        0.93       6/1129009     zone_free_item [409]
                3.94        1.24       8/1129009     pmap_enter_object [805]
                4.92        1.55      10/1129009     vfs_ref [722]
                4.92        1.55      10/1129009     pmap_clear_modify [814]
                5.41        1.70      11/1129009     vn_isdisk [391]
                5.90        1.86      12/1129009     rtalloc1_fib [158]
                5.90        1.86      12/1129009     soreceive_generic [99]
                6.40        2.01      13/1129009     pmap_remove_pages [403]
                6.89        2.16      14/1129009     _mtx_lock_flags [565]
                6.89        2.16      14/1129009     random_yarrow_read [542]
                6.89        2.16      14/1129009     nfsrv_checkgetattr [1133]
                7.38        2.32      15/1129009     svc_vc_recv [55]
                7.38        2.32      15/1129009     ixgbe_rxeof [12]
                8.85        2.78      18/1129009     _cv_timedwait_sig [859]
                9.35        2.94      19/1129009     vref [242]
               10.82        3.40      22/1129009     pmap_enter_quick [690]
               11.31        3.56      23/1129009     tcp_do_segment [35]
               13.28        4.18      27/1129009     sosend_generic [25]
               15.74        4.95      32/1129009     g_io_schedule_down [76]
               19.19        6.03      39/1129009     vn_start_write [400]
               20.66        6.49      42/1129009     ip_output [32]
               23.12        7.27      47/1129009     pmap_copy [467]
               23.12        7.27      47/1129009     __mnt_vnode_next_all [520]
               29.52        9.28      60/1129009     vm_page_free_toq [149]
               30.50        9.59      62/1129009     vm_fault_hold [268]
               36.90       11.60      75/1129009     g_io_request [172]
               38.37       12.06      78/1129009     prison_hold [330]
               38.86       12.22      79/1129009     taskqueue_member [423]
               39.36       12.37      80/1129009     vm_page_deactivate [812]
               39.85       12.53      81/1129009     vm_page_activate [790]
               40.83       12.83      83/1129009     in_matroute [197]
               45.75       14.38      93/1129009     vm_page_alloc [128]
               48.70       15.31      99/1129009     prison_free [384]
               66.90       21.03     136/1129009     vholdl [275]
               73.30       23.04     149/1129009     vfs_unbusy [470]
               82.65       25.98     168/1129009     key_addref [233]
               85.60       26.91     174/1129009     vn_finished_write [407]
               91.50       28.76     186/1129009     _key_freesp [286]
               92.48       29.07     188/1129009     vfs_busy [288]
              102.32       32.16     208/1129009     _vm_object_allocate [738]
              107.24       33.71     218/1129009     taskqueue_enqueue [68]
              112.16       35.26     228/1129009     xprt_active [267]
              125.94       39.59     256/1129009     vm_pageq_remove [650]
              128.89       40.51     262/1129009     svc_freereq [321]
              128.89       40.51     262/1129009     pmap_remove [106]
              153.49       48.25     312/1129009     vdropl [251]
              154.96       48.71     315/1129009     vfs_busyfs [239]
              180.05       56.60     366/1129009     pmap_enter [245]
              182.02       57.22     370/1129009     svc_run_internal [2]
              213.50       67.11     434/1129009     ixgbe_msix_que [11]
              248.43       78.09     505/1129009     key_allocsp [221]
              255.32       80.26     519/1129009     camisr [107]
              373.87      117.52     760/1129009     ipmi_poll [635]
              404.87      127.27     823/1129009     g_io_deliver [162]
              440.78      138.56     896/1129009     g_io_schedule_up [137]
              462.91      145.51     941/1129009     taskqueue_run_locked [43]
              744.30      233.97    1513/1129009     _sleep [150]
              906.15      284.84    1842/1129009     dastrategy [116]
             1170.32      367.88    2379/1129009     mps_intr_msi [194]
             1329.22      417.83    2702/1129009     ixgbe_handle_que [56]
             15078.43     4739.79   30651/1129009     nfsrc_unlock [174]
             25869.62     8131.91   52587/1129009     _vm_map_lock [129]
             44082.69    13857.05   89610/1129009     nfsrvd_sentcache [74]
             113077.19    35544.94  229860/1129009     nfsrc_trimcache [26]
             149533.34    47004.65  303967/1129009     nfsrvd_getcache [14]
             198698.26    62459.27  403908/1129009     nfsrvd_updatecache [9]
[5]     16.4 555404.00   174586.97 1129009         _mtx_lock_sleep [5]
             9074.71   139536.06   87523/87603       turnstile_trywait [45]
             3085.07    11049.12    3870/3964        turnstile_wait [207]
              166.20    10020.76    4390/4728        lockstat_nsecs [229]
              505.86      276.39     322/324         turnstile_cancel [552]
              516.23      123.24     384/341743      spinlock_exit <cycle 1> [136]
              161.51        0.00      61/9744        spinlock_enter [156]
               44.18        0.00      16/6592        critical_enter [190]
               23.20        0.00       4/35          turnstile_setowner [767]
                1.35        3.10       1/1531        propagate_priority [280]

-----------------------------------------------

             13199.00   504436.33  584319/584319      nfssvc_program [4]
[9]     11.6 13199.00   504436.33  584319         nfsrvd_updatecache [9]
             198698.26    62459.27  403908/1129009     _mtx_lock_sleep [5]
             107453.19    94387.86  212423/400403      nfsrc_trimcache [26]
             1974.56    16386.18    2411/3646        nfsrc_freecache [148]
              120.15    10877.79    1393/3281        _mtx_unlock_sleep [154]
              139.67    10006.86       2/12          lapic_handle_intr [93]
              257.17      946.79     431/6821        m_copym [186]
              539.54        0.00     124/26922       bcopy [53]
                2.07       31.89      20/87603       turnstile_trywait [45]
               10.68       23.27       3/24500       free <cycle 3> [102]
               25.54        6.10      19/341743      spinlock_exit <cycle 1> [136]
               28.33        0.00      10/36          turnstile_chain_unlock [844]
               13.58       10.56       8/30514       uma_zalloc_arg [70]
               18.35        0.00       7/37          nfsrc_wanted [851]
               10.90        0.83       5/284         turnstile_lookup [582]
                0.11        6.85       3/4728        lockstat_nsecs [229]

-----------------------------------------------

             23075.00   403396.18  468009/468009      nfssvc_program [4]
[14]     9.6 23075.00   403396.18  468009         nfsrvd_getcache [14]
             149533.34    47004.65  303967/1129009     _mtx_lock_sleep [5]
             95069.59    83510.00  187942/400403      nfsrc_trimcache [26]
             1104.58    13395.38     993/8846        malloc [47]
              117.22    10612.28    1359/3281        _mtx_unlock_sleep [154]
             1570.18     1088.69    1016/2145        in_cksum_skip [298]
              134.35        0.00      27/9682        bzero [105]
              127.38        0.00      46/1196        in_cksumdata [347]
               28.23        6.74      21/341743      spinlock_exit <cycle 1> [136]
                1.87       28.70      18/87603       turnstile_trywait [45]
               25.50        0.00       9/36          turnstile_chain_unlock [844]
               11.88        9.24       7/30514       uma_zalloc_arg [70]
               13.08        0.99       6/284         turnstile_lookup [582]
                0.04        2.28       1/4728        lockstat_nsecs [229]

-----------------------------------------------

               19.22       16.88      38/400403      nfssvc_program [4]
             95069.59    83510.00  187942/400403      nfsrvd_getcache [14]
             107453.19    94387.86  212423/400403      nfsrvd_updatecache [9]
[26]     8.5 202542.00   177914.74  400403         nfsrc_trimcache [26]
             113077.19    35544.94  229860/1129009     _mtx_lock_sleep [5]
              657.00    19128.82    5754/5754        nfsrv_checksockseqnum [183]
             1011.44     8393.58    1235/3646        nfsrc_freecache [148]
               14.25       31.02       4/24500       free <cycle 3> [102]
                2.48       28.45       5/16831       m_freem [59]
                0.93       14.35       9/87603       turnstile_trywait [45]
                7.86        0.00       3/37          nfsrc_wanted [851]
                1.57        0.86       1/324         turnstile_cancel [552]

-----------------------------------------------

The full gprof output for the kernel is at
<http://people.csail.mit.edu/wollman/nfs-server.unhalted-core-cycles.txt>.
An amazing visualization of this (thanks to gprof2dot.py and dot) is at
<http://people.csail.mit.edu/wollman/nfs-server.unhalted-core-cycles.png>.
There are some functions missing from the visualization because the
gprof buckets overflowed.

Is there a documented way to merge the kernel and ZFS gprof data
output by pmcstat(1) so that those functions will be properly
accounted for?

-GAWollman



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20586.27582.478147.838896>