From: Kevin Day <toasty@dragondata.com>
To: Peter Jeremy
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Subject: Re: Improving ZFS performance for large directories
Date: Tue, 19 Feb 2013 14:10:47 -0600
Message-Id: <19E0C908-79F1-43F8-899C-6B60F998D4A5@dragondata.com>
In-Reply-To: <20130201192416.GA76461@server.rulingia.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com>
 <20130201192416.GA76461@server.rulingia.com>

Sorry for the late followup, I've been doing some testing with an L2ARC device.

>> Doing it twice back-to-back makes a bit of difference but it's still
>> slow either way.
>
> ZFS can be very conservative about caching data and twice might not be
> enough.  I suggest you try 8-10 times, or until the time stops reducing.
>

Timing an "ls" in large directories 20 times, the first run is the slowest and all subsequent listings are roughly the same. There doesn't appear to be any further gain after 20 repetitions.

>> I think some of the issue is that nothing is being allowed to stay
>> cached long.
>
> Well ZFS doesn't do any time-based eviction so if things aren't
> staying in the cache, it's because they are being evicted by things
> that ZFS considers more deserving.
>
> Looking at the zfs-stats you posted, it looks like your workload has
> very low locality of reference (the data hit rate is very low).  If
> this is not what you expect then you need more RAM.  OTOH, your
> vfs.zfs.arc_meta_used being above vfs.zfs.arc_meta_limit suggests that
> ZFS really wants to cache more metadata (by default ZFS has a 25%
> metadata, 75% data split in ARC to prevent metadata caching starving
> data caching).  I would go even further than the 50:50 split suggested
> later and try 75:25 (ie, triple the current vfs.zfs.arc_meta_limit).
>
> Note that if there is basically no locality of reference in your
> workload (as I suspect), you can even turn off data caching for
> specific filesystems with "zfs set primarycache=metadata tank/foo"
> (note that you still need to increase vfs.zfs.arc_meta_limit to
> allow ZFS to use the ARC to cache metadata).

Now that I've got an L2ARC device (250GB), I've been doing some experimenting. With the defaults (primarycache and secondarycache set to "all"), I really didn't see much improvement. The SSD filled itself pretty quickly, but its hit rate was around 1%, even after 48 hours.

Thinking that making the primary cache metadata-only and leaving the secondary cache at "all" would improve things, I wiped the device (SATA secure erase to make sure) and tried again. This was much worse: I'm guessing because some amount of real file data was being read frequently, the SSD was getting hammered for reads at 100% utilization, and everything was far slower.

I wiped the SSD and tried again with primarycache=all, secondarycache=metadata, and things have improved. Even after boosting vfs.zfs.l2arc_write_max it took quite a while before things stabilized. I'm guessing there isn't a huge amount of data, but locality is so poor and sweeping the entire filesystem takes so long that it takes a while before ZFS decides what's worth caching. After about 20 hours in this configuration, it makes a HUGE difference in directory speeds, though. Before adding the SSD, an "ls" in a directory with 65k files would take 10-30 seconds; it's now down to about 0.2 seconds.

So I'm guessing the theory was right: there was more metadata than would fit in the ARC, so it was constantly churning. I'm a bit surprised that continually doing an "ls" in a big directory didn't make it stick better, but these filesystems are HUGE, so there may be some inefficiencies happening here. There are roughly 29M files, growing at about 50k files/day. We recently upgraded and are now at 96 3TB drives in the pool.

What I also find surprising is this:

L2 ARC Size: (Adaptive)                         22.70   GiB
        Header Size:                    0.31%   71.49   MiB

L2 ARC Breakdown:                               23.77m
        Hit Ratio:                      34.26%  8.14m
        Miss Ratio:                     65.74%  15.62m
        Feeds:                                  63.28k

It's a 250G drive and only 22G is being used, yet there's still a ~66% miss rate. Is there any way to tell why more metadata isn't being pushed to the L2ARC? I see a pretty high count for "Passed Headroom" and "Tried Lock Failures", but I'm not sure if that's normal.
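For reference, the knobs involved look roughly like this (the dataset name is just an example; the l2arc values are the ones from the sysctl dump below, and the arc_meta_limit value is simply Peter's "triple the current limit" applied to the number in that dump):

    # Peter's suggestion: cache only metadata in the ARC for a filesystem
    zfs set primarycache=metadata tank/foo

    # what I'm running now: everything in the ARC, metadata only on the SSD
    zfs set primarycache=all tank/foo
    zfs set secondarycache=metadata tank/foo

    # give metadata a bigger share of the ARC (bytes; tripling the current
    # 16398159872 gives 49194479616) and let the L2ARC fill faster
    sysctl vfs.zfs.arc_meta_limit=49194479616
    sysctl vfs.zfs.l2arc_write_max=26214400
    sysctl vfs.zfs.l2arc_write_boost=52428800

If any of those sysctls turn out to be read-only on a given kernel, the same names can be set from /boot/loader.conf at boot instead.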
I'm including the lengthy output of zfs-stats below in case anyone sees something that stands out as unusual.

------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Feb 19 20:08:19 2013
------------------------------------------------------------------------

System Information:

        Kernel Version:                         901000 (osreldate)
        Hardware Platform:                      amd64
        Processor Architecture:                 amd64

        ZFS Storage pool Version:               28
        ZFS Filesystem Version:                 5

FreeBSD 9.1-RC2 #1: Tue Oct 30 20:37:38 UTC 2012 root
 8:08PM  up 20:40, 3 users, load averages: 0.47, 0.50, 0.52

------------------------------------------------------------------------

System Memory:

        8.41%   5.22    GiB Active,     10.18%  6.32    GiB Inact
        77.39%  48.05   GiB Wired,      1.52%   966.99  MiB Cache
        2.50%   1.55    GiB Free,       0.00%   888.00  KiB Gap

        Real Installed:                         64.00   GiB
        Real Available:                 99.97%  63.98   GiB
        Real Managed:                   97.04%  62.08   GiB

        Logical Total:                          64.00   GiB
        Logical Used:                   86.22%  55.18   GiB
        Logical Free:                   13.78%  8.82    GiB

Kernel Memory:                                  23.18   GiB
        Data:                           99.91%  23.16   GiB
        Text:                           0.09%   21.27   MiB

Kernel Memory Map:                              52.10   GiB
        Size:                           35.21%  18.35   GiB
        Free:                           64.79%  33.75   GiB

------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                10.24m
        Recycle Misses:                         3.48m
        Mutex Misses:                           24.85k
        Evict Skips:                            12.79m

ARC Size:                               92.50%  28.25   GiB
        Target Size: (Adaptive)         92.50%  28.25   GiB
        Min Size (Hard Limit):          25.00%  7.64    GiB
        Max Size (High Water):          4:1     30.54   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       62.35%  17.62   GiB
        Frequently Used Cache Size:     37.65%  10.64   GiB

ARC Hash Breakdown:
        Elements Max:                           1.99m
        Elements Current:               99.16%  1.98m
        Collisions:                             8.97m
        Chain Max:                              14
        Chains:                                 586.97k

------------------------------------------------------------------------

ARC Efficiency:                                 1.15b
        Cache Hit Ratio:                97.66%  1.12b
        Cache Miss Ratio:               2.34%   26.80m
        Actual Hit Ratio:               72.75%  833.30m

        Data Demand Efficiency:         98.39%  33.94m
        Data Prefetch Efficiency:       8.11%   7.60m

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             23.88%  267.15m
          Most Recently Used:           4.70%   52.60m
          Most Frequently Used:         69.79%  780.70m
          Most Recently Used Ghost:     0.64%   7.13m
          Most Frequently Used Ghost:   0.98%   10.99m

        CACHE HITS BY DATA TYPE:
          Demand Data:                  2.99%   33.40m
          Prefetch Data:                0.06%   616.42k
          Demand Metadata:              71.38%  798.44m
          Prefetch Metadata:            25.58%  286.13m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  2.04%   546.67k
          Prefetch Data:                26.07%  6.99m
          Demand Metadata:              37.96%  10.18m
          Prefetch Metadata:            33.93%  9.09m

------------------------------------------------------------------------

L2 ARC Summary: (HEALTHY)
        Passed Headroom:                        3.62m
        Tried Lock Failures:                    3.17m
        IO In Progress:                         21.18k
        Low Memory Aborts:                      20
        Free on Write:                          7.07k
        Writes While Full:                      134
        R/W Clashes:                            1.63k
        Bad Checksums:                          0
        IO Errors:                              0
        SPA Mismatch:                           0

L2 ARC Size: (Adaptive)                         22.70   GiB
        Header Size:                    0.31%   71.02   MiB

L2 ARC Breakdown:                               23.78m
        Hit Ratio:                      34.25%  8.15m
        Miss Ratio:                     65.75%  15.64m
        Feeds:                                  63.47k

L2 ARC Buffer:
        Bytes Scanned:                          65.51   TiB
        Buffer Iterations:                      63.47k
        List Iterations:                        4.06m
        NULL List Iterations:                   64.89k

L2 ARC Writes:
        Writes Sent:                    100.00% 29.89k

------------------------------------------------------------------------

File-Level Prefetch: (HEALTHY)

DMU Efficiency:                                 1.24b
        Hit Ratio:                      64.29%  798.62m
        Miss Ratio:                     35.71%  443.54m

        Colinear:                               443.54m
          Hit Ratio:                    0.00%   20.45k
          Miss Ratio:                   100.00% 443.52m

        Stride:                                 772.29m
          Hit Ratio:                    99.99%  772.21m
          Miss Ratio:                   0.01%   81.30k

DMU Misc:
        Reclaim:                                443.52m
          Successes:                    0.05%   220.47k
          Failures:                     99.95%  443.30m

        Streams:                                26.42m
          +Resets:                      0.05%   12.73k
          -Resets:                      99.95%  26.41m
          Bogus:                                0
------------------------------------------------------------------------

VDEV cache is disabled

------------------------------------------------------------------------

ZFS Tunables (sysctl):
        kern.maxusers                           384
        vm.kmem_size                            66662760448
        vm.kmem_size_scale                      1
        vm.kmem_size_min                        0
        vm.kmem_size_max                        329853485875
        vfs.zfs.l2c_only_size                   5242113536
        vfs.zfs.mfu_ghost_data_lsize            178520064
        vfs.zfs.mfu_ghost_metadata_lsize        6486959104
        vfs.zfs.mfu_ghost_size                  6665479168
        vfs.zfs.mfu_data_lsize                  11863127552
        vfs.zfs.mfu_metadata_lsize              123386368
        vfs.zfs.mfu_size                        12432947200
        vfs.zfs.mru_ghost_data_lsize            14095171584
        vfs.zfs.mru_ghost_metadata_lsize        8351076864
        vfs.zfs.mru_ghost_size                  22446248448
        vfs.zfs.mru_data_lsize                  2076449280
        vfs.zfs.mru_metadata_lsize              4655490560
        vfs.zfs.mru_size                        7074721792
        vfs.zfs.anon_data_lsize                 0
        vfs.zfs.anon_metadata_lsize             0
        vfs.zfs.anon_size                       1605632
        vfs.zfs.l2arc_norw                      1
        vfs.zfs.l2arc_feed_again                1
        vfs.zfs.l2arc_noprefetch                1
        vfs.zfs.l2arc_feed_min_ms               200
        vfs.zfs.l2arc_feed_secs                 1
        vfs.zfs.l2arc_headroom                  2
        vfs.zfs.l2arc_write_boost               52428800
        vfs.zfs.l2arc_write_max                 26214400
        vfs.zfs.arc_meta_limit                  16398159872
        vfs.zfs.arc_meta_used                   16398120264
        vfs.zfs.arc_min                         8199079936
        vfs.zfs.arc_max                         32796319744
        vfs.zfs.dedup.prefetch                  1
        vfs.zfs.mdcomp_disable                  0
        vfs.zfs.write_limit_override            0
        vfs.zfs.write_limit_inflated            206088929280
        vfs.zfs.write_limit_max                 8587038720
        vfs.zfs.write_limit_min                 33554432
        vfs.zfs.write_limit_shift               3
        vfs.zfs.no_write_throttle               0
        vfs.zfs.zfetch.array_rd_sz              1048576
        vfs.zfs.zfetch.block_cap                256
        vfs.zfs.zfetch.min_sec_reap             2
        vfs.zfs.zfetch.max_streams              8
        vfs.zfs.prefetch_disable                0
        vfs.zfs.mg_alloc_failures               12
        vfs.zfs.check_hostid                    1
        vfs.zfs.recover                         0
        vfs.zfs.txg.synctime_ms                 1000
        vfs.zfs.txg.timeout                     5
        vfs.zfs.vdev.cache.bshift               16
        vfs.zfs.vdev.cache.size                 0
        vfs.zfs.vdev.cache.max                  16384
        vfs.zfs.vdev.write_gap_limit            4096
        vfs.zfs.vdev.read_gap_limit             32768
        vfs.zfs.vdev.aggregation_limit          131072
        vfs.zfs.vdev.ramp_rate                  2
        vfs.zfs.vdev.time_shift                 6
        vfs.zfs.vdev.min_pending                4
        vfs.zfs.vdev.max_pending                128
        vfs.zfs.vdev.bio_flush_disable          0
        vfs.zfs.cache_flush_disable             0
        vfs.zfs.zil_replay_disable              0
        vfs.zfs.zio.use_uma                     0
        vfs.zfs.snapshot_list_prefetch          0
        vfs.zfs.version.zpl                     5
        vfs.zfs.version.spa                     28
        vfs.zfs.version.acl                     1
        vfs.zfs.debug                           0
        vfs.zfs.super_owner                     0
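In case it's useful to anyone following along, the raw counters behind the L2 ARC section above are exposed as sysctls under kstat.zfs.misc.arcstats, so a crude way to watch the cache warm up over time is something like this (the interval is arbitrary):

    while :; do
        date
        # raw L2ARC size and hit/miss counters that zfs-stats summarizes
        sysctl kstat.zfs.misc.arcstats.l2_size \
               kstat.zfs.misc.arcstats.l2_hits \
               kstat.zfs.misc.arcstats.l2_misses
        sleep 60
    done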