Date: Tue, 19 Feb 2013 14:10:47 -0600
From: Kevin Day <toasty@dragondata.com>
To: Peter Jeremy <peter@rulingia.com>
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Subject: Re: Improving ZFS performance for large directories
Message-ID: <19E0C908-79F1-43F8-899C-6B60F998D4A5@dragondata.com>
In-Reply-To: <20130201192416.GA76461@server.rulingia.com>
References: <19DB8F4A-6788-44F6-9A2C-E01DEA01BED9@dragondata.com> <CAJjvXiE%2B8OMu_yvdRAsWugH7W=fhFW7bicOLLyjEn8YrgvCwiw@mail.gmail.com> <F4420A8C-FB92-4771-B261-6C47A736CF7F@dragondata.com> <20130201192416.GA76461@server.rulingia.com>
Sorry for the late followup; I've been doing some testing with an L2ARC device.

>> Doing it twice back-to-back makes a bit of difference but it's still
>> slow either way.
>
> ZFS can be very conservative about caching data and twice might not be
> enough. I suggest you try 8-10 times, or until the time stops reducing.

Timing an "ls" in large directories 20 times, the first run is the slowest and all subsequent listings are roughly the same. There doesn't appear to be any further gain after 20 repetitions.

>> I think some of the issue is that nothing is being allowed to stay
>> cached long.
>
> Well ZFS doesn't do any time-based eviction, so if things aren't
> staying in the cache, it's because they are being evicted by things
> that ZFS considers more deserving.
>
> Looking at the zfs-stats you posted, it looks like your workload has
> very low locality of reference (the data hit rate is very low). If
> this is not what you expect then you need more RAM. OTOH, your
> vfs.zfs.arc_meta_used being above vfs.zfs.arc_meta_limit suggests that
> ZFS really wants to cache more metadata (by default ZFS has a 25%
> metadata, 75% data split in ARC to prevent metadata caching starving
> data caching). I would go even further than the 50:50 split suggested
> later and try 75:25 (i.e., triple the current vfs.zfs.arc_meta_limit).
>
> Note that if there is basically no locality of reference in your
> workload (as I suspect), you can even turn off data caching for
> specific filesystems with "zfs set primarycache=metadata tank/foo"
> (note that you still need to increase vfs.zfs.arc_meta_limit to
> allow ZFS to use the ARC to cache metadata).

Now that I've got an L2ARC device (250GB), I've been doing some experimenting. With the defaults (primarycache and secondarycache set to "all"), I really didn't see much improvement. The SSD filled itself pretty quickly, but its hit rate was around 1%, even after 48 hours.

Thinking that making the primary cache metadata-only while leaving the secondary cache at "all" would improve things, I wiped the device (SATA secure erase, to make sure) and tried again. This was much worse: I'm guessing that because there was some amount of real file data being read frequently, the SSD was getting hammered for reads at 100% utilization, and things were far slower.

I wiped the SSD and tried again with primarycache=all and secondarycache=metadata, and things have improved. Even after boosting vfs.zfs.l2arc_write_max, it took quite a while before things stabilized. I'm guessing there isn't a huge amount of data, but the locality is so poor and sweeping the entire filesystem takes so long that it takes a while before ZFS decides what's worth caching.

After about 20 hours in this configuration, though, the difference in directory speeds is HUGE. Before adding the SSD, an "ls" in a directory with 65k files would take 10-30 seconds; it's now down to about 0.2 seconds. So I'm guessing the theory was right: there was more metadata than would fit in the ARC, so it was constantly churning. I'm a bit surprised that continually doing an "ls" in a big directory didn't make it stick better, but these filesystems are HUGE, so there may be some inefficiencies happening here. There are roughly 29M files, growing at about 50k files/day. We recently upgraded, and are now at 96 3TB drives in the pool.
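For the record, the per-dataset settings I ended up on look like the commands below (a quick sketch, using Peter's "tank/foo" as a placeholder for the real dataset names; the "zfs get" line is only there to confirm what's currently in effect):

    # check the current cache policy on a dataset
    zfs get primarycache,secondarycache tank/foo

    # what I settled on: data+metadata in the ARC, metadata only in the L2ARC
    zfs set primarycache=all tank/foo
    zfs set secondarycache=metadata tank/foo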
What I also find surprising is this:

L2 ARC Size: (Adaptive)                         22.70   GiB
        Header Size:                    0.31%   71.49   MiB

L2 ARC Breakdown:                               23.77m
        Hit Ratio:                      34.26%  8.14m
        Miss Ratio:                     65.74%  15.62m
        Feeds:                                  63.28k

It's a 250G drive, and only 22G is being used, yet there's still a ~66% miss rate. Is there any way to tell why more metadata isn't being pushed to the L2ARC? I see a pretty high count for "Passed Headroom" and "Tried Lock Failures", but I'm not sure if that's normal (there's a sketch of pulling the raw counters after the stats dump below).

Including the lengthy output of zfs-stats below in case anyone sees something that stands out as being unusual.

------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Feb 19 20:08:19 2013
------------------------------------------------------------------------

System Information:

        Kernel Version:                         901000 (osreldate)
        Hardware Platform:                      amd64
        Processor Architecture:                 amd64

        ZFS Storage pool Version:               28
        ZFS Filesystem Version:                 5

FreeBSD 9.1-RC2 #1: Tue Oct 30 20:37:38 UTC 2012 root
 8:08PM  up 20:40, 3 users, load averages: 0.47, 0.50, 0.52

------------------------------------------------------------------------

System Memory:

        8.41%   5.22    GiB Active,     10.18%  6.32    GiB Inact
        77.39%  48.05   GiB Wired,      1.52%   966.99  MiB Cache
        2.50%   1.55    GiB Free,       0.00%   888.00  KiB Gap

        Real Installed:                         64.00   GiB
        Real Available:                 99.97%  63.98   GiB
        Real Managed:                   97.04%  62.08   GiB

        Logical Total:                          64.00   GiB
        Logical Used:                   86.22%  55.18   GiB
        Logical Free:                   13.78%  8.82    GiB

Kernel Memory:                                  23.18   GiB
        Data:                           99.91%  23.16   GiB
        Text:                           0.09%   21.27   MiB

Kernel Memory Map:                              52.10   GiB
        Size:                           35.21%  18.35   GiB
        Free:                           64.79%  33.75   GiB

------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                10.24m
        Recycle Misses:                         3.48m
        Mutex Misses:                           24.85k
        Evict Skips:                            12.79m

ARC Size:                               92.50%  28.25   GiB
        Target Size: (Adaptive)         92.50%  28.25   GiB
        Min Size (Hard Limit):          25.00%  7.64    GiB
        Max Size (High Water):          4:1     30.54   GiB

ARC Size Breakdown:
        Recently Used Cache Size:       62.35%  17.62   GiB
        Frequently Used Cache Size:     37.65%  10.64   GiB

ARC Hash Breakdown:
        Elements Max:                           1.99m
        Elements Current:               99.16%  1.98m
        Collisions:                             8.97m
        Chain Max:                              14
        Chains:                                 586.97k

------------------------------------------------------------------------

ARC Efficiency:                                 1.15b
        Cache Hit Ratio:                97.66%  1.12b
        Cache Miss Ratio:               2.34%   26.80m
        Actual Hit Ratio:               72.75%  833.30m

        Data Demand Efficiency:         98.39%  33.94m
        Data Prefetch Efficiency:       8.11%   7.60m

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             23.88%  267.15m
          Most Recently Used:           4.70%   52.60m
          Most Frequently Used:         69.79%  780.70m
          Most Recently Used Ghost:     0.64%   7.13m
          Most Frequently Used Ghost:   0.98%   10.99m

        CACHE HITS BY DATA TYPE:
          Demand Data:                  2.99%   33.40m
          Prefetch Data:                0.06%   616.42k
          Demand Metadata:              71.38%  798.44m
          Prefetch Metadata:            25.58%  286.13m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  2.04%   546.67k
          Prefetch Data:                26.07%  6.99m
          Demand Metadata:              37.96%  10.18m
          Prefetch Metadata:            33.93%  9.09m

------------------------------------------------------------------------

L2 ARC Summary: (HEALTHY)
        Passed Headroom:                        3.62m
        Tried Lock Failures:                    3.17m
        IO In Progress:                         21.18k
        Low Memory Aborts:                      20
        Free on Write:                          7.07k
        Writes While Full:                      134
        R/W Clashes:                            1.63k
        Bad Checksums:                          0
        IO Errors:                              0
        SPA Mismatch:                           0

L2 ARC Size: (Adaptive)                         22.70   GiB
        Header Size:                    0.31%   71.02   MiB

L2 ARC Breakdown:                               23.78m
        Hit Ratio:                      34.25%  8.15m
        Miss Ratio:                     65.75%  15.64m
        Feeds:                                  63.47k

L2 ARC Buffer:
        Bytes Scanned:                          65.51   TiB
        Buffer Iterations:                      63.47k
        List Iterations:                        4.06m
        NULL List Iterations:                   64.89k

L2 ARC Writes:
        Writes Sent:                    100.00% 29.89k
------------------------------------------------------------------------

File-Level Prefetch: (HEALTHY)

DMU Efficiency:                                 1.24b
        Hit Ratio:                      64.29%  798.62m
        Miss Ratio:                     35.71%  443.54m

        Colinear:                               443.54m
          Hit Ratio:                    0.00%   20.45k
          Miss Ratio:                   100.00% 443.52m

        Stride:                                 772.29m
          Hit Ratio:                    99.99%  772.21m
          Miss Ratio:                   0.01%   81.30k

DMU Misc:
        Reclaim:                                443.52m
          Successes:                    0.05%   220.47k
          Failures:                     99.95%  443.30m

        Streams:                                26.42m
          +Resets:                      0.05%   12.73k
          -Resets:                      99.95%  26.41m
          Bogus:                                0

------------------------------------------------------------------------

VDEV cache is disabled

------------------------------------------------------------------------

ZFS Tunables (sysctl):
        kern.maxusers                           384
        vm.kmem_size                            66662760448
        vm.kmem_size_scale                      1
        vm.kmem_size_min                        0
        vm.kmem_size_max                        329853485875
        vfs.zfs.l2c_only_size                   5242113536
        vfs.zfs.mfu_ghost_data_lsize            178520064
        vfs.zfs.mfu_ghost_metadata_lsize        6486959104
        vfs.zfs.mfu_ghost_size                  6665479168
        vfs.zfs.mfu_data_lsize                  11863127552
        vfs.zfs.mfu_metadata_lsize              123386368
        vfs.zfs.mfu_size                        12432947200
        vfs.zfs.mru_ghost_data_lsize            14095171584
        vfs.zfs.mru_ghost_metadata_lsize        8351076864
        vfs.zfs.mru_ghost_size                  22446248448
        vfs.zfs.mru_data_lsize                  2076449280
        vfs.zfs.mru_metadata_lsize              4655490560
        vfs.zfs.mru_size                        7074721792
        vfs.zfs.anon_data_lsize                 0
        vfs.zfs.anon_metadata_lsize             0
        vfs.zfs.anon_size                       1605632
        vfs.zfs.l2arc_norw                      1
        vfs.zfs.l2arc_feed_again                1
        vfs.zfs.l2arc_noprefetch                1
        vfs.zfs.l2arc_feed_min_ms               200
        vfs.zfs.l2arc_feed_secs                 1
        vfs.zfs.l2arc_headroom                  2
        vfs.zfs.l2arc_write_boost               52428800
        vfs.zfs.l2arc_write_max                 26214400
        vfs.zfs.arc_meta_limit                  16398159872
        vfs.zfs.arc_meta_used                   16398120264
        vfs.zfs.arc_min                         8199079936
        vfs.zfs.arc_max                         32796319744
        vfs.zfs.dedup.prefetch                  1
        vfs.zfs.mdcomp_disable                  0
        vfs.zfs.write_limit_override            0
        vfs.zfs.write_limit_inflated            206088929280
        vfs.zfs.write_limit_max                 8587038720
        vfs.zfs.write_limit_min                 33554432
        vfs.zfs.write_limit_shift               3
        vfs.zfs.no_write_throttle               0
        vfs.zfs.zfetch.array_rd_sz              1048576
        vfs.zfs.zfetch.block_cap                256
        vfs.zfs.zfetch.min_sec_reap             2
        vfs.zfs.zfetch.max_streams              8
        vfs.zfs.prefetch_disable                0
        vfs.zfs.mg_alloc_failures               12
        vfs.zfs.check_hostid                    1
        vfs.zfs.recover                         0
        vfs.zfs.txg.synctime_ms                 1000
        vfs.zfs.txg.timeout                     5
        vfs.zfs.vdev.cache.bshift               16
        vfs.zfs.vdev.cache.size                 0
        vfs.zfs.vdev.cache.max                  16384
        vfs.zfs.vdev.write_gap_limit            4096
        vfs.zfs.vdev.read_gap_limit             32768
        vfs.zfs.vdev.aggregation_limit          131072
        vfs.zfs.vdev.ramp_rate                  2
        vfs.zfs.vdev.time_shift                 6
        vfs.zfs.vdev.min_pending                4
        vfs.zfs.vdev.max_pending                128
        vfs.zfs.vdev.bio_flush_disable          0
        vfs.zfs.cache_flush_disable             0
        vfs.zfs.zil_replay_disable              0
        vfs.zfs.zio.use_uma                     0
        vfs.zfs.snapshot_list_prefetch          0
        vfs.zfs.version.zpl                     5
        vfs.zfs.version.spa                     28
        vfs.zfs.version.acl                     1
        vfs.zfs.debug                           0
        vfs.zfs.super_owner                     0
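In case it helps anyone poking at the same question, the raw counters that zfs-stats summarizes (including the "Passed Headroom" and "Tried Lock Failures" lines above) can be read straight from the kernel. A quick sketch; the grep just filters the L2ARC-related arcstats entries:

    # raw L2ARC counters: size, hits/misses, feed aborts, etc.
    sysctl kstat.zfs.misc.arcstats | grep l2_

    # the L2ARC feed-rate settings mentioned earlier (bytes per feed interval);
    # the current 25MB/50MB values are visible in the tunables dump above
    sysctl vfs.zfs.l2arc_write_max vfs.zfs.l2arc_write_boost

And for reference, persisting the metadata-oriented tuning across reboots would look roughly like the loader.conf sketch below. The arc_meta_limit figure is just arithmetic on the arc_max shown above (75% of 32796319744 is 24597239808, per Peter's 75:25 suggestion; I'm currently at 50%), so treat the exact values as illustrative rather than tested recommendations:

    # /boot/loader.conf (illustrative values)
    # ~75% of the arc_max above, to let metadata use more of the ARC
    vfs.zfs.arc_meta_limit="24597239808"
    # feed the L2ARC faster than the defaults while it warms up
    vfs.zfs.l2arc_write_max="26214400"
    vfs.zfs.l2arc_write_boost="52428800"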
