Date: Wed, 8 May 2013 14:35:46 -0700
From: Brendan Gregg <brendan.gregg@joyent.com>
To: freebsd-fs@freebsd.org
Subject: Re: Strange slowdown when cache devices enabled in ZFS
Message-ID: <CA+XzFFgG+Js2w+HJFXXd=opsdnR7Z0n1ThPPtMM1qFsPg-dsaQ@mail.gmail.com>
Freddie Cash wrote (Mon Apr 29 16:01:55 UTC 2013):
|
| The following settings in /etc/sysctl.conf prevent the "stalls" completely,
| even when the L2ARC devices are 100% full and all RAM is wired into the
| ARC. Been running without issues for 5 days now:
|
| vfs.zfs.l2arc_norw=0                  # Default is 1
| vfs.zfs.l2arc_feed_again=0            # Default is 1
| vfs.zfs.l2arc_noprefetch=0            # Default is 0
| vfs.zfs.l2arc_feed_min_ms=1000        # Default is 200
| vfs.zfs.l2arc_write_boost=320000000   # Default is 8 MBps
| vfs.zfs.l2arc_write_max=160000000     # Default is 8 MBps
|
| With these settings, I'm also able to expand the ARC to use the full 128 GB
| of RAM in the biggest box, and to use both L2ARC devices (60 GB in total).
| And, can set primarycache and secondarycache to all (the default) instead
| of just metadata.
[...]

The thread earlier described a 100% CPU-bound l2arc_feed_thread, which could
be caused by these settings:

vfs.zfs.l2arc_write_boost=320000000   # Default is 8 MBps
vfs.zfs.l2arc_write_max=160000000     # Default is 8 MBps

If I'm reading that correctly, it's increasing the write max and boost to
160 Mbytes and 320 Mbytes. To satisfy these, the L2ARC must scan memory from
the tail of the ARC lists, lists which may be composed of tiny buffers (eg,
8k). Increasing that scan 20-fold could saturate a CPU. And if it doesn't
find many bytes to write out, it will rescan the same buffers on the next
interval, wasting CPU cycles.

I understand the intent was probably to warm up the L2ARC faster. There is
no easy way to do this: you are bounded by the throughput of random reads
from the pool disks. Random read workloads usually have a 4 - 16 Kbyte
record size. The l2arc feed thread can't consume uncached data faster than
the random reads can be read from disk.
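To put numbers on that bound, here is a back-of-the-envelope sketch (my own
Python illustration, not anything from ZFS itself; the 100 IOPS/disk figure
is just the rotational-disk assumption used below):

```python
def l2arc_feed_demand(feed_bytes_per_sec, record_size=8192, iops_per_disk=100):
    """Random-read IOPS (and spindle count) needed to sustain a feed rate,
    assuming the feed is bounded by random reads from the pool disks."""
    iops = feed_bytes_per_sec / record_size   # random reads per second
    return iops, iops / iops_per_disk         # (IOPS, disks at ~100 IOPS each)

# Default write_max of 8 Mbytes/sec with 8 Kbyte records:
print(l2arc_feed_demand(8 * 1024 * 1024))   # roughly 1,000 IOPS, ~10 disks

# The tuned 160 Mbytes/sec from the sysctls quoted above:
print(l2arc_feed_demand(160_000_000))       # roughly 20,000 IOPS, ~200 disks
```

The exact counts shift with the record size and per-disk IOPS you assume,
but the ratio is the point: the feed rate scales linearly with spindles.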
Therefore, at 8 Kbytes, you need at least 1,000 random read disk IOPS to
achieve a rate of 8 Mbytes/sec from the ARC list tails, which, for
rotational disks performing roughly 100 random IOPS each (use a different
rate if you like), means about a dozen disks, depending on the ZFS RAID
config. All to feed at 8 Mbytes/sec. This is why 8 Mbytes/sec (plus the
boost) is the default.

To feed at 160 Mbytes/sec with an 8 Kbyte recsize, you'll need at least
20,000 random read disk IOPS. How many spindles does that take? A lot. Do
you have a lot?

I wanted to point this out because the warm-up problem isn't the
l2arc_feed_thread (that it scans, how far it scans, whether it rescans,
etc.) - it's the input to the system.

...

I just noticed that https://wiki.freebsd.org/ZFSTuningGuide writes:

"
vfs.zfs.l2arc_write_max
vfs.zfs.l2arc_write_boost

The former value sets the runtime max that data will be loaded into L2ARC.
The latter can be used to accelerate the loading of a freshly booted
system. For a device capable of 400MB/sec, reasonable values might be 200MB
and 380MB respectively. Note that the same caveats apply about these
sysctls and pool imports as the previous one. Setting these values properly
is the difference between an L2ARC subsystem that can take days to heat up
versus one that heats up in minutes.
"

This advice seems a little unwise: you could tune the feed rates that high,
if you have enough spindles to feed it, but I think for most people this
will waste CPU cycles failing to find buffers to cache. Can the author
please double-check?

Brendan

-- 
Brendan Gregg, Joyent
http://dtrace.org/blogs/brendan