Date: Wed, 8 May 2013 14:45:49 -0700
From: Freddie Cash <fjwcash@gmail.com>
To: Brendan Gregg <brendan.gregg@joyent.com>
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Subject: Re: Strange slowdown when cache devices enabled in ZFS
Message-ID: <CAOjFWZ6CzbYSSnso-rqDWaA=VxcDBx+KG=6KX3oT2ijbECm=sQ@mail.gmail.com>
In-Reply-To: <CA+XzFFgG+Js2w+HJFXXd=opsdnR7Z0n1ThPPtMM1qFsPg-dsaQ@mail.gmail.com>
References: <CA+XzFFgG+Js2w+HJFXXd=opsdnR7Z0n1ThPPtMM1qFsPg-dsaQ@mail.gmail.com>
On Wed, May 8, 2013 at 2:35 PM, Brendan Gregg <brendan.gregg@joyent.com> wrote:

> Freddie Cash wrote (Mon Apr 29 16:01:55 UTC 2013):
> |
> | The following settings in /etc/sysctl.conf prevent the "stalls"
> | completely, even when the L2ARC devices are 100% full and all RAM is
> | wired into the ARC. Been running without issues for 5 days now:
> |
> | vfs.zfs.l2arc_norw=0                 # Default is 1
> | vfs.zfs.l2arc_feed_again=0           # Default is 1
> | vfs.zfs.l2arc_noprefetch=0           # Default is 0
> | vfs.zfs.l2arc_feed_min_ms=1000       # Default is 200
> | vfs.zfs.l2arc_write_boost=320000000  # Default is 8 MBps
> | vfs.zfs.l2arc_write_max=160000000    # Default is 8 MBps
> |
> | With these settings, I'm also able to expand the ARC to use the full
> | 128 GB of RAM in the biggest box, and to use both L2ARC devices (60 GB
> | in total). And I can set primarycache and secondarycache to all (the
> | default) instead of just metadata.
> | [...]
>
> The thread earlier described a 100% CPU-bound l2arc_feed_thread, which
> could be caused by these settings:
>
>   vfs.zfs.l2arc_write_boost=320000000  # Default is 8 MBps
>   vfs.zfs.l2arc_write_max=160000000    # Default is 8 MBps
>
> If I'm reading that correctly, it's increasing the write max and boost to
> 160 Mbytes and 320 Mbytes. To satisfy these, the L2ARC must scan memory
> from the tail of the ARC lists, lists which may be composed of tiny
> buffers (e.g., 8k). Increasing that scan 20-fold could saturate a CPU.
> And if it doesn't find many bytes to write out, it will rescan the same
> buffers on the next interval, wasting CPU cycles.
>
> I understand the intent was probably to warm up the L2ARC faster. There
> is no easy way to do this: you are bounded by the throughput of random
> reads from the pool disks.
>
> Random read workloads usually have a 4 - 16 Kbyte record size. The l2arc
> feed thread can't eat uncached data faster than the random reads can be
> read from disk. Therefore, at 8 Kbytes, you need at least 1,000 random
> read disk IOPS to achieve a rate of 8 Mbytes/sec from the ARC list tails,
> which, for rotational disks performing roughly 100 random IOPS (use a
> different rate if you like), means about a dozen disks - depending on the
> ZFS RAID config. All to feed at 8 Mbytes/sec. This is why 8 Mbytes/sec
> (plus the boost) is the default.
>
> To feed at 160 Mbytes/sec, with an 8 Kbyte recsize, you'll need at least
> 20,000 random read disk IOPS. How many spindles does that take? A lot.
> Do you have a lot?

45x 2 TB SATA harddrives, configured in raidz2 vdevs of 6 disks each, for a
total of 7 vdevs (with a few spare disks), plus 2x SSD for log+OS and 2x SSD
for cache. There are plans to expand that with another 45-disk JBOD next
summer-ish (2014).

With the settings above, I get 120 MBps of writes to the L2ARC until each
SSD is over 90% full (boost), then it settles around 5-10 MBps while
receiving snapshots from the other 3 servers.

I guess I could change the settings to make the _boost 100-odd MBps and
leave the _max at the default. I'll play with the l2arc_write_* settings to
see if that makes a difference with l2arc_norw enabled.

--
Freddie Cash
fjwcash@gmail.com
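As a quick check of Brendan's arithmetic above, here is a minimal Python
sketch of the feed-rate bound. It assumes his illustrative figures of an
8 Kbyte record size and roughly 100 random read IOPS per rotational disk;
neither number is a measurement from the pool discussed in this thread.

    # Back-of-the-envelope for the L2ARC feed-rate argument: the feed thread
    # can only write out buffers that random reads have already pulled into
    # the ARC, so the sustainable feed rate is bounded by
    # (random-read IOPS * record size).

    RECORD_SIZE = 8 * 1024   # assumed 8 Kbyte random-read record size
    IOPS_PER_DISK = 100      # assumed random-read IOPS for one rotational disk

    def spindles_for_feed_rate(feed_bytes_per_sec):
        """Disk IOPS and spindle count needed to sustain an L2ARC feed rate."""
        iops = feed_bytes_per_sec / RECORD_SIZE
        return iops, iops / IOPS_PER_DISK

    # Default l2arc_write_max (8 MB/s) vs. the tuned value from this thread.
    for rate in (8_000_000, 160_000_000):
        iops, disks = spindles_for_feed_rate(rate)
        print(f"{rate / 1_000_000:>5.0f} MB/s feed -> ~{iops:,.0f} "
              f"random-read IOPS -> ~{disks:.0f} spindles")

That roughly reproduces the "about a dozen disks" and 20,000-IOPS figures
above; the behaviour of the actual pool will also depend on how raidz2
spreads random reads across each 6-disk vdev.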