From: Freddie Cash <fjwcash@gmail.com>
To: Brendan Gregg
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Date: Wed, 8 May 2013 14:45:49 -0700
Subject: Re: Strange slowdown when cache devices enabled in ZFS

On Wed, May 8, 2013 at 2:35 PM, Brendan Gregg wrote:

> Freddie Cash wrote (Mon Apr 29 16:01:55 UTC 2013):
> |
> | The following settings in /etc/sysctl.conf prevent the "stalls" completely,
> | even when the L2ARC devices are 100% full and all RAM is wired into the
> | ARC. Been running without issues for 5 days now:
> |
> | vfs.zfs.l2arc_norw=0                 # Default is 1
> | vfs.zfs.l2arc_feed_again=0           # Default is 1
> | vfs.zfs.l2arc_noprefetch=0           # Default is 0
> | vfs.zfs.l2arc_feed_min_ms=1000       # Default is 200
> | vfs.zfs.l2arc_write_boost=320000000  # Default is 8 MBps
> | vfs.zfs.l2arc_write_max=160000000    # Default is 8 MBps
> |
> | With these settings, I'm also able to expand the ARC to use the full 128 GB
> | of RAM in the biggest box, and to use both L2ARC devices (60 GB in total).
> | And, can set primarycache and secondarycache to all (the default) instead
> | of just metadata.
> | [...]
>
> The thread earlier described a 100% CPU-bound l2arc_feed_thread, which
> could be caused by these settings:
>
>   vfs.zfs.l2arc_write_boost=320000000  # Default is 8 MBps
>   vfs.zfs.l2arc_write_max=160000000    # Default is 8 MBps
>
> If I'm reading that correctly, it's increasing the write max and boost to
> be 160 Mbytes and 320 Mbytes. To satisfy these, the L2ARC must scan memory
> from the tail of the ARC lists, lists which may be composed of tiny buffers
> (e.g., 8k). Increasing that scan 20-fold could saturate a CPU. And if it
> doesn't find many bytes to write out, it will rescan the same buffers on
> the next interval, wasting CPU cycles.
>
> I understand the intent was probably to warm up the L2ARC faster. There is
> no easy way to do this: you are bounded by the throughput of random reads
> from the pool disks.
>
> Random read workloads usually have a 4 - 16 Kbyte record size. The l2arc
> feed thread can't eat uncached data faster than the random reads can be
> read from disk. Therefore, at 8 Kbytes, you need at least 1,000 random read
> disk IOPS to achieve a rate of 8 Mbytes/sec from the ARC list tails, which,
> for rotational disks performing roughly 100 random IOPS (use a different
> rate if you like), means about a dozen disks - depending on the ZFS RAID
> config. All to feed at 8 Mbytes/sec. This is why 8 Mbytes/sec (plus the
> boost) is the default.
>
> To feed at 160 Mbytes/sec, with an 8 Kbyte recsize, you'll need at least
> 20,000 random read disk IOPS. How many spindles does that take? A lot. Do
> you have a lot?
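
As a rough illustration of that spindle math, here is a minimal Python
sketch. It only restates the arithmetic above: the 8 Kbyte record size and
~100 random read IOPS per disk are the assumptions from the quoted
paragraphs, raidz/vdev layout and ARC hit rates are ignored, and the names
and numbers are illustrative, not measurements:

  # Back-of-the-envelope L2ARC feed sizing (illustrative only; ignores
  # raidz/vdev overhead and assumes every fed buffer was an uncached read).
  RECORD_SIZE = 8 * 1024       # bytes per random read
  IOPS_PER_DISK = 100          # rough figure for a 7200 rpm SATA drive

  def disks_needed(feed_bytes_per_sec):
      """Spindles needed to keep the l2arc feed thread supplied with data."""
      iops_needed = feed_bytes_per_sec / RECORD_SIZE
      return iops_needed / IOPS_PER_DISK

  print(disks_needed(8 * 1024 * 1024))    # default  8 Mbytes/sec -> ~10 disks
  print(disks_needed(160 * 1000 * 1000))  # tuned  160 Mbytes/sec -> ~195 disks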

45x 2 TB SATA hard drives, configured as raidz2 vdevs of 6 disks each, for a
total of 7 vdevs (with a few spare disks), plus 2x SSD for log+OS and 2x SSD
for cache. With plans to expand that out with another 45-disk JBOD next
summer-ish (2014).

With the settings above, I get 120 MBps of writes to the L2ARC until each
SSD is over 90% full (right after boot); then it settles around 5-10 MBps
while receiving snapshots from the other 3 servers.

I guess I could change the settings to make the _boost 100-odd MBps and
leave the _max at the default.

I'll play with the l2arc_write_* settings to see if that makes a difference
with l2arc_norw enabled.

-- 
Freddie Cash
fjwcash@gmail.com