From: Freddie Cash <fjwcash@gmail.com>
To: Brendan Gregg
Cc: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Date: Wed, 8 May 2013 14:45:49 -0700
Subject: Re: Strange slowdown when cache devices enabled in ZFS

On Wed, May 8, 2013 at 2:35 PM, Brendan Gregg wrote:

> Freddie Cash wrote (Mon Apr 29 16:01:55 UTC 2013):
> |
> | The following settings in /etc/sysctl.conf prevent the "stalls" completely,
> | even when the L2ARC devices are 100% full and all RAM is wired into the
> | ARC. Been running without issues for 5 days now:
> |
> | vfs.zfs.l2arc_norw=0                 # Default is 1
> | vfs.zfs.l2arc_feed_again=0           # Default is 1
> | vfs.zfs.l2arc_noprefetch=0           # Default is 0
> | vfs.zfs.l2arc_feed_min_ms=1000       # Default is 200
> | vfs.zfs.l2arc_write_boost=320000000  # Default is 8 MBps
> | vfs.zfs.l2arc_write_max=160000000    # Default is 8 MBps
> |
> | With these settings, I'm also able to expand the ARC to use the full 128 GB
> | of RAM in the biggest box, and to use both L2ARC devices (60 GB in total).
> | And, can set primarycache and secondarycache to all (the default) instead
> | of just metadata.
> | [...]
>
> The thread earlier described a 100% CPU-bound l2arc_feed_thread, which
> could be caused by these settings:
>
>   vfs.zfs.l2arc_write_boost=320000000  # Default is 8 MBps
>   vfs.zfs.l2arc_write_max=160000000    # Default is 8 MBps
>
> If I'm reading that correctly, it's increasing the write max and boost to
> be 160 Mbytes and 320 Mbytes. To satisfy these, the L2ARC must scan memory
> from the tail of the ARC lists, lists which may be composed of tiny buffers
> (e.g., 8k). Increasing that scan 20-fold could saturate a CPU. And if it
> doesn't find many bytes to write out, it will rescan the same buffers on
> the next interval, wasting CPU cycles.
>
> I understand the intent was probably to warm up the L2ARC faster. There is
> no easy way to do this: you are bounded by the throughput of random reads
> from the pool disks.
>
> Random read workloads usually have a 4 - 16 Kbyte record size. The l2arc
> feed thread can't eat uncached data faster than the random reads can be
> read from disk. Therefore, at 8 Kbytes, you need at least 1,000 random read
> disk IOPS to achieve a rate of 8 Mbytes/sec from the ARC list tails, which,
> for rotational disks performing roughly 100 random IOPS (use a different
> rate if you like), means about a dozen disks - depending on the ZFS RAID
> config. All to feed at 8 Mbytes/sec. This is why 8 Mbytes/sec (plus the
> boost) is the default.
>
> To feed at 160 Mbytes/sec, with an 8 Kbyte recsize, you'll need at least
> 20,000 random read disk IOPS. How many spindles does that take? A lot. Do
> you have a lot?
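
As a rough illustration of that spindle math, here is a minimal Python
sketch. It only restates the arithmetic above: the 8 Kbyte record size and
~100 random read IOPS per disk are the assumptions from the quoted
paragraphs, raidz/vdev layout and ARC hit rates are ignored, and the names
and numbers are illustrative, not measurements:

  # Back-of-the-envelope L2ARC feed sizing (illustrative only; ignores
  # raidz/vdev overhead and assumes every fed buffer was an uncached read).
  RECORD_SIZE = 8 * 1024       # bytes per random read
  IOPS_PER_DISK = 100          # rough figure for a 7200 rpm SATA drive

  def disks_needed(feed_bytes_per_sec):
      """Spindles needed to keep the l2arc feed thread supplied with data."""
      iops_needed = feed_bytes_per_sec / RECORD_SIZE
      return iops_needed / IOPS_PER_DISK

  print(disks_needed(8 * 1024 * 1024))    # default  8 Mbytes/sec -> ~10 disks
  print(disks_needed(160 * 1000 * 1000))  # tuned  160 Mbytes/sec -> ~195 disks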

45x 2 TB SATA hard drives, configured as raidz2 vdevs of 6 disks each, for a
total of 7 vdevs (with a few spare disks), plus 2x SSD for log+OS and 2x SSD
for cache. With plans to expand that out with another 45-disk JBOD next
summer-ish (2014).

With the settings above, I get 120 MBps of writes to the L2ARC until each
SSD is over 90% full (right after boot); then it settles around 5-10 MBps
while receiving snapshots from the other 3 servers.

I guess I could change the settings to make the _boost 100-odd MBps and
leave the _max at the default.

I'll play with the l2arc_write_* settings to see if that makes a difference
with l2arc_norw enabled.

-- 
Freddie Cash
fjwcash@gmail.com