From: Freddie Cash <fjwcash@gmail.com>
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Date: Thu, 14 Mar 2013 11:13:38 -0700
Subject: Strange slowdown when cache devices enabled in ZFS

3 storage systems are running this:

# uname -a
FreeBSD alphadrive.sd73.bc.ca 9.1-STABLE FreeBSD 9.1-STABLE #0 r245466M: Fri Feb 1 09:38:24 PST 2013 root@alphadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST amd64

1 storage system is running this:

# uname -a
FreeBSD omegadrive.sd73.bc.ca 9.1-STABLE FreeBSD 9.1-STABLE #0 r247804M: Mon Mar 4 10:27:26 PST 2013 root@omegadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST amd64

The last system has the ZFS "deadman" patch (r247265 from -CURRENT) manually merged in.

All 4 systems exhibit the same symptoms: if a cache device is enabled in the pool, the l2arc_feed_thread of zfskern spins until it consumes 100% of a CPU core, at which point all I/O to the pool stops. "zpool iostat 1" and "zpool iostat -v 1" show 0 reads and 0 writes to the pool. "gstat -I 1s -f gpt" shows 0 activity on the pool disks. If I remove the cache device from the pool, I/O starts up right away (although the remove operation takes several minutes to complete).

During the "0 I/O period", any attempt to access the pool "hangs". CTRL+T shows the process waiting on either spa_namespace_lock or tx->tx_something-or-other (the wait channel shown when trying to write a transaction to disk). It stays like that until the cache device is removed.
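For reference, this is roughly how I watch it happen ("storage" stands in for the real pool name, and the gstat filter assumes gpt-labelled partitions; adjust both to your setup):

# top -SHb | grep l2arc         <- zfskern{l2arc_feed_thread} pegged at ~100% WCPU
# zpool iostat -v storage 1     <- 0 reads and 0 writes on every vdev
# gstat -I 1s -f gpt            <- 0 ops/s on the pool disks
# procstat -kk <zfskern-pid>    <- kernel stacks; similar info to what CTRL+T prints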
Hardware is almost the same in all 4 boxes:

3x storage boxes:

alphadrive:
  SuperMicro H8DGi-F motherboard
  AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
  64 GB of DDR3 ECC SDRAM
  32 GB SSD for the OS and cache device (GPT partitioned)
  24x 2.0 TB WD and Seagate SATA harddrives (4x 6-drive raidz2 vdevs)
  SuperMicro AOC-USAS-8i SATA controller using mpt driver
  SuperMicro 4U chassis

betadrive:
  SuperMicro H8DGi-F motherboard
  AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
  48 GB of DDR3 ECC SDRAM
  32 GB SSD for the OS and cache device (GPT partitioned)
  16x 2.0 TB WD and Seagate SATA harddrives (3x 5-drive raidz2 vdevs + spare)
  SuperMicro AOC-USAS2-8i SATA controller using mps driver
  SuperMicro 3U chassis

zuludrive:
  SuperMicro H8DGi-F motherboard
  AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
  32 GB of DDR3 ECC SDRAM
  32 GB SSD for the OS and cache device (GPT partitioned)
  24x 2.0 TB WD and Seagate SATA harddrives (4x 6-drive raidz2 vdevs)
  SuperMicro AOC-USAS2-8i SATA controller using mps driver
  SuperMicro 836 chassis

1x storage box:

omegadrive:
  SuperMicro H8DG6-F motherboard
  2x AMD Opteron 6128 CPUs (8 cores at 2.0 GHz; 16 cores total)
  128 GB of DDR3 ECC SDRAM
  2x 60 GB SSDs for the OS (gmirror'd) and log devices (ZFS mirror)
  2x 120 GB SSDs for cache devices
  45x 2.0 TB WD and Seagate SATA harddrives (7x 6-drive raidz2 vdevs + 3 spares)
  LSI 9211-8e SAS controllers using mps driver
  Onboard LSI 2008 SATA controller using mps driver for OS/log/cache
  SuperMicro 4U JBOD chassis
  SuperMicro 2U chassis for motherboard/OS

alphadrive, betadrive, and omegadrive all have dedup and lzjb compression enabled. zuludrive has lzjb compression enabled (no dedup).

alpha/beta/zulu run rsync backups every night from various local and remote Linux and FreeBSD boxes, then ZFS send the snapshots to omegadrive during the day. The "0 I/O periods" occur most often and most quickly on omegadrive while receiving snapshots, but eventually occur on all systems during the rsyncs.

Things I've tried (exact settings sketched in the P.S. below):
- limiting ARC to only 32 GB on each system
- limiting L2ARC to 30 GB on each system
- enabling the "deadman" patch in case I/O requests were being lost by the drives/controllers
- changing primarycache between all and metadata
- increasing arc_meta_limit to just shy of arc_max
- removing the cache devices completely

So far, only the last option works. Without L2ARC, the systems are 100% stable and can push 200 MB/s of rsync writes and just shy of 500 MB/s of ZFS recv (saturates the gigabit link, bursting writes; usually hovering around 50-80 MB/s of continuous writes).

I'm baffled. An L2ARC is supposed to make things faster, especially with dedup enabled, since the DDT can be cached.

-- 
Freddie Cash
fjwcash@gmail.com
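P.S. A rough sketch of the settings behind the list above, in case anyone wants to compare. The ARC limits are boot-time tunables; the primarycache and cache-device changes are done live. "storage" and the gpt labels are placeholders for the real pool and partition names, and the byte values are illustrative:

In /boot/loader.conf:

  vfs.zfs.arc_max="34359738368"         # cap ARC at 32 GB
  vfs.zfs.arc_meta_limit="34000000000"  # just shy of arc_max

At runtime, per pool:

  # zfs set primarycache=metadata storage   <- or =all to flip it back
  # zpool remove storage gpt/cache0         <- drop the L2ARC device
  # zpool add storage cache gpt/cache0      <- re-add it later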