Date: Thu, 14 Mar 2013 11:13:38 -0700
From: Freddie Cash <fjwcash@gmail.com>
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Subject: Strange slowdown when cache devices enabled in ZFS
Message-ID: <CAOjFWZ6Q=Vs3P-kfGysLzSbw4CnfrJkMEka4AqfSrQJFZDP_qw@mail.gmail.com>
3 storage systems are running this:

# uname -a
FreeBSD alphadrive.sd73.bc.ca 9.1-STABLE FreeBSD 9.1-STABLE #0 r245466M: Fri Feb 1 09:38:24 PST 2013 root@alphadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST amd64

1 storage system is running this:

# uname -a
FreeBSD omegadrive.sd73.bc.ca 9.1-STABLE FreeBSD 9.1-STABLE #0 r247804M: Mon Mar 4 10:27:26 PST 2013 root@omegadrive.sd73.bc.ca:/usr/obj/usr/src/sys/ZFSHOST amd64

The last system also has the ZFS "deadman" patch (r247265 from -CURRENT) merged in manually.

All 4 systems exhibit the same symptoms: if a cache device is enabled in the pool, the l2arc_feed_thread of zfskern spins until it takes up 100% of a CPU core, at which point all I/O to the pool stops. "zpool iostat 1" and "zpool iostat -v 1" show 0 reads and 0 writes to the pool, and "gstat -I 1s -f gpt" shows 0 activity on the pool disks. If I remove the cache device from the pool, I/O starts up right away (although it takes several minutes for the remove operation to complete).

During the "0 I/O period", any attempt to access the pool hangs. CTRL+T shows the process waiting on either spa_namespace_lock or one of the tx->tx_* conditions (the one you hit when trying to write a transaction to disk), and it stays like that until the cache device is removed.

Hardware is almost the same in all 4 boxes.

3x storage boxes:

alphadrive:
  - SuperMicro H8DGi-F motherboard
  - AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
  - 64 GB of DDR3 ECC SDRAM
  - 32 GB SSD for the OS and cache device (GPT partitioned)
  - 24x 2.0 TB WD and Seagate SATA harddrives (4x 6-drive raidz2 vdevs)
  - SuperMicro AOC-USAS-8i SATA controller using the mpt driver
  - SuperMicro 4U chassis

betadrive:
  - SuperMicro H8DGi-F motherboard
  - AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
  - 48 GB of DDR3 ECC SDRAM
  - 32 GB SSD for the OS and cache device (GPT partitioned)
  - 16x 2.0 TB WD and Seagate SATA harddrives (3x 5-drive raidz2 vdevs + spare)
  - SuperMicro AOC-USAS2-8i SATA controller using the mps driver
  - SuperMicro 3U chassis

zuludrive:
  - SuperMicro H8DGi-F motherboard
  - AMD Opteron 6128 CPU (8 cores at 2.0 GHz)
  - 32 GB of DDR3 ECC SDRAM
  - 32 GB SSD for the OS and cache device (GPT partitioned)
  - 24x 2.0 TB WD and Seagate SATA harddrives (4x 6-drive raidz2 vdevs)
  - SuperMicro AOC-USAS2-8i SATA controller using the mps driver
  - SuperMicro 836 chassis

1x storage box:

omegadrive:
  - SuperMicro H8DG6-F motherboard
  - 2x AMD Opteron 6128 CPU (8 cores at 2.0 GHz; 16 cores total)
  - 128 GB of DDR3 ECC SDRAM
  - 2x 60 GB SSD for the OS (gmirror'd) and log devices (ZFS mirror)
  - 2x 120 GB SSD for cache devices
  - 45x 2.0 TB WD and Seagate SATA harddrives (7x 6-drive raidz2 vdevs + 3 spares)
  - LSI 9211-8e SAS controllers using the mps driver
  - Onboard LSI 2008 SATA controller using the mps driver for OS/log/cache
  - SuperMicro 4U JBOD chassis
  - SuperMicro 2U chassis for motherboard/OS

alphadrive, betadrive, and omegadrive all have dedup and lzjb compression enabled. zuludrive has lzjb compression enabled (no dedup).

alpha/beta/zulu do rsync backups every night from various local and remote Linux and FreeBSD boxes, then ZFS send the snapshots to omegadrive during the day. The "0 I/O periods" occur most often and most quickly on omegadrive when receiving snapshots, but they eventually occur on all systems during the rsyncs.
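To make the symptom and workaround concrete, the sequence on one box looks roughly like this (the pool name "tank" and the cache label "gpt/cache0" are placeholders, not the real names here, and top -SH is just one way to spot the kernel thread):

Watch the spinning kernel thread (l2arc_feed_thread shows up under zfskern):
  # top -SH

Confirm that the pool has gone idle:
  # zpool iostat -v tank 1
  # gstat -I 1s -f gpt

Workaround: drop the cache device; I/O resumes once the remove completes:
  # zpool remove tank gpt/cache0
  # zpool status tank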
Things I've tried:
  - limiting ARC to only 32 GB on each system
  - limiting L2ARC to 30 GB on each system
  - enabling the "deadman" patch in case it was I/O requests being lost by the drives/controllers
  - changing primarycache between all and metadata
  - increasing arc_meta_limit to just shy of arc_max (these settings are sketched at the end of this message)
  - removing the cache devices completely

So far, only the last option works. Without L2ARC, the systems are 100% stable, and can push 200 MB/s of rsync writes and just shy of 500 MB/s of ZFS recv (that saturates the gigabit link and bursts the writes; it usually hovers around 50-80 MB/s of continuous writes).

I'm baffled. An L2ARC is supposed to make things faster, especially when using dedup, since the DDT can be cached.

--
Freddie Cash
fjwcash@gmail.com
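For reference, the ARC tuning mentioned above boils down to something like the following; the sizes are examples and "tank" is a placeholder pool name, so treat this as a sketch rather than the exact settings on these boxes.

/boot/loader.conf (boot-time tunables; the equivalent byte counts work as well):
  vfs.zfs.arc_max="32G"
  vfs.zfs.arc_meta_limit="30G"

Switching what the ARC caches, per pool or dataset:
  # zfs set primarycache=metadata tank
  # zfs set primarycache=all tank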