From: Ben Kelly <ben@wanderview.com>
To: freebsd-current@freebsd.org
Date: Wed, 18 Mar 2009 15:04:29 -0400
Subject: [patch] zfs livelock and thread priorities

Hello all,

Recently my home server has been going into a livelock state during my
nightly backups.  The server has two zfs pools that look like this:

        NAME                   STATE     READ WRITE CKSUM
        backup                 ONLINE       0     0     0
          mirror               ONLINE       0     0     0
            label/backup0.eli  ONLINE       0     0     0
            label/backup1.eli  ONLINE       0     0     0
            label/backup2.eli  ONLINE       0     0     0

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          ad0s3     ONLINE       0     0     0

I've had the tank pool for quite a while, but just recently added the
backup pool.  Previously I had been backing up to a smaller disk using
UFS+geli.  The server is running a very recent CURRENT.

With this configuration I could get the server into a bad state within
15 minutes by running my rsync backup script.  I call it a livelock,
but I'm not sure that's the right term.  Existing shell sessions would
work for a while but eventually got wedged.  Sometimes the serial
console would stay responsive for a while, but it too eventually
stopped responding to input.  The server continued to respond to pings
and other network traffic, but I was not able to log into new shells.

To try to figure out what was going on I left top running while I
triggered the problem.  The following processes were using all the CPU:

  PID USERNAME  PRI NICE   SIZE    RES STATE    TIME    CPU COMMAND
  210 root       46    -     0K    20K RUN      0:06 16.70% txg_thread_enter
 1207 root       44    0  3280K  1296K RUN      0:05 13.87% syslogd
  889 root       44    0  5436K  2852K RUN      0:04 13.09% apcupsd
   33 root       44    -     0K     8K RUN      0:05 12.99% syncer
 1696 root       44    0  3280K  1344K RUN      0:01 12.79% syslogd
  727 root       44    0  3280K  1496K RUN      0:01 12.60% syslogd
 2065 root       44    0  3280K  1344K RUN      0:01 12.60% syslogd
  209 root       46    -     0K     8K tx->tx   0:02  4.79% txg_thread_enter
 2485 root       44    0  4636K  2664K RUN      0:00  0.10% top
   11 root      171 ki31     0K     8K RUN      0:15  0.00% idle

The exact processes varied, but txg_thread_enter was always in the mix.
The syncer was not always present.
I also noticed that lower in the list a spa_zio process was marked RUN:

 2458 root       49    -     0K     8K RUN      0:00  0.00% spa_zio

Using the serial console I broke into the debugger and got some stack
traces.  All of the spinning processes were in some form of
txg_wait_open() or txg_thread_wait().  For example:

Tracing command syslogd pid 2065 tid 100174 td 0xa3a568c0
sched_switch(a3a568c0,0,104,f744824c,17e,...) at sched_switch+0x1fc
mi_switch(104,0,a3a568c0,a3a568c0,dfc729a0,...) at mi_switch+0x12a
sleepq_switch(a3a568c0,0,9e88745b,247,0,...) at sleepq_switch+0xcb
sleepq_wait(a35181f0,0,a321b2ab,1,0,...) at sleepq_wait+0x39
_cv_wait(a35181f0,a3518180,a321b186,1d3,a35181f0,...) at _cv_wait+0x188
txg_wait_open(a3518000,2542a87,0,0,cc00,...) at txg_wait_open+0xb5
dmu_tx_wait(ab4d7a00,2,0,cbb5,0,...) at dmu_tx_wait+0x138
zfs_freebsd_write(dfc72be0,0,9e88c07e,0,a62f5df4,...) at zfs_freebsd_write+0x3e0
VOP_WRITE_APV(a3221580,dfc72be0,9e88c07e,25a,dfc72c70,...) at VOP_WRITE_APV+0xa6
vn_write(a5689690,bcf74780,a3ec7700,0,a3a568c0,...) at vn_write+0x1b4
dofilewrite(bcf74780,ffffffff,ffffffff,0,a5689690,...) at dofilewrite+0x97
kern_writev(a3a568c0,e,bcf74780,bcf74780,0,...) at kern_writev+0x58
writev(a3a568c0,dfc72cf8,c,9e5f0500,a3a568c0,...) at writev+0x46
syscall(dfc72d38) at syscall+0x325
Xint0x80_syscall() at Xint0x80_syscall+0x20

The one exception was a single txg_thread_enter process that appeared
to be waiting on a zio:

Tracing command txg_thread_enter pid 2469 tid 100252 td 0xa48f4690
sched_switch(a48f4690,0,104,9cb79a82,139,...) at sched_switch+0x1fc
mi_switch(104,0,a48f4690,a48f4690,dfd7ba80,...) at mi_switch+0x12a
sleepq_switch(a48f4690,0,9e88745b,247,0,...) at sleepq_switch+0xcb
sleepq_wait(adff56ec,0,a321cffb,1,0,...) at sleepq_wait+0x39
_cv_wait(adff56ec,adff56d8,a321cf9b,3fe,3b9aca00,...) at _cv_wait+0x188
zio_wait(adff54a8,3ec8,0,0,9f055a60,...) at zio_wait+0x62
dsl_pool_sync(a2d0a400,3ec8,0,0,0,...) at dsl_pool_sync+0x112
spa_sync(a2ff5000,3ec8,0,0,3e8,...) at spa_sync+0x3b5
txg_sync_thread(a2d0a400,dfd7bd38,0,0,0,...) at txg_sync_thread+0x3b4
fork_exit(a3182790,a2d0a400,dfd7bd38) at fork_exit+0x90
fork_trampoline() at fork_trampoline+0x8

All of the spa_zio processes were blocked waiting on task queues except
for the one spa_zio process marked RUN.  It appeared to have been
interrupted:

Tracing command spa_zio pid 2458 tid 100245 td 0xa48d1d20
sched_switch(a48d1d20,0,602,d1125c99,17e,...) at sched_switch+0x1fc
mi_switch(602,0,9e884bce,bc,0,...) at mi_switch+0x12a
critical_exit(9e8f29a8,a48d1d20,9e8f29a8,a2c33580,e,...) at critical_exit+0x92
intr_event_handle(a2c33580,dfd62b54,dfd62b48,c3c85000,e,...) at intr_event_handle+0xaa
intr_execute_handlers(9e8f29a8,dfd62b54,c3c85000,c381b284,dfd62c10,...) at intr_execute_handlers+0x>
atpic_handle_intr(e,dfd62b54) at atpic_handle_intr+0x67
Xatpic_intr14() at Xatpic_intr14+0x21
--- interrupt, eip = 0x9e821e7e, esp = 0xdfd62b94, ebp = 0xdfd62c10 ---
generic_bcopy(a4c4ab68,c1757254,a321c015,134,a4c4ab2c,...) at generic_bcopy+0x1a
vdev_queue_io_done(c1757254,0,a324aad8,a9d0c4c0,a9d0c4c0,...) at vdev_queue_io_done+0xb2
zio_vdev_io_done(c1757254,a31316e2,a324aad8,0,a3214fd8,...) at zio_vdev_io_done+0x72
zio_execute(c1757254,0,a3214fd8,352,6a7334a1,...) at zio_execute+0x62
taskq_thread(a324aab8,dfd62d38,0,0,0,...) at taskq_thread+0x1a0
fork_exit(a3132030,a324aab8,dfd62d38) at fork_exit+0x90
fork_trampoline() at fork_trampoline+0x8
Across all the livelocks I triggered, the spa_zio process was always
marked RUN and appeared to have been interrupted.  Also, if you look at
the top output, it was running at a lower priority than the other
spinning threads (PRI 49 versus 44-46).

It's not clear to me why the top threads were spinning in the txg
calls.  The zfs code, however, does spawn the txg threads at
minclsyspri and the zio threads at maxclsyspri, so the intention seems
to be that the zio threads should preempt the txg processing.
Currently in FreeBSD these priority values are ignored.  In this case,
at least, it looks like the priorities may be required by the zfs
design.

To test this theory I applied the attached patch, which sets the zfs
thread priorities.  I chose the priority values to be similar to PRIBIO
but such that minclsyspri is still a lower priority than maxclsyspri;
the exact values are somewhat arbitrary.  With these changes I have not
been able to trigger the livelock condition again.  (Also, it occurred
to me that this might make the arc_reclaim_thread more responsive to
system memory pressure.)

I should note, however, that the livelock failure is somewhat timing
dependent.  When I enabled WITNESS at one point it changed the timing
to the point where the livelock would only occur after running the
backup for hours instead of minutes.  It's possible the thread priority
change has only changed the timing and not really fixed the problem.
Also, this patch might affect dtrace, but I have not tested that.
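To make the intent easier to see than in the raw diff, here is a rough,
self-contained sketch of the idea.  The attached zfs_thread_priority.diff
is what I actually tested; the function name, its simplified signature,
the spelled-out header names, and the thread_lock()/thread_unlock()
bracket below are illustrative approximations on my part rather than
the literal patch:

/*
 * Rough sketch only; see the attached zfs_thread_priority.diff for the
 * real change.  The opensolaris compat layer already provides proc_t
 * and kthread_t; they are repeated here so the sketch stands on its own.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/priority.h>	/* PRIBIO */
#include <sys/proc.h>		/* FIRST_THREAD_IN_PROC(), thread_lock() */
#include <sys/sched.h>		/* sched_prio() */
#include <sys/kthread.h>	/* kproc_create() */

typedef struct proc	proc_t;
typedef struct thread	kthread_t;

/*
 * FreeBSD priorities are "smaller number == more important", so
 * maxclsyspri gets the numerically smaller value.  Both sit just past
 * PRIBIO so the zfs threads compete like other I/O helper threads
 * instead of running at priority 0.
 */
#define	minclsyspri	(PRIBIO + 2)	/* txg and other housekeeping threads */
#define	maxclsyspri	(PRIBIO + 1)	/* zio threads; should win over txg */

/*
 * Instead of discarding the priority that zfs passes to
 * thread_create(), look up the new kernel thread and apply it.
 * sched_prio() expects the thread lock to be held, hence the
 * thread_lock()/thread_unlock() bracket in this sketch.
 */
static kthread_t *
solaris_thread_create(void (*func)(void *), void *arg, size_t stksize,
    int pri)
{
	proc_t *p;
	kthread_t *t;
	int error;

	error = kproc_create(func, arg, &p, 0, stksize / PAGE_SIZE,
	    "solthread %p", func);
	if (error != 0)
		return (NULL);
	t = FIRST_THREAD_IN_PROC(p);
	thread_lock(t);
	sched_prio(t, pri);
	thread_unlock(t);
	return (t);
}

The vdev_geom and zvol worker threads get the same treatment directly
in the patch: they call sched_prio(curthread, PRIBIO) when they start.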
Does my analysis look reasonable?  Should the spa_zio thread have been
rescheduled even without the priority change?  Is there a better way to
attack this problem?  Any input would be greatly appreciated.

Thank you.

- Ben

[Attachment: zfs_thread_priority.diff]

Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c	(revision 187)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c	(working copy)
@@ -32,6 +32,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -194,6 +195,8 @@
 	zio_t *zio;
 	struct bio *bp;
 
+	sched_prio(curthread, PRIBIO);
+
 	ctx = arg;
 	for (;;) {
 		mtx_lock(&ctx->gc_queue_mtx);
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c	(revision 187)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c	(working copy)
@@ -72,6 +72,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include "zfs_namecheck.h"
@@ -402,6 +403,8 @@
 	zvol_state_t *zv;
 	struct bio *bp;
 
+	sched_prio(curthread, PRIBIO);
+
 	zv = arg;
 	for (;;) {
 		mtx_lock(&zv->zv_queue_mtx);
Index: sys/cddl/compat/opensolaris/sys/proc.h
===================================================================
--- sys/cddl/compat/opensolaris/sys/proc.h	(revision 187)
+++ sys/cddl/compat/opensolaris/sys/proc.h	(working copy)
@@ -35,12 +35,14 @@
 #include
 #include
 #include
+#include
+#include
 
 #ifdef _KERNEL
 
 #define	CPU		curcpu
-#define	minclsyspri	0
-#define	maxclsyspri	0
+#define	minclsyspri	(PRIBIO + 2)
+#define	maxclsyspri	(PRIBIO + 1)
 #define	max_ncpus	mp_ncpus
 #define	boot_max_ncpus	mp_ncpus
 
@@ -60,6 +62,7 @@
 {
 	proc_t *p;
 	int error;
+	kthread_t *t;

 	/*
 	 * Be sure there are no surprises.
 	 */
@@ -70,7 +73,13 @@
 
 	error = kproc_create(proc, arg, &p, 0, stksize / PAGE_SIZE,
 	    "solthread %p", proc);
-	return (error == 0 ? FIRST_THREAD_IN_PROC(p) : NULL);
+	if (error == 0) {
+		t = FIRST_THREAD_IN_PROC(p);
+		sched_prio(t, pri);
+		return t;
+	} else {
+		return NULL;
+	}
 }
 
 #define	thread_exit()	kproc_exit(0)