From: Ben Kelly <ben@wanderview.com>
To: freebsd-current@freebsd.org
Date: Wed, 18 Mar 2009 15:04:29 -0400
Subject: [patch] zfs livelock and thread priorities

Hello all,

Recently my home server has been going into a livelock state during my
nightly backups.  The server has two zfs pools that look like this:

        NAME                   STATE     READ WRITE CKSUM
        backup                 ONLINE       0     0     0
          mirror               ONLINE       0     0     0
            label/backup0.eli  ONLINE       0     0     0
            label/backup1.eli  ONLINE       0     0     0
            label/backup2.eli  ONLINE       0     0     0

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          ad0s3     ONLINE       0     0     0

I've had the tank pool for quite a while, but just recently added the
backup pool.  Previously I had been backing up to a smaller disk using
UFS+geli.  The server is running a very recent CURRENT.

With this configuration I could get the server into a bad state within
15 minutes by running my rsync backup script.  I call it a livelock,
but I'm not sure that's the right term.  Existing shell sessions would
work for a while but eventually got wedged.  Sometimes the serial
console would stay responsive for a while, but it too eventually
stopped responding to input.  The server continued to respond to pings
and other network traffic, but I was not able to log into new shells.

To try to figure out what was going on I left top running while I
triggered the problem.  The following processes were using all the CPU:

  PID USERNAME  PRI NICE   SIZE    RES STATE    TIME    CPU COMMAND
  210 root       46    -     0K    20K RUN      0:06 16.70% txg_thread_enter
 1207 root       44    0  3280K  1296K RUN      0:05 13.87% syslogd
  889 root       44    0  5436K  2852K RUN      0:04 13.09% apcupsd
   33 root       44    -     0K     8K RUN      0:05 12.99% syncer
 1696 root       44    0  3280K  1344K RUN      0:01 12.79% syslogd
  727 root       44    0  3280K  1496K RUN      0:01 12.60% syslogd
 2065 root       44    0  3280K  1344K RUN      0:01 12.60% syslogd
  209 root       46    -     0K     8K tx->tx   0:02  4.79% txg_thread_enter
 2485 root       44    0  4636K  2664K RUN      0:00  0.10% top
   11 root      171 ki31     0K     8K RUN      0:15  0.00% idle

The exact processes varied, but txg_thread_enter was always in the mix.
The syncer was not always present.
I also noticed that lower in the list a spa_zio process was marked RUN:

 2458 root       49    -     0K     8K RUN      0:00  0.00% spa_zio

Using the serial console I broke into the debugger and got some stack
traces.  All of the spinning processes were in some form of
txg_wait_open() or txg_thread_wait().  For example:

Tracing command syslogd pid 2065 tid 100174 td 0xa3a568c0
sched_switch(a3a568c0,0,104,f744824c,17e,...) at sched_switch+0x1fc
mi_switch(104,0,a3a568c0,a3a568c0,dfc729a0,...) at mi_switch+0x12a
sleepq_switch(a3a568c0,0,9e88745b,247,0,...) at sleepq_switch+0xcb
sleepq_wait(a35181f0,0,a321b2ab,1,0,...) at sleepq_wait+0x39
_cv_wait(a35181f0,a3518180,a321b186,1d3,a35181f0,...) at _cv_wait+0x188
txg_wait_open(a3518000,2542a87,0,0,cc00,...) at txg_wait_open+0xb5
dmu_tx_wait(ab4d7a00,2,0,cbb5,0,...) at dmu_tx_wait+0x138
zfs_freebsd_write(dfc72be0,0,9e88c07e,0,a62f5df4,...) at zfs_freebsd_write+0x3e0
VOP_WRITE_APV(a3221580,dfc72be0,9e88c07e,25a,dfc72c70,...) at VOP_WRITE_APV+0xa6
vn_write(a5689690,bcf74780,a3ec7700,0,a3a568c0,...) at vn_write+0x1b4
dofilewrite(bcf74780,ffffffff,ffffffff,0,a5689690,...) at dofilewrite+0x97
kern_writev(a3a568c0,e,bcf74780,bcf74780,0,...) at kern_writev+0x58
writev(a3a568c0,dfc72cf8,c,9e5f0500,a3a568c0,...) at writev+0x46
syscall(dfc72d38) at syscall+0x325
Xint0x80_syscall() at Xint0x80_syscall+0x20

The one exception was a single txg_thread_enter process that appeared
to be waiting on a zio:

Tracing command txg_thread_enter pid 2469 tid 100252 td 0xa48f4690
sched_switch(a48f4690,0,104,9cb79a82,139,...) at sched_switch+0x1fc
mi_switch(104,0,a48f4690,a48f4690,dfd7ba80,...) at mi_switch+0x12a
sleepq_switch(a48f4690,0,9e88745b,247,0,...) at sleepq_switch+0xcb
sleepq_wait(adff56ec,0,a321cffb,1,0,...) at sleepq_wait+0x39
_cv_wait(adff56ec,adff56d8,a321cf9b,3fe,3b9aca00,...) at _cv_wait+0x188
zio_wait(adff54a8,3ec8,0,0,9f055a60,...) at zio_wait+0x62
dsl_pool_sync(a2d0a400,3ec8,0,0,0,...) at dsl_pool_sync+0x112
spa_sync(a2ff5000,3ec8,0,0,3e8,...) at spa_sync+0x3b5
txg_sync_thread(a2d0a400,dfd7bd38,0,0,0,...) at txg_sync_thread+0x3b4
fork_exit(a3182790,a2d0a400,dfd7bd38) at fork_exit+0x90
fork_trampoline() at fork_trampoline+0x8

All of the spa_zio processes were blocked waiting on task queues except
for the one spa_zio process marked RUN.  It appeared to have been
interrupted:

Tracing command spa_zio pid 2458 tid 100245 td 0xa48d1d20
sched_switch(a48d1d20,0,602,d1125c99,17e,...) at sched_switch+0x1fc
mi_switch(602,0,9e884bce,bc,0,...) at mi_switch+0x12a
critical_exit(9e8f29a8,a48d1d20,9e8f29a8,a2c33580,e,...) at critical_exit+0x92
intr_event_handle(a2c33580,dfd62b54,dfd62b48,c3c85000,e,...) at intr_event_handle+0xaa
intr_execute_handlers(9e8f29a8,dfd62b54,c3c85000,c381b284,dfd62c10,...) at intr_execute_handlers+0x>
atpic_handle_intr(e,dfd62b54) at atpic_handle_intr+0x67
Xatpic_intr14() at Xatpic_intr14+0x21
--- interrupt, eip = 0x9e821e7e, esp = 0xdfd62b94, ebp = 0xdfd62c10 ---
generic_bcopy(a4c4ab68,c1757254,a321c015,134,a4c4ab2c,...) at generic_bcopy+0x1a
vdev_queue_io_done(c1757254,0,a324aad8,a9d0c4c0,a9d0c4c0,...) at vdev_queue_io_done+0xb2
zio_vdev_io_done(c1757254,a31316e2,a324aad8,0,a3214fd8,...) at zio_vdev_io_done+0x72
zio_execute(c1757254,0,a3214fd8,352,6a7334a1,...) at zio_execute+0x62
taskq_thread(a324aab8,dfd62d38,0,0,0,...) at taskq_thread+0x1a0
fork_exit(a3132030,a324aab8,dfd62d38) at fork_exit+0x90
fork_trampoline() at fork_trampoline+0x8
Across all the livelocks I triggered, the spa_zio process was always
marked RUN and appeared to have been interrupted.  Also, if you look at
the top output, it was running at a lower priority than the other
spinning threads (PRI 49 versus 44-46).

It's not clear to me why the top threads were spinning in the txg
calls.  The zfs code, however, does spawn the txg threads at
minclsyspri and the zio threads at maxclsyspri, so the intention seems
to be that the zio threads should preempt the txg processing.
Currently in FreeBSD these priority values are ignored.  In this case,
at least, it looks like the priorities may be required by the zfs
design.

To test this theory I applied the attached patch, which sets the zfs
thread priorities.  I chose the priority values to be similar to PRIBIO
but such that minclsyspri is still a lower priority than maxclsyspri;
the exact values are somewhat arbitrary.  With these changes I have not
been able to trigger the livelock condition again.  (Also, it occurred
to me that this might make the arc_reclaim_thread more responsive to
system memory pressure.)

I should note, however, that the livelock failure is somewhat timing
dependent.  When I enabled WITNESS at one point it changed the timing
to the point where the livelock would only occur after running the
backup for hours instead of minutes.  It's possible the thread priority
change has only changed the timing and not really fixed the problem.
Also, this patch might affect dtrace, but I have not tested that.
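To make the intent easier to see than in the raw diff, here is a rough,
self-contained sketch of the idea.  The attached zfs_thread_priority.diff
is what I actually tested; the function name, its simplified signature,
the spelled-out header names, and the thread_lock()/thread_unlock()
bracket below are illustrative approximations on my part rather than
the literal patch:

/*
 * Rough sketch only; see the attached zfs_thread_priority.diff for the
 * real change.  The opensolaris compat layer already provides proc_t
 * and kthread_t; they are repeated here so the sketch stands on its own.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/priority.h>	/* PRIBIO */
#include <sys/proc.h>		/* FIRST_THREAD_IN_PROC(), thread_lock() */
#include <sys/sched.h>		/* sched_prio() */
#include <sys/kthread.h>	/* kproc_create() */

typedef struct proc	proc_t;
typedef struct thread	kthread_t;

/*
 * FreeBSD priorities are "smaller number == more important", so
 * maxclsyspri gets the numerically smaller value.  Both sit just past
 * PRIBIO so the zfs threads compete like other I/O helper threads
 * instead of running at priority 0.
 */
#define	minclsyspri	(PRIBIO + 2)	/* txg and other housekeeping threads */
#define	maxclsyspri	(PRIBIO + 1)	/* zio threads; should win over txg */

/*
 * Instead of discarding the priority that zfs passes to
 * thread_create(), look up the new kernel thread and apply it.
 * sched_prio() expects the thread lock to be held, hence the
 * thread_lock()/thread_unlock() bracket in this sketch.
 */
static kthread_t *
solaris_thread_create(void (*func)(void *), void *arg, size_t stksize,
    int pri)
{
	proc_t *p;
	kthread_t *t;
	int error;

	error = kproc_create(func, arg, &p, 0, stksize / PAGE_SIZE,
	    "solthread %p", func);
	if (error != 0)
		return (NULL);
	t = FIRST_THREAD_IN_PROC(p);
	thread_lock(t);
	sched_prio(t, pri);
	thread_unlock(t);
	return (t);
}

The vdev_geom and zvol worker threads get the same treatment directly
in the patch: they call sched_prio(curthread, PRIBIO) when they start.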
Does my analysis look reasonable?  Should the spa_zio thread have been
rescheduled even without the priority change?  Is there a better way to
attack this problem?  Any input would be greatly appreciated.

Thank you.

- Ben

[Attachment: zfs_thread_priority.diff]

Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c	(revision 187)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c	(working copy)
@@ -32,6 +32,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -194,6 +195,8 @@
 	zio_t *zio;
 	struct bio *bp;
 
+	sched_prio(curthread, PRIBIO);
+
 	ctx = arg;
 	for (;;) {
 		mtx_lock(&ctx->gc_queue_mtx);
Index: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
===================================================================
--- sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c	(revision 187)
+++ sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c	(working copy)
@@ -72,6 +72,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include "zfs_namecheck.h"
@@ -402,6 +403,8 @@
 	zvol_state_t *zv;
 	struct bio *bp;
 
+	sched_prio(curthread, PRIBIO);
+
 	zv = arg;
 	for (;;) {
 		mtx_lock(&zv->zv_queue_mtx);
Index: sys/cddl/compat/opensolaris/sys/proc.h
===================================================================
--- sys/cddl/compat/opensolaris/sys/proc.h	(revision 187)
+++ sys/cddl/compat/opensolaris/sys/proc.h	(working copy)
@@ -35,12 +35,14 @@
 #include
 #include
 #include
+#include
+#include
 
 #ifdef _KERNEL
 
 #define	CPU		curcpu
-#define	minclsyspri	0
-#define	maxclsyspri	0
+#define	minclsyspri	(PRIBIO + 2)
+#define	maxclsyspri	(PRIBIO + 1)
 #define	max_ncpus	mp_ncpus
 #define	boot_max_ncpus	mp_ncpus
 
@@ -60,6 +62,7 @@
 {
 	proc_t *p;
 	int error;
+	kthread_t *t;

 	/*
 	 * Be sure there are no surprises.
 	 */
@@ -70,7 +73,13 @@
 
 	error = kproc_create(proc, arg, &p, 0, stksize / PAGE_SIZE,
 	    "solthread %p", proc);
-	return (error == 0 ? FIRST_THREAD_IN_PROC(p) : NULL);
+	if (error == 0) {
+		t = FIRST_THREAD_IN_PROC(p);
+		sched_prio(t, pri);
+		return t;
+	} else {
+		return NULL;
+	}
 }
 
 #define	thread_exit()	kproc_exit(0)