From owner-freebsd-stable@FreeBSD.ORG Wed Sep 1 15:19:34 2010
Date: Wed, 1 Sep 2010 16:19:31 +0100
From: Tim Bishop
To: Dan Nelson
Cc: freebsd-stable@freebsd.org
Subject: Re: 8.1R ZFS almost locking up system
Message-ID: <20100901151931.GB9224@carrick-users.bishnet.net>
In-Reply-To: <20100831155829.GC5913@dan.emsphone.com>
References: <20100821220435.GA6208@carrick-users.bishnet.net>
 <20100821222429.GB73221@dan.emsphone.com>
 <20100831133556.GB45316@carrick-users.bishnet.net>
 <20100831155829.GC5913@dan.emsphone.com>
User-Agent: Mutt/1.5.20 (2009-06-14)

On Tue, Aug 31, 2010 at 10:58:29AM -0500, Dan Nelson wrote:
> In the last episode (Aug 31), Tim Bishop said:
> > On Sat, Aug 21, 2010 at 05:24:29PM
-0500, Dan Nelson wrote:
> > > In the last episode (Aug 21), Tim Bishop said:
> > > > A few items from top, including zfskern:
> > > >
> > > >   PID USERNAME THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> > > >     5 root       4  -8    -     0K    60K zio->i 0  54:38  3.47% zfskern
> > > > 91775 70         1  44    0 53040K 31144K tx->tx 1   2:11  0.00% postgres
> > > > 39661 tdb        1  44    0 55776K 32968K tx->tx 0   0:39  0.00% mutt
> > > > 14828 root       1  47    0 14636K  1572K tx->tx 1   0:03  0.00% zfs
> > > > 11188 root       1  51    0 14636K  1572K tx->tx 0   0:03  0.00% zfs
> > > >
> > > > At some point during this process my zfs snapshots have been failing to
> > > > complete:
> > > >
> > > > root     5  0.8  0.0     0    60  ??  DL    7Aug10  54:43.83 [zfskern]
> > > > root  8265  0.0  0.0 14636  1528  ??  D    10:00AM   0:03.12 zfs snapshot -r pool0@2010-08-21_10:00:01--1d
> > > > root 11188  0.0  0.1 14636  1572  ??  D    11:00AM   0:02.93 zfs snapshot -r pool0@2010-08-21_11:00:01--1d
> > > > root 14828  0.0  0.1 14636  1572  ??  D    12:00PM   0:03.04 zfs snapshot -r pool0@2010-08-21_12:00:00--1d
> > > > root 17862  0.0  0.1 14636  1572  ??  D     1:00PM   0:01.96 zfs snapshot -r pool0@2010-08-21_13:00:01--1d
> > > > root 20986  0.0  0.1 14636  1572  ??  D     2:00PM   0:02.07 zfs snapshot -r pool0@2010-08-21_14:00:01--1d
> > >
> > > procstat -k on some of these processes might help to pinpoint what part of
> > > the zfs code they're all waiting in.
> >
> > It happened again this Saturday (clearly something in the weekly
> > periodic run is triggering the issue).
> > procstat -kk shows the following
> > for processes doing something zfs related (where zfs related means the
> > string 'zfs' in the procstat -kk output):
> >
> > 0 100084 kernel zfs_vn_rele_task mi_switch+0x16f sleepq_wait+0x42 _sleep+0x31c taskqueue_thread_loop+0xb7 fork_exit+0x118 fork_trampoline+0xe
> > 5 100031 zfskern arc_reclaim_thre mi_switch+0x16f sleepq_timedwait+0x42 _cv_timedwait+0x129 arc_reclaim_thread+0x2d1 fork_exit+0x118 fork_trampoline+0xe
> > 5 100032 zfskern l2arc_feed_threa mi_switch+0x16f sleepq_timedwait+0x42 _cv_timedwait+0x129 l2arc_feed_thread+0x1be fork_exit+0x118 fork_trampoline+0xe
> > 5 100085 zfskern txg_thread_enter mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_thread_wait+0x79 txg_quiesce_thread+0xb5 fork_exit+0x118 fork_trampoline+0xe
> > 5 100086 zfskern txg_thread_enter mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 zio_wait+0x61 dsl_pool_sync+0xea spa_sync+0x355 txg_sync_thread+0x195 fork_exit+0x118 fork_trampoline+0xe
> > 17 100040 syncer - mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_synced+0x7c zil_commit+0x416 zfs_sync+0xa6 sync_fsync+0x184 sync_vnode+0x16b sched_sync+0x1c9 fork_exit+0x118 fork_trampoline+0xe
> > 2210 100156 syslogd - mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 zfs_freebsd_write+0x378 VOP_WRITE_APV+0xb2 vn_write+0x2d7 dofilewrite+0x85 kern_writev+0x60 writev+0x41 syscall+0x1e7 Xfast_syscall+0xe1
> > 3500 100177 syslogd - mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 zfs_freebsd_write+0x378 VOP_WRITE_APV+0xb2 vn_write+0x2d7 dofilewrite+0x85 kern_writev+0x60 writev+0x41 syscall+0x1e7 Xfast_syscall+0xe1
> > 3783 100056 syslogd - mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 zfs_freebsd_write+0x378 VOP_WRITE_APV+0xb2 vn_write+0x2d7 dofilewrite+0x85 kern_writev+0x60 writev+0x41 syscall+0x1e7 Xfast_syscall+0xe1
> > 4064 100165 mysqld initial thread mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23 closef+0x3b kern_close+0x14d syscall+0x1e7 Xfast_syscall+0xe1
> > 4441 100224 python2.6 initial thread mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc null_reclaim+0xbc vgonel+0x12e vrecycle+0x7d null_inactive+0x1f vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23
> > 4444 100227 python2.6 initial thread mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc null_reclaim+0xbc vgonel+0x12e vrecycle+0x7d null_inactive+0x1f vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23
> > 4445 100228 python2.6 initial thread mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc null_reclaim+0xbc vgonel+0x12e vrecycle+0x7d null_inactive+0x1f vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23
> > 4446 100229 python2.6 initial thread mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc null_reclaim+0xbc vgonel+0x12e vrecycle+0x7d null_inactive+0x1f vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23
> > 4447 100089 python2.6 initial thread mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc null_reclaim+0xbc vgonel+0x12e vrecycle+0x7d null_inactive+0x1f vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23
> > 5352 100270 mutt - mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_synced+0x7c zil_commit+0x416 zfs_freebsd_fsync+0xd7 null_bypass+0xd3 fsync+0x161 syscall+0x1e7 Xfast_syscall+0xe1
> > 52686 100200 tarsnap - mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23 closef+0x3b kern_close+0x14d syscall+0x1e7 Xfast_syscall+0xe1
> > 59049 100207 webalizer initial thread mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 zfs_freebsd_write+0x378 VOP_WRITE_APV+0xb2 null_bypass+0xd3 VOP_WRITE_APV+0x141 vn_write+0x2d7 dofilewrite+0x85 kern_pwritev+0x63 pwrite+0x59 syscall+0x1e7 Xfast_syscall+0xe1
> > 77573 100479 perl - mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 zfs_freebsd_write+0x378 VOP_WRITE_APV+0xb2 null_bypass+0xd3 VOP_WRITE_APV+0x141 vn_write+0x2d7 dofilewrite+0x85 kern_writev+0x60 write+0x55 syscall+0x1e7 Xfast_syscall+0xe1
> > 78595 100275 zfs - mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_synced+0x7c dsl_sync_task_group_wait+0x11c dmu_objset_snapshot+0x1b8 zfs_ioc_snapshot+0x7c zfsdev_ioctl+0x8d devfs_ioctl_f+0x77 kern_ioctl+0xf6 ioctl+0xfd syscall+0x1e7 Xfast_syscall+0xe1
> > 81989 100596 zfs - mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_synced+0x7c dsl_sync_task_group_wait+0x11c dmu_objset_snapshot+0x1b8 zfs_ioc_snapshot+0x7c zfsdev_ioctl+0x8d devfs_ioctl_f+0x77 kern_ioctl+0xf6 ioctl+0xfd syscall+0x1e7 Xfast_syscall+0xe1
> >
> > I'm not sure if this shows anything useful?
>
> All your userland processes are basically waiting for the kernel to finish
> writing a ZFS transaction group to disk.  mutt has called fsync, which may
> have been the trigger.  Usually writing a transaction group is fast, though,
> because ZFS will batch up all the new data into one contiguous block and
> write it at full speed to disk.  That's why I asked about full filesystems
> before, since if your FS has been near 99%, you may not have any large runs
> of freespace left.

Right.
But I wouldn't have thought that'd be effectively terminal? It's not just a
bit slow - the machine freezes up, sometimes for many hours until rebooted.

> I noticed in your original post:
>
>                capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> pool0        117G  16.7G    248    114   865K   269K
>   mirror     117G  16.7G    248    114   865K   269K
>     ad4s3       -      -     43     56  2.47M   269K
>     ad6s3       -      -     39     56  2.41M   269K
> ----------  -----  -----  -----  -----  -----  -----

I did a scrub the other day and I noticed this same pattern (more read
bandwidth reported on the individual disks than on the pool).

> # gstat
> ...
> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>    1     48     48   3042    9.8      0      0    0.0   47.6| ad4
>    0     38     38   2406   10.5      0      0    0.0   39.5| ad6
>
> You have a pair of mirrored disks, each doing around 40% I/O load, which is
> 80% load if a single-threaded task is driving all the I/O.  I see the syncer
> process is also trying to write to the ZIL.  Are you running something that
> does a lot of fsync calls (a database server for example)?  Is this system
> an NFS server maybe?  Try setting the sysctl vfs.zfs.zil_disable=1 and see
> if your performance improves.

I am running both MySQL and PostgreSQL in jails, but both are extremely
lightly loaded. No NFS.

I've looked at disabling the ZIL, but it doesn't seem to be a
recommended thing to do?

I've also just upgraded to 8-STABLE to see if the few ZFS updates in
there make any difference.

Tim.

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984
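
For reading a procstat -kk dump like the one quoted in this message, a
small Python sketch can tally which ZFS wait function each thread is
parked in. The helper names (wait_frame, summarize_waits) and the
WAIT_FRAMES list are assumptions made for illustration, taken only from
the frames visible in this thread; the list is not an exhaustive set of
ZFS sleep points.

```python
from collections import Counter

# Wait functions seen in the quoted stacks. This tuple is an assumption for
# illustration only; real systems may block in other ZFS functions too.
WAIT_FRAMES = ("txg_wait_open", "txg_wait_synced", "zio_wait")


def wait_frame(line):
    """Return the first known wait function named in one procstat -kk line.

    Returns None when the line contains none of the WAIT_FRAMES symbols.
    """
    for token in line.split():
        # Frames look like "txg_wait_open+0x85"; drop the "+offset" part.
        name = token.split("+", 1)[0]
        if name in WAIT_FRAMES:
            return name
    return None


def summarize_waits(lines):
    """Count threads per wait function across procstat -kk output lines."""
    counts = Counter()
    for line in lines:
        frame = wait_frame(line)
        if frame is not None:
            counts[frame] += 1
    return counts
```

Passing the saved procstat lines (with any "> " quote prefixes left in
place, since they are just extra tokens) to summarize_waits() should show
most threads in txg_wait_open, a few in txg_wait_synced, and the sync
thread in zio_wait, which fits the reading that everything is queued
behind one transaction group being written out.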