From owner-freebsd-stable@FreeBSD.ORG  Thu Sep  2 14:57:02 2010
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E424C1065694
	for <freebsd-stable@freebsd.org>; Thu,  2 Sep 2010 14:57:02 +0000 (UTC)
	(envelope-from dan@dan.emsphone.com)
Received: from email1.allantgroup.com (email1.emsphone.com [199.67.51.115])
	by mx1.freebsd.org (Postfix) with ESMTP id 94AE68FC16
	for <freebsd-stable@freebsd.org>; Thu,  2 Sep 2010 14:57:02 +0000 (UTC)
Received: from dan.emsphone.com (dan.emsphone.com [199.67.51.101])
	by email1.allantgroup.com (8.14.0/8.14.0) with ESMTP id o82Ev0N6014279
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO)
	for <freebsd-stable@freebsd.org>; Thu, 2 Sep 2010 09:57:00 -0500 (CDT)
	(envelope-from dan@dan.emsphone.com)
Received: from dan.emsphone.com (smmsp@localhost [127.0.0.1])
	by dan.emsphone.com (8.14.4/8.14.4) with ESMTP id o82Ev0Xm032509
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO)
	for <freebsd-stable@freebsd.org>; Thu, 2 Sep 2010 09:57:00 -0500 (CDT)
	(envelope-from dan@dan.emsphone.com)
Received: (from dan@localhost)
	by dan.emsphone.com (8.14.4/8.14.3/Submit) id o82Ev0ht032438;
	Thu, 2 Sep 2010 09:57:00 -0500 (CDT) (envelope-from dan)
Date: Thu, 2 Sep 2010 09:57:00 -0500
From: Dan Nelson <dnelson@allantgroup.com>
To: Tim Bishop <tim@bishnet.net>
Message-ID: <20100902145700.GG5913@dan.emsphone.com>
References: <20100821220435.GA6208@carrick-users.bishnet.net>
	<20100821222429.GB73221@dan.emsphone.com>
	<20100831133556.GB45316@carrick-users.bishnet.net>
	<20100831155829.GC5913@dan.emsphone.com>
	<20100901151931.GB9224@carrick-users.bishnet.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100901151931.GB9224@carrick-users.bishnet.net>
X-OS: FreeBSD 8.1-STABLE
User-Agent: Mutt/1.5.20 (2009-06-14)
X-Virus-Scanned: clamav-milter 0.96 at email1.allantgroup.com
X-Virus-Status: Clean
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-2.0.2
	(email1.allantgroup.com [199.67.51.78]);
	Thu, 02 Sep 2010 09:57:01 -0500 (CDT)
Cc: freebsd-stable@freebsd.org
Subject: Re: 8.1R ZFS almost locking up system
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 02 Sep 2010 14:57:03 -0000

In the last episode (Sep 01), Tim Bishop said:
> On Tue, Aug 31, 2010 at 10:58:29AM -0500, Dan Nelson wrote:
> > In the last episode (Aug 31), Tim Bishop said:
> > > It happened again this Saturday (clearly something in the weekly
> > > periodic run is triggering the issue).  procstat -kk shows the
> > > following for processes doing something zfs related (where zfs related
> > > means the string 'zfs' in the procstat -kk output):
> > > 
> > >     0 100084 kernel           zfs_vn_rele_task mi_switch+0x16f sleepq_wait+0x42 _sleep+0x31c taskqueue_thread_loop+0xb7 fork_exit+0x118 fork_trampoline+0xe 
> > >     5 100031 zfskern          arc_reclaim_thre mi_switch+0x16f sleepq_timedwait+0x42 _cv_timedwait+0x129 arc_reclaim_thread+0x2d1 fork_exit+0x118 fork_trampoline+0xe 
> > >     5 100032 zfskern          l2arc_feed_threa mi_switch+0x16f sleepq_timedwait+0x42 _cv_timedwait+0x129 l2arc_feed_thread+0x1be fork_exit+0x118 fork_trampoline+0xe 
> > >     5 100085 zfskern          txg_thread_enter mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_thread_wait+0x79 txg_quiesce_thread+0xb5 fork_exit+0x118 fork_trampoline+0xe 
> > >     5 100086 zfskern          txg_thread_enter mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 zio_wait+0x61 dsl_pool_sync+0xea spa_sync+0x355 txg_sync_thread+0x195 fork_exit+0x118 fork_trampoline+0xe 
> > >    17 100040 syncer           -                mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_synced+0x7c zil_commit+0x416 zfs_sync+0xa6 sync_fsync+0x184 sync_vnode+0x16b sched_sync+0x1c9 fork_exit+0x118 fork_trampoline+0xe 
> > >  2210 100156 syslogd          -                mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 zfs_freebsd_write+0x378 VOP_WRITE_APV+0xb2 vn_write+0x2d7 dofilewrite+0x85 kern_writev+0x60 writev+0x41 syscall+0x1e7 Xfast_syscall+0xe1 
> > >  3500 100177 syslogd          -                mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 zfs_freebsd_write+0x378 VOP_WRITE_APV+0xb2 vn_write+0x2d7 dofilewrite+0x85 kern_writev+0x60 writev+0x41 syscall+0x1e7 Xfast_syscall+0xe1 
> > >  3783 100056 syslogd          -                mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 zfs_freebsd_write+0x378 VOP_WRITE_APV+0xb2 vn_write+0x2d7 dofilewrite+0x85 kern_writev+0x60 writev+0x41 syscall+0x1e7 Xfast_syscall+0xe1 
> > >  4064 100165 mysqld           initial thread   mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23 closef+0x3b kern_close+0x14d syscall+0x1e7 Xfast_syscall+0xe1 
> > >  4441 100224 python2.6        initial thread   mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc null_reclaim+0xbc vgonel+0x12e vrecycle+0x7d null_inactive+0x1f vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23 
> > >  4444 100227 python2.6        initial thread   mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc null_reclaim+0xbc vgonel+0x12e vrecycle+0x7d null_inactive+0x1f vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23 
> > >  4445 100228 python2.6        initial thread   mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc null_reclaim+0xbc vgonel+0x12e vrecycle+0x7d null_inactive+0x1f vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23 
> > >  4446 100229 python2.6        initial thread   mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc null_reclaim+0xbc vgonel+0x12e vrecycle+0x7d null_inactive+0x1f vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23 
> > >  4447 100089 python2.6        initial thread   mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc null_reclaim+0xbc vgonel+0x12e vrecycle+0x7d null_inactive+0x1f vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23 
> > >  5352 100270 mutt             -                mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_synced+0x7c zil_commit+0x416 zfs_freebsd_fsync+0xd7 null_bypass+0xd3 fsync+0x161 syscall+0x1e7 Xfast_syscall+0xe1 
> > > 52686 100200 tarsnap          -                mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 dmu_tx_assign+0x16c zfs_inactive+0xd9 zfs_freebsd_inactive+0x1a vinactive+0x6a vputx+0x1cc vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23 closef+0x3b kern_close+0x14d syscall+0x1e7 Xfast_syscall+0xe1 
> > > 59049 100207 webalizer        initial thread   mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 zfs_freebsd_write+0x378 VOP_WRITE_APV+0xb2 null_bypass+0xd3 VOP_WRITE_APV+0x141 vn_write+0x2d7 dofilewrite+0x85 kern_pwritev+0x63 pwrite+0x59 syscall+0x1e7 Xfast_syscall+0xe1 
> > > 77573 100479 perl             -                mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_open+0x85 zfs_freebsd_write+0x378 VOP_WRITE_APV+0xb2 null_bypass+0xd3 VOP_WRITE_APV+0x141 vn_write+0x2d7 dofilewrite+0x85 kern_writev+0x60 write+0x55 syscall+0x1e7 Xfast_syscall+0xe1 
> > > 78595 100275 zfs              -                mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_synced+0x7c dsl_sync_task_group_wait+0x11c dmu_objset_snapshot+0x1b8 zfs_ioc_snapshot+0x7c zfsdev_ioctl+0x8d devfs_ioctl_f+0x77 kern_ioctl+0xf6 ioctl+0xfd syscall+0x1e7 Xfast_syscall+0xe1 
> > > 81989 100596 zfs              -                mi_switch+0x16f sleepq_wait+0x42 _cv_wait+0x111 txg_wait_synced+0x7c dsl_sync_task_group_wait+0x11c dmu_objset_snapshot+0x1b8 zfs_ioc_snapshot+0x7c zfsdev_ioctl+0x8d devfs_ioctl_f+0x77 kern_ioctl+0xf6 ioctl+0xfd syscall+0x1e7 Xfast_syscall+0xe1 
> > > 
> > > I'm not sure if this shows anything useful?
> > 
> > All your userland processes are basically waiting for the kernel to
> > finish writing a ZFS transaction group to disk.  mutt has called fsync,
> > which may have been the trigger.  Usually writing a transaction group is
> > fast, though, because ZFS will batch up all the new data into one
> > contiguous block and write it at full speed to disk.  That's why I asked
> > about full filesystems before, since if your FS has been near 99%, you
> > may not have any large runs of freespace left.
> 
> Right. But I wouldn't have thought that'd be effectively terminal? It's
> not just a bit slow - the machine freezes up, sometimes for many hours
> until rebooted.
> 
> > I noticed in your original post:
> > 
> >                capacity     operations    bandwidth
> > pool         used  avail   read  write   read  write
> > ----------  -----  -----  -----  -----  -----  -----
> > pool0        117G  16.7G    248    114   865K   269K
> >   mirror     117G  16.7G    248    114   865K   269K
> >     ad4s3       -      -     43     56  2.47M   269K
> >     ad6s3       -      -     39     56  2.41M   269K
> > ----------  -----  -----  -----  -----  -----  -----
> 
> I did a scrub the other day and I noticed this same pattern (reads
> happening more on the disks and the pool).

A scrub works down the ZFS tree, and due to COW the upper levels of metadata
are likely to be scattered across the disk in small chunks, so the first 10%
or so of the scrub will be tiny random reads.  Once it gets down to actually
checking file data it should max out all your disks throughput.
 
> > # gstat
> > ...
> > L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
> >     1     48     48   3042    9.8      0      0    0.0   47.6| ad4
> >     0     38     38   2406   10.5      0      0    0.0   39.5| ad6
> > 
> > You have a pair of mirrored disks, each doing around 40% I/O load, which
> > is 80% load if a single-threaded task is driving all the I/O.  I see the
> > syncer process is also trying to write to the ZIL.  Are you running
> > something that does a lot of fsync calls (a database server for
> > example)?  Is this system an NFS server maybe?  Try setting the sysctl
> > vfs.zfs.zil_disable=1 and see if your performance improves.
> 
> I am running both MySQL and PostgreSQL in jails, but both are extremely
> lightly loaded. No NFS.
> 
> I've looked at disabling the ZIL, but it doesn't seem to be a
> recommended thing to do?

This is just for debugging, to see whether ZIL is your problem.  ZIL logs
synchronous actions (fsync() and writes to O_SYNC files) as they happen, and
is always replayed on reboot (as opposed to the rest of the txg which gets
rolled back).  

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29
 
> I've also just upgraded to 8-STABLE to see if the few ZFS updates in
> there make any difference.

-- 
	Dan Nelson
	dnelson@allantgroup.com