From: Tim Bishop
To: freebsd-fs@freebsd.org
Date: Mon, 23 Apr 2012 15:38:10 +0100
Subject: Re: ZFS: processes hanging when trying to access filesystems

Here's a comparison of top output. This shows the higher context switching.
I'm not sure if this is part of the cause of the problems, or just an effect:

"top -Sj -m io"

last pid: 95277; load averages: 0.04, 0.11, 0.13   up 20+05:31:54  15:29:52
186 processes: 2 running, 182 sleeping, 1 stopped, 1 waiting
CPU: 4.1% user, 0.0% nice, 3.6% system, 0.0% interrupt, 92.3% idle
Mem: 412M Active, 488M Inact, 4685M Wired, 52M Cache, 551M Buf, 288M Free
Swap: 6144M Total, 316M Used, 5828M Free, 5% Inuse

  PID JID USERNAME  VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
   12   0 root       617      1      0      0      0      0   0.00% intr
   11   0 root       584   1212      0      0      0      0   0.00% idle
    0   0 root       322     46      0      0      0      0   0.00% kernel
    3   0 root       257      1      0      0      0      0   0.00% g_up
    4   0 root       175      3      0      0      0      0   0.00% g_down
   13   0 root        20      0      0      0      0      0   0.00% yarrow
    5   0 root        17      0      0     16      0     16  88.89% zfskern
  641   0 _pflogd      4      0      0      0      0      0   0.00% pflogd

last pid: 92079; load averages: 0.39, 0.22, 0.18   up 20+05:22:39  15:20:37
197 processes: 2 running, 192 sleeping, 1 stopped, 1 zombie, 1 waiting
CPU: 0.0% user, 0.0% nice, 5.3% system, 1.5% interrupt, 93.2% idle
Mem: 484M Active, 478M Inact, 4655M Wired, 52M Cache, 551M Buf, 257M Free
Swap: 6144M Total, 316M Used, 5828M Free, 5% Inuse

  PID JID USERNAME  VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
   11   0 root      3945   6837      0      0      0      0   0.00% idle
   12   0 root      2130      1      0      0      0      0   0.00% intr
    0   0 root      2008     99      0      0      0      0   0.00% kernel
    3   0 root      1810      0      0      0      0      0   0.00% g_up
    4   0 root      1486     12      0      0      0      0   0.00% g_down
   13   0 root        20      2      0      0      0      0   0.00% yarrow
    5   0 root        19      0      2     66      0     68  95.77% zfskern
   20   0 root         9      0      0      0      0      0   0.00% g_mirror r

The latter shows the machine when it's unresponsive and processes are
starting to hang.

Tim.

On Tue, Mar 27, 2012 at 07:14:57PM +0100, Tim Bishop wrote:
> I have a machine running 8-STABLE amd64 from the end of last week. I
> have a problem where the machine starts to freeze up. Any process
> accessing the ZFS filesystems hangs, which eventually causes more and
> more processes to be spawned (cron jobs etc. that never complete).
> Although the root filesystem is on UFS (the machine hosts jails on
> ZFS), eventually I can't log in anymore.
>
> The problem occurs when the frequently used part of the ARC gets too
> large. See this graph:
>
> http://dl.dropbox.com/u/318044/zfs_arc_utilization-day.png
>
> At the right of the graph things started to hang.
>
> At the same time I see a high amount of context switching.
>
> I picked a hanging process and procstat showed the following:
>
>   PID    TID COMM     TDNAME   KSTACK
> 24787 100303 mutt     -        mi_switch+0x176 sleepq_wait+0x42 _cv_wait+0x129 txg_wait_open+0x85 dmu_tx_assign+0x170 zfs_inactive+0xf1 zfs_freebsd_inactive+0x1a vinactive+0x71 vputx+0x2d8 null_reclaim+0xb3 vgonel+0x119 vrecycle+0x7b null_inactive+0x1f vinactive+0x71 vputx+0x2d8 vn_close+0xa1 vn_closefile+0x5a _fdrop+0x23
>
> I'm running a reduced number of jails on the machine at the moment,
> which slows down how quickly the machine freezes up completely. I'd
> like to debug this problem further, so any advice on useful
> information to collect would be appreciated.
>
> I've had this problem on the machine before[1], but adding more RAM
> alleviated the issue.
>
> Tim.
>
> [1] http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/058541.html

-- 
Tim Bishop
http://www.bishnet.net/tim/
PGP Key: 0x5AE7D984
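One piece of information worth collecting while the hang is building up is how many threads are parked in the same place as the mutt process quoted above (txg_wait_open, waiting for a new ZFS transaction group to open). The following is a minimal sketch of that check, assuming the procstat column layout shown in the quoted output (PID TID COMM TDNAME KSTACK, with no spaces inside COMM or TDNAME). The helper names threads_waiting_in and procstat_kstacks are hypothetical; "procstat -kk -a" is the FreeBSD invocation that prints kernel stacks, with function offsets, for all processes.

```python
import subprocess


def procstat_kstacks():
    """Run `procstat -kk -a` (FreeBSD only) and return its text output.

    Note: procstat -k skips threads currently running on a CPU, so this
    is a snapshot of sleeping/blocked threads only.
    """
    result = subprocess.run(
        ["procstat", "-kk", "-a"], capture_output=True, text=True, check=True
    )
    return result.stdout


def threads_waiting_in(procstat_output, frame):
    """Return (pid, tid, comm) for each thread whose kernel stack contains `frame`.

    Assumes the layout PID TID COMM TDNAME KSTACK, with COMM and TDNAME
    free of spaces and stack frames written as name+0xoffset.
    """
    hits = []
    for line in procstat_output.splitlines():
        fields = line.split()
        # Skip the header line and anything too short to carry a stack.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        pid, tid, comm = int(fields[0]), int(fields[1]), fields[2]
        # Strip the "+0x..." offsets so bare function names can be compared.
        names = {f.split("+", 1)[0] for f in fields[4:]}
        if frame in names:
            hits.append((pid, tid, comm))
    return hits
```

On the affected host, threads_waiting_in(procstat_kstacks(), "txg_wait_open") would list every captured thread blocked at that frame; running it a few times as the machine degrades would show whether the number of stuck threads grows alongside the context-switch counts in the top output.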