From owner-freebsd-fs@FreeBSD.ORG Sat Oct 9 14:34:42 2010
Return-Path:
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 86AED1065675
	for ; Sat, 9 Oct 2010 14:34:42 +0000 (UTC)
	(envelope-from jdc@koitsu.dyndns.org)
Received: from qmta09.westchester.pa.mail.comcast.net
	(qmta09.westchester.pa.mail.comcast.net [76.96.62.96])
	by mx1.freebsd.org (Postfix) with ESMTP id 340148FC1D
	for ; Sat, 9 Oct 2010 14:34:41 +0000 (UTC)
Received: from omta01.westchester.pa.mail.comcast.net ([76.96.62.11])
	by qmta09.westchester.pa.mail.comcast.net with comcast
	id GdLe1f0040EZKEL59eaiqW; Sat, 09 Oct 2010 14:34:42 +0000
Received: from koitsu.dyndns.org ([98.248.41.155])
	by omta01.westchester.pa.mail.comcast.net with comcast
	id Geah1f0043LrwQ23MeahPC; Sat, 09 Oct 2010 14:34:42 +0000
Received: by icarus.home.lan (Postfix, from userid 1000)
	id D52B69B418; Sat, 9 Oct 2010 07:34:39 -0700 (PDT)
Date: Sat, 9 Oct 2010 07:34:39 -0700
From: Jeremy Chadwick
To: Kai Gallasch
Message-ID: <20101009143439.GA63604@icarus.home.lan>
References: <39F05641-4E46-4BE0-81CA-4DEB175A5FBE@free.de>
	<20101009111241.GA58948@icarus.home.lan>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20101009111241.GA58948@icarus.home.lan>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-fs@freebsd.org
Subject: Re: Locked up processes after upgrade to ZFS v15
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sat, 09 Oct 2010 14:34:42 -0000

On Sat, Oct 09, 2010 at 04:12:42AM -0700, Jeremy Chadwick wrote:
> [...snipping for brevity...]
>
> We're seeing (what we believe) to be this exact situation.

Because of Kai's follow-up, I've decided to move our RELENG_8 system to
using gmirror(8) instead of ZFS.
The system has only been up for 4.5 hours. Upon shutting down httpd, I
checked ps/top to see what things looked like. There are tons of httpd
processes (now with ppid of init, indicating they're to be reaped but
can't be) in "zfs" and "zfsmrb" states, and these processes are
hung/stuck.

So yes, the behaviour I'm seeing is identical to Kai's, and the problem
is easily reproducible. I have much less RAM than he does (8GB) as
well, so it isn't a RAM thing.

Below is some simple debugging information for whoever wants to take
this on. If someone wants to figure this out, I'll need time to build a
test/debug system that mimics our production system.

The UIDs differ (80, 1014, 1004, etc.) because of the Apache ITK MPM
that we use. I doubt that has anything to do with the problem.

horus# /usr/local/etc/rc.d/apache22 stop
Stopping apache22.
Waiting for PIDS: 1192.
horus# ps -auxw | grep 1192 | grep -v grep
horus# ps -axlH | grep httpd
 1014  1243     1   0  44  0 97944 15068 zfs    D     ??    0:00.00 /usr/local/sbin/httpd
 1014  1462     1   0  44  0 98008 15280 zfsmrb DL    ??    0:00.00 /usr/local/sbin/httpd
 1014  1585     1   0  44  0 97944 15068 zfs    D     ??    0:00.00 /usr/local/sbin/httpd
 1014  1633     1   0  44  0 97944 15068 zfs    D     ??    0:00.00 /usr/local/sbin/httpd
 1004  1805     1   0  44  0 97948 15236 zfsmrb DL    ??    0:00.00 /usr/local/sbin/httpd
 1014  1998     1   0  44  0 97944 15068 zfs    D     ??    0:00.00 /usr/local/sbin/httpd
 1014  2038     1   0  44  0 97944 15064 zfs    D     ??    0:00.00 /usr/local/sbin/httpd
 1014  2077     1   0  44  0 97944 15068 zfs    D     ??    0:00.00 /usr/local/sbin/httpd
   80  3186     1   0  45  0 97984 15432 zfsmrb DL    ??    0:00.01 /usr/local/sbin/httpd
   80  4806     1   0  44  0 97944 15128 zfs    D     ??    0:00.00 /usr/local/sbin/httpd
...

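For anyone triaging a longer listing than the excerpt above, a rough way
to tally the stuck processes by wait channel is sketched here. It is not
part of the original session. It assumes the 13-column "ps -axlH" layout
(UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND), with the
wait channel always present as the 9th column, and the function name
count_wait_channels is made up for illustration.

```python
from collections import Counter

# A few rows copied from the ps -axlH output above, used as sample input.
SAMPLE = """\
1014 1243 1 0 44 0 97944 15068 zfs D ?? 0:00.00 /usr/local/sbin/httpd
1014 1462 1 0 44 0 98008 15280 zfsmrb DL ?? 0:00.00 /usr/local/sbin/httpd
80 3186 1 0 45 0 97984 15432 zfsmrb DL ?? 0:00.01 /usr/local/sbin/httpd
80 4806 1 0 44 0 97944 15128 zfs D ?? 0:00.00 /usr/local/sbin/httpd
"""


def count_wait_channels(ps_output, command_substring="httpd"):
    """Tally matching processes by wait channel.

    Assumption: the wait channel (MWCHAN) is the 9th whitespace-separated
    column and the command starts at the 13th column.
    """
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip short lines and the header row (whose PID column is not numeric).
        if len(fields) < 13 or not fields[1].isdigit():
            continue
        # Keep only rows whose command matches the requested substring.
        if command_substring not in " ".join(fields[12:]):
            continue
        counts[fields[8]] += 1
    return counts


if __name__ == "__main__":
    for wchan, n in count_wait_channels(SAMPLE).most_common():
        print(wchan, n)
```

Feeding it the full "ps -axlH" output instead of SAMPLE gives a quick
count of how many httpd threads sit in each wait channel.
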
horus# procstat -k -k 3186
  PID    TID COMM             TDNAME           KSTACK
 3186 100443 httpd            -                mi_switch+0x176 sleepq_wait+0x3b _sleep+0x322 zfs_freebsd_read+0x26c vnode_pager_generic_getpages+0x454 vnode_pager_getpages+0x8e vm_fault+0xbd1 trap_pfault+0x111 trap+0x479 calltrap+0x8

horus# procstat -k -k 4806
  PID    TID COMM             TDNAME           KSTACK
 4806 100475 httpd            -                mi_switch+0x176 sleepq_wait+0x3b __lockmgr_args+0x642 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x44 vget+0x67 cache_lookup+0x4fd vfs_cache_lookup+0xad VOP_LOOKUP_APV+0x40 lookup+0x48a namei+0x518 vn_open_cred+0x390 kern_openat+0x165 syscall+0x1cd Xfast_syscall+0xe2

horus# procstat -k -k 2077
  PID    TID COMM             TDNAME           KSTACK
 2077 100339 httpd            -                mi_switch+0x176 sleepq_wait+0x3b __lockmgr_args+0x642 vop_stdlock+0x39 VOP_LOCK1_APV+0x46 _vn_lock+0x44 vget+0x67 cache_lookup+0x4fd vfs_cache_lookup+0xad VOP_LOOKUP_APV+0x40 lookup+0x48a namei+0x518 vn_open_cred+0x390 kern_openat+0x165 syscall+0x1cd Xfast_syscall+0xe2

horus# procstat -k -k 1462
  PID    TID COMM             TDNAME           KSTACK
 1462 100157 httpd            -                mi_switch+0x176 sleepq_wait+0x3b _sleep+0x322 zfs_freebsd_read+0x26c vnode_pager_generic_getpages+0x454 vnode_pager_getpages+0x8e vm_fault+0xbd1 trap_pfault+0x111 trap+0x479 calltrap+0x8

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |