Date: Mon, 11 Oct 2010 11:37:07 -0700
From: Jeremy Chadwick
To: Andriy Gapon
Cc: freebsd-fs@freebsd.org
Subject: Re: Locked up processes after upgrade to ZFS v15
Message-ID: <20101011183707.GA13925@icarus.home.lan>
In-Reply-To: <4CB32C75.2060000@icyb.net.ua>

On Mon, Oct 11, 2010 at 06:25:41PM +0300, Andriy Gapon wrote:
> on 11/10/2010 18:15 Jeremy Chadwick said the following:
> > On Sun, Oct 10, 2010 at 11:29:55AM +0300, Andriy Gapon wrote:
> >> on 09/10/2010 16:37 Kai Gallasch said the following:
> >>> I must repeat. I offer my help if someone wants to dig into the locking problem.
> >>
> >> I would like to look into this.
> >> Can you provide shell access to a system that exhibits the behavior?
> >> Or even better serial console for remote debugging, if possible.
> >>
> >> Also, can you first try with the very latest stable/8 or even head?
> >> I have recently MFC-ed a few improvements to ZFS code.
> >
> > Andriy,
> >
> > If you need or want a secondary test box (with serial console access and
> > kernel debugger, just not remote gdb/kgdb), I have one available which I
> > can build (would take me only part of the morning) mimicking production.
> > Let me know if you want/need a secondary test bed.
>
> Jeremy,
>
> we are in a process of debugging this issue with Kai and I think that we are
> onto something. I would like to ask you for some additional testing.

No problem.  I'm in the process of setting up my testbed box now, and
will be putting RELENG_8 on it + all the ports/configuration details
that we use in production.

Now to talk about the system which was seeing the zfs/zfsmrb problem...

That system is production and has to remain usable/up.  That machine has
had ZFS removed from it entirely and now uses gmirror.  Sorry, just one
of those things where service uptime has priority.

> What kind of workload do you run?
> Do you have anything using sendfile(2)? E.g. Apache with EnableSendfile enabled
> (it might be by default, without explicit options).

The aforementioned system is a multi-role box:

- Primary front-end webserver
  -- Apache 2.2.16 with ITK MPM (prefork)
  -- Apache is built with threading disabled
  -- PHP 5.3.3 is used
- Primary DNS server (master, not slave)
  -- Using base system BIND, nothing crazy in the configuration
- Primary mail server
  -- Using postfix
- Primary shell and FTP server
  -- Using base system OpenSSH and base system ftpd

Hardware-wise, the system is:

- SYS: Supermicro SuperServer 5015M-T+B
       http://www.supermicro.com/products/system/1U/5015/SYS-5015M-T_.cfm
- CPU: Intel Core 2 Duo E6420 (2.13GHz, 4MB cache, 1066FSB)
- RAM: 8GB ECC (two 2GB pairs)
- Disk ada0: 320GB, SATA300: OS disk and partial ZFS disk (mirror)
- Disk ada1: 250GB, SATA300: ZFS disk (mirror)

Filesystem layout:

ada0s1a =   1GB = UFS2    = /
ada0s1b =  16GB = swap
ada0s1d =  16GB = UFS2+SU = /var
ada0s1e =   4GB = UFS2+SU = /tmp
ada0s1f =   8GB = UFS2+SU = /usr
ada0s1g = 275GB = ZFS pool (mirror)
ada1    = 320GB = ZFS pool (mirror)

The ZFS mirror was therefore ~275GB in size (since ada0s1g was the
smaller of the two devices in the pool).

Only two ZFS filesystems were created: /home and /var/mail.  All
filesystem settings were default (no compression, etc.).

The original ZFS pool and filesystems were created on ZFS v14 and
upgraded using "zpool upgrade" and "zfs upgrade" after a fresh RELENG_8
OS was installed.  The OS was completely reinstalled (not upgraded in
place), having previously run RELENG_7 (uptime: 221 days).

I'd say 80% of the I/O happening on the box was induced by either httpd
or postfix/SpamAssassin.

Regarding Apache: we do use sendfile.  Here are the two settings in our
httpd.conf that may be relevant to ZFS:

EnableMMAP on
EnableSendfile on

> Can you try to disable sendfile(2) use, reboot and see how system behaves?
> If you still experience the same problem after doing the above, then I'd like to
> ask you for shell access with root privileges; or establishing communication via
> IM and running some commands for me.

Well, the system is using gmirror + UFS2 (without SU) now, so I can't
disable sendfile to see how it behaves.  I'll let you do that on the
testbed box once I get it up + configured.  You'll have root-level
access (both via serial console and SSH) to it, and you can reboot it as
you please.  If the testbed box explodes or bursts into flame, no biggie
-- it's there for you to bang on.  :-)

The testbed box does have significantly different hardware and much less
RAM (only 1GB), so it's not identical in capability.  However, given the
nature of the problem, I would think reproducing it would be easy,
especially since Kai's box is significantly different from mine and we
both saw the same problem.

I'll let you know here in the thread when things are available, and then
we can communicate privately (get your SSH public key, etc.).

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |