From: Scott Lambert
To: FreeBSD-STABLE
Date: Fri, 20 Mar 2009 14:41:57 -0500
Subject: Is some combination of gmirror, md file systems, snapshots and, maybe, quotas considered harmful?
Message-ID: <20090320194157.GB80292@sysmon.tcworks.net>

I have a previously stable machine, other than a one-time panic in
soft-updates which I could never reproduce, running RELENG_7 from
July 23, 2008.

Starting update: Wed Jul 23 01:29:47 CDT 2008
Finished update: Wed Jul 23 01:31:13 CDT 2008

I had the userquota option in the fstab for /home, but I did not yet
have anything in /etc/rc.conf to enable quotas. I have been running an
unmodified GENERIC kernel config.

/dev/mirror/gm0s1g on /home (ufs, local, soft-updates)

It runs a few jails, using ezjail. Two of them were image-based jails,
1GB and 2GB. There is also one jail that is not image based. The jails
live in /home/ezjails. I added another image-based jail, with a 3GB
image, on March 12th.

I added this machine to our AMANDA setup on March 13, 2009. Things
seemed to be okay until the 19th.

On the 19th, during the dump of /home, things gradually started to
hang. Nagios paged me about services not responding. I did not find
any explanation for it. The disks were idle according to systat -vm.
I was able to grep the log files on /var for a while, and then I could
no longer do anything with it. I eventually had to go to the office and
power cycle it. I tried C-A-D first, but shutdown timed out after 30
seconds.

Just to make sure it wasn't something that had since been fixed, I
updated to RELENG_7 as of Mar 19th.

Starting update: Thu Mar 19 03:40:41 CDT 2009
Finished update: Thu Mar 19 03:48:45 CDT 2009

I rebooted to the new kernel and installed the world just after
midnight on the 20th.

I started getting paged by Nagios again at 3:40am. I noticed that
mksnap_ffs was running on /home, CPU time used: 0:00.77, as things
began to circle the drain. That was about 30 minutes after the dump
attempt had been started by AMANDA. There were many processes waiting
in state D. This time I did a reboot -n -q and the box rebooted, but it
was still fscking when I got to the office.

# ls -l /home/.snap
-r--------  1 root  operator  117285093376 Mar 20 03:18 dump_snapshot

# df /home
Filesystem          Size  Used  Avail  Capacity  Mounted on
/dev/mirror/gm0s1g  106G   11G    86G       11%  /home

I removed userquota from the fstab entry for /home and rebooted, just
to be sure.

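For anyone trying to catch the same hang, here is a minimal sketch of how the processes stuck in state D could be picked out of ps output. It assumes three-column text of the form produced by something like "ps -axo pid,stat,command". The function name and the sample rows are invented for illustration and are not taken from the affected machine.

```python
def uninterruptible(ps_text):
    """Return (pid, stat, command) for rows whose STAT starts with 'D'.

    Assumes a header line followed by rows of: PID STAT COMMAND.
    """
    rows = []
    for line in ps_text.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)        # command may contain spaces
        if len(parts) == 3 and parts[1].startswith("D"):
            rows.append((int(parts[0]), parts[1], parts[2]))
    return rows


# Hypothetical sample, made up to show the input shape only.
SAMPLE = """\
  PID STAT COMMAND
  701 Ss   syslogd
 1402 D    mksnap_ffs
 1410 DL   dump
"""

for pid, stat, command in uninterruptible(SAMPLE):
    print(pid, stat, command)
```

Logging this every minute or so to a file system other than the one being dumped might show which process blocks first, provided the log stays writable during the hang.
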
The last dangerous combination I remember for snapshots involved
quotas. Am I even in the danger zone for quotas without having them
compiled into the kernel?

It looks like removing the .snap directory should be enough to prevent
any future snapshots during the backup process. Does that sound like a
reasonable workaround? It would at least remove one variable from the
troubleshooting process.

Any other suggestions?

Thank you for any help you may be able to provide,

-- 
Scott Lambert                    KC5MLE                    Unix SysAdmin
lambert@lambertfam.org