From owner-freebsd-stable@FreeBSD.ORG Sun Mar 22 09:31:57 2009
Return-Path:
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 519EF1065675
	for ; Sun, 22 Mar 2009 09:31:57 +0000 (UTC)
	(envelope-from lambert@lambertfam.org)
Received: from sysmon.tcworks.net (sysmon.tcworks.net [65.66.76.4])
	by mx1.freebsd.org (Postfix) with ESMTP id DAE5E8FC1F
	for ; Sun, 22 Mar 2009 09:31:56 +0000 (UTC)
	(envelope-from lambert@lambertfam.org)
Received: from sysmon.tcworks.net (localhost [127.0.0.1])
	by sysmon.tcworks.net (8.13.1/8.13.1) with ESMTP id n2M9VuN7013789
	for ; Sun, 22 Mar 2009 04:31:56 -0500 (CDT)
	(envelope-from lambert@lambertfam.org)
Received: (from lambert@localhost)
	by sysmon.tcworks.net (8.13.1/8.13.1/Submit) id n2M9VuZC013788
	for freebsd-stable@freebsd.org; Sun, 22 Mar 2009 04:31:56 -0500 (CDT)
	(envelope-from lambert@lambertfam.org)
X-Authentication-Warning: sysmon.tcworks.net: lambert set sender to
	lambert@lambertfam.org using -f
Date: Sun, 22 Mar 2009 04:31:56 -0500
From: Scott Lambert
To: FreeBSD-stable
Message-ID: <20090322093156.GE80292@sysmon.tcworks.net>
Mail-Followup-To: FreeBSD-stable
References: <20090320194157.GB80292@sysmon.tcworks.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20090320194157.GB80292@sysmon.tcworks.net>
User-Agent: Mutt/1.4.2.2i
Subject: Re: Is some combination of gmirror, md file systems, snapshots and,
	maybe, quotas considered harmful?
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 22 Mar 2009 09:31:57 -0000

On Fri, Mar 20, 2009 at 02:41:57PM -0500, Scott Lambert wrote:
> I have a previously stable machine, other than a one time panic in
> soft-updates which I could never reproduce, running RELENG_7 from July
> 23, 2008.
>
> Starting update: Wed Jul 23 01:29:47 CDT 2008
> Finished update: Wed Jul 23 01:31:13 CDT 2008
>
> I had the userquota option in the fstab for /home, but I did not yet
> have anything in /etc/rc.conf to enable them. I have been running an
> unmodified GENERIC kernel config.
>
> /dev/mirror/gm0s1g on /home (ufs, local, soft-updates)
>
> It runs a few jails, using ezjails. Two of them were image based jails,
> 1GB and 2GB. There is also one non-image file jail. The jails live in
> /home/ezjails.
>
> I added another image based jail, 3GB image, on March 12th.
>
> I added this machine to our AMANDA setup on March 13, 2009.
>
> Things seemed to be okay until the 19th. On the 19th, during the dump
> of /home, things gradually started to hang. Nagios paged me about
> services not responding.
>
> I did not find any explanation for it. The disks were idle according to
> systat -vm. I was able to grep the log files on /var for a while, and
> then I could no longer do anything with it.
>
> I eventually had to go to the office and power cycle it. I tried C-A-D
> first, but shutdown timed out after 30 seconds.
>
> Just to make sure it wasn't something that had since been fixed, I
> updated to RELENG_7 as of Mar 19th.
>
> Starting update: Thu Mar 19 03:40:41 CDT 2009
> Finished update: Thu Mar 19 03:48:45 CDT 2009
>
> I rebooted to the new kernel and installed the world just after midnight
> on the 20th. I started getting paged by Nagios again at 3:40am.
>
> I noticed that mksnap_ffs was running on /home, cpu time used: 0:00.77,
> as things began to circle the drain. That was about 30 minutes after
> the dump attempt had been started by AMANDA. There were many processes
> waiting in state D. This time I did a reboot -n -q and the box rebooted
> but was still fscking when I got to the office.
>
> # ls -l /home/.snap
> -r-------- 1 root operator 117285093376 Mar 20 03:18 dump_snapshot
>
> # df /home
> Filesystem          Size  Used  Avail  Capacity  Mounted on
> /dev/mirror/gm0s1g  106G   11G    86G       11%  /home
>
> I removed userquota from the fstab entry for /home and rebooted, just
> to be sure. The last danger combination I remember for snapshots was
> in combination with quotas. Am I even in the danger zone for quotas
> without having them compiled into the kernel?
>
> It looks like removing the .snap directory should be enough to prevent
> any future snapshots during the backup process. Does that sound like a
> reasonable workaround? It would at least remove one variable from the
> trouble shooting process.
>
> Any other suggestions?
>
> Thank you for any help you may be able to provide,

Did it to me again tonight. I was unable to get in to look at anything.
Just pushed the power button. It did give me the same "shutdown timed
out after 30 seconds."

So, I tuned the /home fs to disable softupdates. I also removed the
.snap directory.

I would appreciate any suggestions...

-- 
Scott Lambert                    KC5MLE                       Unix SysAdmin
lambert@lambertfam.org