Message-ID: <49C67D8A.5070505@FreeBSD.org>
Date: Sun, 22 Mar 2009 18:03:54 +0000
From: Kris Kennaway
To: FreeBSD-stable
References: <20090320194157.GB80292@sysmon.tcworks.net> <20090322093156.GE80292@sysmon.tcworks.net>
In-Reply-To: <20090322093156.GE80292@sysmon.tcworks.net>
Subject: Re: Is some combination of gmirror, md file systems, snapshots and, maybe, quotas considered harmful?
List-Id: Production branch of FreeBSD source code

Scott Lambert wrote:
> On Fri, Mar 20, 2009 at 02:41:57PM -0500, Scott Lambert wrote:
>> I have a previously stable machine, other than a one time panic in
>> soft-updates which I could never reproduce, running RELENG_7 from July
>> 23, 2008.
>>
>> Starting update: Wed Jul 23 01:29:47 CDT 2008
>> Finished update: Wed Jul 23 01:31:13 CDT 2008
>>
>> I had the userquota option in the fstab for /home, but I did not yet
>> have anything in /etc/rc.conf to enable them. I have been running an
>> unmodified GENERIC kernel config.
>>
>> /dev/mirror/gm0s1g on /home (ufs, local, soft-updates)
>>
>> It runs a few jails, using ezjails.
>> Two of them were image based jails,
>> 1GB and 2GB. There is also one non-image file jail. The jails live in
>> /home/ezjails.
>>
>> I added another image based jail, 3GB image, on March 12th.
>>
>> I added this machine to our AMANDA setup on March 13, 2009.
>>
>> Things seemed to be okay until the 19th. On the 19th, during the dump
>> of /home, things gradually started to hang. Nagios paged me about
>> services not responding.
>>
>> I did not find any explanation for it. The disks were idle according to
>> systat -vm. I was able to grep the log files on /var for a while, and
>> then I could no longer do anything with it.
>>
>> I eventually had to go to the office and power cycle it. I tried C-A-D
>> first, but shutdown timed out after 30 seconds.
>>
>> Just to make sure it wasn't something that had since been fixed, I
>> updated to RELENG_7 as of Mar 19th.
>>
>> Starting update: Thu Mar 19 03:40:41 CDT 2009
>> Finished update: Thu Mar 19 03:48:45 CDT 2009
>>
>> I rebooted to the new kernel and installed the world just after midnight
>> on the 20th. I started getting paged by Nagios again at 3:40am.
>>
>> I noticed that mksnap_ffs was running on /home, cpu time used: 0:00.77,
>> as things began to circle the drain. That was about 30 minutes after
>> the dump attempt had been started by AMANDA. There were many processes
>> waiting in state D. This time I did a reboot -n -q and the box rebooted
>> but was still fscking when I got to the office.
>>
>> # ls -l /home/.snap
>> -r-------- 1 root operator 117285093376 Mar 20 03:18 dump_snapshot
>>
>> # df /home
>> Filesystem          Size  Used  Avail  Capacity  Mounted on
>> /dev/mirror/gm0s1g  106G   11G    86G       11%  /home
>>
>> I removed userquota from the fstab entry for /home and rebooted, just
>> to be sure. The last danger combination I remember for snapshots was
>> in combination with quotas. Am I even in the danger zone for quotas
>> without having them compiled into the kernel?
>>
>> It looks like removing the .snap directory should be enough to prevent
>> any future snapshots during the backup process. Does that sound like a
>> reasonable workaround? It would at least remove one variable from the
>> trouble shooting process.
>>
>> Any other suggestions?
>>
>> Thank you for any help you may be able to provide,
>
> Did it to me again tonight. I was unable to get in to look at anything.
> Just pushed the power button. It did give me the same "shutdown timed
> out after 30 seconds."
>
> So, I tuned the /home fs to disable softupdates. I also removed the
> .snap directory.
>
> I would appreciate any suggestions...
>

http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html

Kris
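
The report above mentions "many processes waiting in state D" while
mksnap_ffs was running. As a rough aid for capturing that evidence before
the box becomes unreachable, here is a minimal sketch that filters saved ps
output down to the processes in an uninterruptible wait. It is not from the
thread: the function name blocked_processes, the SAMPLE text, and the wait
channel names in it are hypothetical, and the sketch assumes the input was
produced by something like "ps -axo pid,state,wchan,comm" (four
whitespace-separated columns under a one-line header).

```python
def blocked_processes(ps_output):
    """Return (pid, state, wchan, command) for each process whose state
    starts with 'D' (uninterruptible wait, typically disk I/O)."""
    blocked = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # the first line is the column header
        fields = line.split(None, 3)
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip anything that is not a full process row
        pid, state, wchan, command = fields
        if state.startswith("D"):
            blocked.append((int(pid), state, wchan, command))
    return blocked


# Hypothetical sample input; the PIDs and wait channels are made up
# for illustration and are not taken from the affected machine.
SAMPLE = """\
  PID STAT WCHAN  COMMAND
  812 Ss   select sshd
 1403 D    suspfs mksnap_ffs
 1410 DL   snaplk dump
 1522 D    ufs    nagios
"""

for pid, state, wchan, command in blocked_processes(SAMPLE):
    print(f"{pid:>6} {state:<4} {wchan:<8} {command}")
```

The wait channel column is the useful part: if the stuck processes share
one channel, that is a concrete detail to include alongside the kernel
debugging information the handbook chapter describes.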