Date: Sun, 22 Mar 2009 18:03:54 +0000
From: Kris Kennaway <kris@FreeBSD.org>
To: FreeBSD-stable <freebsd-stable@freebsd.org>
Subject: Re: Is some combination of gmirror, md file systems, snapshots and, maybe, quotas considered harmful?
Message-ID: <49C67D8A.5070505@FreeBSD.org>
In-Reply-To: <20090322093156.GE80292@sysmon.tcworks.net>
References: <20090320194157.GB80292@sysmon.tcworks.net> <20090322093156.GE80292@sysmon.tcworks.net>

Scott Lambert wrote:
> On Fri, Mar 20, 2009 at 02:41:57PM -0500, Scott Lambert wrote:
>> I have a previously stable machine, other than a one time panic in
>> soft-updates which I could never reproduce, running RELENG_7 from July
>> 23, 2008.
>>
>> Starting update: Wed Jul 23 01:29:47 CDT 2008
>> Finished update: Wed Jul 23 01:31:13 CDT 2008
>>
>> I had the userquota option in the fstab for /home, but I did not yet
>> have anything in /etc/rc.conf to enable them. I have been running an
>> unmodified GENERIC kernel config.
>>
>> /dev/mirror/gm0s1g on /home (ufs, local, soft-updates)
>>
>> It runs a few jails, using ezjails. Two of them were image based jails,
>> 1GB and 2GB. There is also one non-image file jail. The jails live in
>> /home/ezjails.
>>
>> I added another image based jail, 3GB image, on March 12th.
>>
>> I added this machine to our AMANDA setup on March 13, 2009.
>>
>> Things seemed to be okay until the 19th. On the 19th, during the dump
>> of /home, things gradually started to hang. Nagios paged me about
>> services not responding.
>>
>> I did not find any explanation for it. The disks were idle according to
>> systat -vm. I was able to grep the log files on /var for a while, and
>> then I could no longer do anything with it.
>>
>> I eventually had to go to the office and power cycle it. I tried C-A-D
>> first, but shutdown timed out after 30 seconds.
>>
>> Just to make sure it wasn't something that had since been fixed, I
>> updated to RELENG_7 as of Mar 19th.
>>
>> Starting update: Thu Mar 19 03:40:41 CDT 2009
>> Finished update: Thu Mar 19 03:48:45 CDT 2009
>>
>> I rebooted to the new kernel and installed the world just after midnight
>> on the 20th. I started getting paged by Nagios again at 3:40am.
>>
>> I noticed that mksnap_ffs was running on /home, cpu time used: 0:00.77,
>> as things began to circle the drain. That was about 30 minutes after
>> the dump attempt had been started by AMANDA. There were many processes
>> waiting in state D. This time I did a reboot -n -q and the box rebooted
>> but was still fscking when I got to the office.
>>
>> # ls -l /home/.snap
>> -r-------- 1 root operator 117285093376 Mar 20 03:18 dump_snapshot
>>
>> # df /home
>> Filesystem           Size  Used  Avail  Capacity  Mounted on
>> /dev/mirror/gm0s1g   106G   11G    86G       11%  /home
>>
>> I removed userquota from the fstab entry for /home and rebooted, just
>> to be sure. The last danger combination I remember for snapshots was
>> in combination with quotas. Am I even in the danger zone for quotas
>> without having them compiled into the kernel?
>>
>> It looks like removing the .snap directory should be enough to prevent
>> any future snapshots during the backup process. Does that sound like a
>> reasonable workaround? It would at least remove one variable from the
>> trouble shooting process.
>>
>> Any other suggestions?
>>
>> Thank you for any help you may be able to provide,
>
> Did it to me again tonight. I was unable to get in to look at anything.
> Just pushed the power button. It did give me the same "shutdown timed
> out after 30 seconds."
>
> So, I tuned the /home fs to disable softupdates. I also removed the
> .snap directory.
>
> I would appreciate any suggestions...
>

http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug.html

Kris