Date: Sun, 22 Mar 2009 04:31:56 -0500
From: Scott Lambert <lambert@lambertfam.org>
To: FreeBSD-stable <freebsd-stable@freebsd.org>
Subject: Re: Is some combination of gmirror, md file systems, snapshots and, maybe, quotas considered harmful?
Message-ID: <20090322093156.GE80292@sysmon.tcworks.net>
In-Reply-To: <20090320194157.GB80292@sysmon.tcworks.net>
References: <20090320194157.GB80292@sysmon.tcworks.net>

On Fri, Mar 20, 2009 at 02:41:57PM -0500, Scott Lambert wrote:
> I have a previously stable machine, other than a one time panic in
> soft-updates which I could never reproduce, running RELENG_7 from July
> 23, 2008.
>
> Starting update: Wed Jul 23 01:29:47 CDT 2008
> Finished update: Wed Jul 23 01:31:13 CDT 2008
>
> I had the userquota option in the fstab for /home, but I did not yet
> have anything in /etc/rc.conf to enable them. I have been running an
> unmodified GENERIC kernel config.
>
> /dev/mirror/gm0s1g on /home (ufs, local, soft-updates)
>
> It runs a few jails, using ezjails. Two of them were image based jails,
> 1GB and 2GB. There is also one non-image file jail. The jails live in
> /home/ezjails.
>
> I added another image based jail, 3GB image, on March 12th.
>
> I added this machine to our AMANDA setup on March 13, 2009.
>
> Things seemed to be okay until the 19th. On the 19th, during the dump
> of /home, things gradually started to hang. Nagios paged me about
> services not responding.
>
> I did not find any explanation for it. The disks were idle according to
> systat -vm. I was able to grep the log files on /var for a while, and
> then I could no longer do anything with it.
>
> I eventually had to go to the office and power cycle it. I tried C-A-D
> first, but shutdown timed out after 30 seconds.
>
> Just to make sure it wasn't something that had since been fixed, I
> updated to RELENG_7 as of Mar 19th.
>
> Starting update: Thu Mar 19 03:40:41 CDT 2009
> Finished update: Thu Mar 19 03:48:45 CDT 2009
>
> I rebooted to the new kernel and installed the world just after midnight
> on the 20th. I started getting paged by Nagios again at 3:40am.
>
> I noticed that mksnap_ffs was running on /home, cpu time used: 0:00.77,
> as things began to circle the drain. That was about 30 minutes after
> the dump attempt had been started by AMANDA. There were many processes
> waiting in state D. This time I did a reboot -n -q and the box rebooted
> but was still fscking when I got to the office.
>
> # ls -l /home/.snap
> -r-------- 1 root operator 117285093376 Mar 20 03:18 dump_snapshot
>
> # df /home
> Filesystem          Size  Used  Avail  Capacity  Mounted on
> /dev/mirror/gm0s1g  106G   11G    86G       11%  /home
>
> I removed userquota from the fstab entry for /home and rebooted, just
> to be sure. The last danger combination I remember for snapshots was
> in combination with quotas. Am I even in the danger zone for quotas
> without having them compiled into the kernel?
>
> It looks like removing the .snap directory should be enough to prevent
> any future snapshots during the backup process. Does that sound like a
> reasonable workaround? It would at least remove one variable from the
> trouble shooting process.
>
> Any other suggestions?
>
> Thank you for any help you may be able to provide,

Did it to me again tonight. I was unable to get in to look at anything.
Just pushed the power button. It did give me the same "shutdown timed
out after 30 seconds."

So, I tuned the /home fs to disable softupdates. I also removed the
.snap directory.

I would appreciate any suggestions...

-- 
Scott Lambert                    KC5MLE                       Unix SysAdmin
lambert@lambertfam.org