From: Jeremy Chadwick
To: David Wolfskill, Tim Bishop, Kostik Belousov, freebsd-stable@freebsd.org
Date: Wed, 12 Nov 2008 22:05:21 -0800
Subject: Re: System deadlock when using mksnap_ffs
Message-ID: <20081113060521.GA11595@icarus.home.lan>
In-Reply-To: <20081113050250.GR69155@bunrab.catwhisker.org>
References: <20081112175826.GD26195@carrick.bishnet.net> <20081112194735.GK47073@deviant.kiev.zoral.com.ua> <20081113004102.GD24360@carrick.bishnet.net> <20081113044200.GA10419@icarus.home.lan> <20081113050250.GR69155@bunrab.catwhisker.org>
User-Agent: Mutt/1.5.18 (2008-05-17)
List-Id: Production branch of FreeBSD source code

On Wed, Nov 12, 2008 at 09:02:50PM -0800, David Wolfskill wrote:
> On Wed, Nov 12, 2008 at 08:42:00PM -0800, Jeremy Chadwick wrote:
> > ...
> > > > On Wed, Nov 12, 2008 at 05:58:26PM +0000, Tim Bishop wrote:
> > > > > I've been playing around with snapshots lately but I've got a problem on
> > > > > one of my servers running 7-STABLE amd64:
> > > > >
> > > > > FreeBSD paladin 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #8: Mon Nov 10 20:49:51 GMT 2008 tdb@paladin:/usr/obj/usr/src/sys/PALADIN amd64
> > > > >
> > > > > I run the mksnap_ffs command to take the snapshot and some time later
> > > > > the system completely freezes up:
> > > > >
> > > > > paladin# cd /u2/.snap/
> > > > > paladin# mksnap_ffs /u2 test.1
> > > > >
> > > > > It only happens on this one filesystem, though, which might be to do
> > > > > with its size. It's not over the 2TB marker, but it's pretty close. It's
> > > > > also backed by a hardware RAID system, although a smaller filesystem on
> > > > > the same RAID has no issues.
> > ...
> > Then in my book, the patch didn't fix anything. :-) The system is
> > still "deadlocking"; snapshot generation **should not** wedge the system
> > hard like this.
> >
> > Also, during my own testing, I am always able to use Ctrl-T to get
> > SIGINFO from the running process (mksnap_ffs). That behaviour does not
> > change for me.
> >
> > The rest of the below information is good -- but I'm confused about
> > something: is there anyone out there who can use mksnap_ffs on a
> > filesystem (/usr is a good test source) and NOT experience this
> > deadlocking problem?
>
> I hadn't ever tried until I saw your message. Granted, I'm using a
> smaller file system (I doubt that I have a total of as much as 2 TB in
> all my machines combined), and I'm running i386, vs. amd64. But it ran
> just fine.
> I wasn't able to test SIGINFO; it finished before I had a
> chance. (I ran it under time(1); wall clock time was 0.91 sec.)
>
> > Literally *every* FreeBSD box I have root access
> > to suffers from this problem, so I'm a little baffled why we end-users
> > need to keep providing debugging output when it should be easy as pie
> > for a developer to do "dump -0 -L -a -f /path/fs.dump /usr" and watch
> > their system wedge.
>
> Well, I routinely use dump/restore pipelines to copy file systems
> around; never had a problem with it.
>
> > ...
>
> For reference:
>
> freebeast(7.1-P)[9] uname -a
> FreeBSD freebeast.catwhisker.org 7.1-PRERELEASE FreeBSD 7.1-PRERELEASE #127: Wed Nov 12 05:16:20 PST 2008 root@freebeast.catwhisker.org:/common/S3/obj/usr/src/sys/FREEBEAST i386
> freebeast(7.1-P)[10] ls -la
> total 4
> drwxrwxr-x   2 root  operator  512 Nov 12 20:53 .
> drwxr-xr-x  14 root  wheel     512 Jan 22  2008 ..
> freebeast(7.1-P)[11] /usr/bin/time -l mksnap_ffs /S2/usr test.1
>         0.91 real         0.00 user         0.05 sys
>        976  maximum resident set size
>          3  average shared memory size
>        627  average unshared data size
>        109  average unshared stack size
>        104  page reclaims
>          0  page faults
>          0  swaps
>          1  block input operations
>        230  block output operations
>          0  messages sent
>          0  messages received
>          0  signals received
>        101  voluntary context switches
>         34  involuntary context switches
> freebeast(7.1-P)[12] ls -la
> total 1460
> drwxrwxr-x   2 root  operator         512 Nov 12 20:54 .
> drwxr-xr-x  14 root  wheel            512 Jan 22  2008 ..
> -r--r-----   1 root  operator  2410791056 Nov 12 20:54 test.1
> freebeast(7.1-P)[13]

David, thanks for chiming in. This is exactly what I was fearing/worried
about. It would be greatly beneficial if we could figure out what
triggers the slowdown for a lot of us, since for others (proof above)
mksnap_ffs behaves as expected.
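
For anyone who wants to report comparable numbers, here is a rough sh
sketch that gathers the same data points used in this thread (uname,
df -ki, time -l around mksnap_ffs, and the snapshot file size). The
function names snap_report and bytes_to_gib are made up for illustration,
and the sketch assumes the two-argument mksnap_ffs syntax shown above and
an existing .snap directory on the filesystem; treat it as a starting
point, not a tested tool.

```shell
#!/bin/sh
# Sketch only. snap_report and bytes_to_gib are hypothetical helper names.

# Convert a byte count to GiB with one decimal place.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1073741824 }'
}

# Usage: snap_report <mountpoint> [snapshot-name]
# Assumes <mountpoint>/.snap exists and the caller is root.
snap_report() {
    fs=$1
    snap=${2:-test.1}

    uname -a
    df -ki "$fs"

    # Run in a subshell so the caller's working directory is untouched.
    (
        cd "$fs/.snap" || exit 1
        # Same invocation as in this thread: mountpoint, then snapshot name.
        /usr/bin/time -l mksnap_ffs "$fs" "$snap"
        # Field 5 of "ls -l" is the apparent size of the snapshot file.
        size=$(ls -l "$snap" | awk '{ print $5 }')
        echo "snapshot file: $size bytes ($(bytes_to_gib "$size") GiB)"
    )
}
```

Sourcing the file and running "snap_report /usr test.1" as root should
print everything in one go, which makes reports from different machines
easier to line up.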

Since I'm able to reproduce this pretty much everywhere, here's some
information:

# df -ki /usr
Filesystem  1024-blocks    Used     Avail Capacity  iused    ifree %iused  Mounted on
/dev/ad4s1f   163815904 3835274 146875358     3%   254864 20941934    1%   /usr
# cd /usr/.snap
# /usr/bin/time -l mksnap_ffs /usr test.1
load: 1.90  cmd: mksnap_ffs 11719 [wdrain] 0.00u 0.07s 0% 1092k
23.25 real 0.00 user 0.00 sys
      135.98 real         0.00 user         0.62 sys
      1092  maximum resident set size
         4  average shared memory size
      1081  average unshared data size
       135  average unshared stack size
       101  page reclaims
         0  page faults
         0  swaps
       895  block input operations
     13444  block output operations
         0  messages sent
         0  messages received
         0  signals received
      6433  voluntary context switches
       197  involuntary context switches
# ls -l test.1
-r--r-----  1 root  operator  173203463240 Nov 12 21:42 test.1

David's filesystem is about 2GB, while mine is about 160GB. His snap
takes under 1 second, yet mine takes over 2 minutes. Possibly the large
deviation is explained by the amount of space used on the filesystem or
the number of inodes in use?

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |