From owner-freebsd-fs@FreeBSD.ORG Thu May 1 16:19:04 2014 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 88AEA555 for ; Thu, 1 May 2014 16:19:04 +0000 (UTC) Received: from albert.catwhisker.org (mx.catwhisker.org [198.144.209.73]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 417911D96 for ; Thu, 1 May 2014 16:19:03 +0000 (UTC) Received: from albert.catwhisker.org (localhost [127.0.0.1]) by albert.catwhisker.org (8.14.8/8.14.8) with ESMTP id s41GIuWd033531; Thu, 1 May 2014 09:18:56 -0700 (PDT) (envelope-from david@albert.catwhisker.org) Received: (from david@localhost) by albert.catwhisker.org (8.14.8/8.14.8/Submit) id s41GIunE033530; Thu, 1 May 2014 09:18:56 -0700 (PDT) (envelope-from david) Date: Thu, 1 May 2014 09:18:56 -0700 From: David Wolfskill To: fs@freebsd.org Subject: SU+J: 185 processes in state "suspfs" for >8 hrs. ... not good, right? Message-ID: <20140501161856.GH1120@albert.catwhisker.org> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="T1STeyE7BzLxwmnN" Content-Disposition: inline User-Agent: Mutt/1.5.23 (2014-03-12) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list Reply-To: fs@freebsd.org, David Wolfskill List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 May 2014 16:19:04 -0000 --T1STeyE7BzLxwmnN Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable I'm probably abusing things somewhat, but limits are to be pushed, yeah...? :-} At work, we have some build servers, presently running FreeBSD/amd64 stable/9 @r257221. They have 2 "packages" with 6 cores each (Xeon(R) CPU X5690 @ 3.47GHz); SMT is enabled, so the scheduler sees 24 cores. The local "build space" is a RAID 5 array of 10 2TB drives with a single UFS2+SU file system on it (~15TB). The software builds are performed within a jail (that is intended to look like FreeBSD/i386 7.1-RELEASE). My test workload is to: * create a "sandbox" (by checking out the sources). * tar up the sandbox (for future iterations). * Iterate over: - Blow away the old sandbox - Unpack a new sandbox from the tarball - Enter the sandbox & perform a timed software build - exit the sandbox & record the results (exit status; elapsed time, &c.) For the "blow away the old sandbox" step, I used to just use "rm -rf" -- but because the sandbox is ... large, that tends to be rather on the slow side. So I cobbled up a shell script that essentially does: max_proc=3D$( sysctl -n kern.maxprocperuid ) max_proc=3D$(( $max_proc / 2 )) for sb in $@; do find $sb -type d -depth +$depth -prune -print0 | \ xargs -0 -n 1 -P $max_proc rm -f -r & wait rm -fr $sb done which tends to be faster, as the process is parallelized (vs., I suppose, "paralyzed" :-}). I have the use of a designated 'test machine," which I subject to my ... experiments. Based on various other events, as well as at least one suggestion from a colleague, I thought I'd try turning on soft updates journaling -- so I did. My first set of tests were inconclusive -- I saw the load averages increase quite a bit (from a max of ~18 to ~25); some build times were around the same, while one was quite a bit longer. It's possible that one of my colleagues was doing something on the machine, though I had tried to let them know that I was running timing tests and things were a bit more "experimental" than usual. So I fired off another round of tests yesterday evening. This morning, load average was around 0 (and plenty of time had elapsed), so I thought maybe the 2nd round of tests had completed. I was mistaken in that belief. :-/ On resuming the tmux session, I found: =2E.. [SU+J], iteration 5, terminated status 0 at Thu May 1 00:01:06 PDT 2= 014 So I hit ^T and saw: load: 0.15 cmd: rm 73825 [suspfs] 16479.66r 0.00u 1.38s 0% 1436k Hmmm... So I waited a few minutes and hit ^T again and ... nothing. No response. Hmm... I was able to get a response from another session within the jail, as well as on the host. It's now been almost 5 more hours after that, and logging in to the host, I see: test-machine(9.2)[6] ps axwwl | grep -cw suspfs 185 test-machine(9.2)[7]=20 I suspect that there may have been a bad interaction between what I was doing and some cron-initiated activity that started just after midnight. But this particular mode of failure seems .... well, "graceless" comes to mind (if I want to be quite charitable). I think I'd like to file a PR on this, but I'd like to provide some decent information with the PR -- and I'd like to be able to do Other Things with the test machine in the interim. So: does anyone have suggestions for information I might gather while the machine is in this state that might help figure out a way to prevent a recurrence? Thanks! [Please note that I've set Reply-To to include both the list and me, as I'm not subscribed.] Peace, david --=20 David H. Wolfskill david@catwhisker.org Taliban: Evil cowards with guns afraid of truth from a 14-year old girl. See http://www.catwhisker.org/~david/publickey.gpg for my public key. --T1STeyE7BzLxwmnN Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (FreeBSD) iQJ8BAEBCgBmBQJTYnPtXxSAAAAAAC4AKGlzc3Vlci1mcHJAbm90YXRpb25zLm9w ZW5wZ3AuZmlmdGhob3JzZW1hbi5uZXQ4RThEMDY4QTIxMjc1MDZFRDIzODYzRTc4 QTY3RjlDOERFRjQxOTNCAAoJEIpn+cje9Bk7M/AP/3RvXjTi020uxHOYejS6j06q JbFOb1Zx+TUre5wneRpwugA/1QFKWHYdqdkLPaN9Wvo2vPMcozw+kaT03/nVR6vK CPbY6YMFu0ZkSgJxXYGiEV1+p5IU8uEfUN/28I4ih50LhZRba/b+B5/rJ0dKhJeB r/ArgVKvLNxojnIjo4v6pd5UOTzXGq47/IoQU2AGk3n/KEEWGMaLdl1JXixHdybm GRVazAGW52Oacx1SvlN0+6JlXob+D3fnfMvEbpASg/Wc0z2Ap8nhpDcHdL8irgMv sWmSd7nzRVW1OUU10GyXEsrViYce3sBwxQquV5HJzp915Pwls9J/9M2VplT0Blet al/UNwtlAeSY9uV3RVVOFSCeSyX+KiYnhRLyeQ5Yx4pEpFacH1lzoxLs7bwJmIsU 2isLa7TQybanqT0tCJA9IZrSQbJCvJbm9igbO3JzQI3i7mkfTwGP7Y8T4zt9TxrS OA9VxVhotYjUfcz67/JUnGpxUuJsHOcwiXmDETaOdIohFpjct50QVjJeirR3Diee 0TtOfwcqyxBrqPugOrMHkL04vQVyLnUPoHoee8tN2Ng9+k/ai+H+jn6ncEXq5hU3 FRQ8jQG/jp9y4mvpf9cHq5wdQOwG6IL6y0QypgzrkMeCBBuu+dta+0AYg8T03adi gXZ4mvpNaLwjLdcIibZQ =mzUh -----END PGP SIGNATURE----- --T1STeyE7BzLxwmnN--