From: Rafi Rubin
Date: Sun, 22 Jul 2012 02:32:35 -0400
To: Peter Maloney
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS deadlock

I'll have to read
those more carefully later. From what I've read so far, it looks like that
happens with a single NFS client. In my case, a single client isn't a
problem, and when I drop nfsd to a single thread, many clients (in reality,
of course, just one at a time) are safe.

I have no problem reading the .zfs directory and my snapshots. I'm also not
sure whether the .zfs directory is actually more sensitive than anything
else; I was just happy to find something that triggered this bug reliably.
I plan to explore more after I set up a non-production machine for testing.

We've been seeing crashes roughly once every 2 to 7 days, and I doubt anyone
has been using the .zfs directories (but I can't be sure). Also, the failure
mode isn't quite the same (the server crashes or reboots instead of just
freezing one volume), but I've been changing things around to try to bisect
the problem. This is just the first consistent bug that I've been able to
identify.

Rafi

On 07/21/12 11:02, Peter Maloney wrote:
> I've had the same thing happen. It only happens with Linux clients. My
> workaround is to mount an empty directory on top of .zfs. I submitted a PR
> about it also. And I can't reproduce it on other machines, including the
> replicated backup server with all the same snapshots. It only happens on
> the one production machine, possibly because it has more nfs clients
> active than what I created for my tests.
>
> Here's my forum thread: http://forums.freebsd.org/showthread.php?t=29648
>
> Here's my PR: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/168947
>
>
> On 07/21/2012 01:50 PM, Rafi Rubin wrote:
>> I have a small server with a mirrored pair of hard drives that handles
>> generally light loads for a number of linux machines over nfs v3.
>>
>> I've been seeing some deadlocks (and possibly worse) lately.
>> I've narrowed down the freezing to a simple test run simultaneously on a
>> number of the client machines:
>>
>> mount host:/food/bar
>> ls bar/.zfs/snapshot
>>
>> When I do that, all the client machines hang on any access to the
>> server. On the server, only the volume "bar" hangs, the rest are fine.
>>
>> This occurs even when there is no other load on the server. Also, after
>> rebooting the server, the client machines eventually complete the ls and
>> behave normally.
>>
>> I think this doesn't happen if the directory is cached, either in memory
>> or on an L2 ssd, but I need to rerun some tests.
>>
>> For now the machine seems to be stable with nfsd limited to a single
>> thread.
>>
>> Any help debugging would be appreciated,
>> Rafi
>> _______________________________________________
>> freebsd-fs@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
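The reproduction in the original report only triggers when the listing starts on several clients at nearly the same moment. A minimal sketch of one way to drive that from a single admin machine follows. It assumes passwordless ssh to each client and that the export is already mounted at /mnt/bar on every client; the mount point, the client host names, and the repro_snapshot_ls helper are hypothetical and not part of the original report.

```shell
#!/bin/sh
# Hypothetical helper: run "ls" on the .zfs/snapshot directory of an NFS
# mount on several clients at (roughly) the same time, then wait for all.
# Assumptions (not from the report): passwordless ssh to every client, and
# host:/food/bar already mounted at /mnt/bar on each of them.
# SSH and REMOTE_CMD may be overridden from the calling shell.
repro_snapshot_ls() {
    ssh_cmd=${SSH:-ssh}
    remote_cmd=${REMOTE_CMD:-"ls /mnt/bar/.zfs/snapshot"}
    for host in "$@"; do
        # Background each session so the listings overlap in time.
        "$ssh_cmd" "$host" "$remote_cmd" &
    done
    # If the server deadlocks, some sessions never finish and this blocks.
    wait
}

# Example with hypothetical host names:
# repro_snapshot_ls client1 client2 client3 client4
```

If the bug reproduces, the final wait should hang, matching the report that every client blocks on any access to that export until the server is rebooted.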
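The stopgap of running nfsd with a single thread can be made persistent on FreeBSD through /etc/rc.conf. The fragment below is a sketch that assumes the stock rc.d nfsd script and nfsd(8)'s -n option (number of server threads), with -u and -t enabling UDP and TCP service; any other flags already present in nfs_server_flags should be carried over.

```shell
# /etc/rc.conf fragment (sketch): serve NFS over UDP (-u) and TCP (-t)
# with a single nfsd thread (-n 1). One thread handles one request at a
# time, so expect lower throughput when many clients are active.
nfs_server_enable="YES"
nfs_server_flags="-u -t -n 1"
```

Restart the NFS server afterwards (for example with `service nfsd restart`) so the new thread count takes effect.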