From: Michael Tratz
To: Rick Macklem
Cc: freebsd-stable@freebsd.org
Subject: Re: NFS deadlock on 9.2-Beta1
Date: Sat, 24 Aug 2013 23:51:45 -0700
Message-Id: <40674FAC-33E6-4994-819E-6B8318B9DDB3@esosoft.com>
In-Reply-To: <461392652.9990692.1376602743970.JavaMail.root@uoguelph.ca>

On Aug 15, 2013, at 2:39 PM, Rick Macklem wrote:

> Michael Tratz wrote:
>>
>> On Jul 27, 2013, at 11:25 PM, Konstantin Belousov wrote:
>>
>>> On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
>>>> Let's assume the pid which started the deadlock is 14001 (it will be
>>>> a different pid when we get the results, because the machine has been
>>>> restarted)
>>>>
>>>> I type:
>>>>
>>>> show proc 14001
>>>>
>>>> I get the thread numbers from that output and type:
>>>>
>>>> show thread xxxxx
>>>>
>>>> for each one.
>>>>
>>>> And a trace for each thread with the command?
>>>>
>>>> tr xxxx
>>>>
>>>> Anything else I should try to get or do? Or is that not the data
>>>> at all you are looking for?
>>>>
>>> Yes, everything else which is listed in the 'debugging deadlocks'
>>> page must be provided, otherwise the deadlock cannot be tracked.
>>>
>>> The investigator should be able to see the whole deadlock chain
>>> (loop) to make any useful advance.
>>
>> Ok, I have made some excellent progress in debugging the NFS
>> deadlock.
>>
>> Rick! You are a genius. :-) You found the right commit: r250907
>> (dated May 22) is definitely the problem.
>>
>> Here is how I did the testing: One machine received a kernel before
>> r250907, the second machine received a kernel after r250907. Sure
>> enough within a few hours the machine with r250907 went into the
>> usual deadlock state. The machine without that commit kept on
>> working fine. Then I went back to the latest revision (r253726), but
>> leaving r250907 out. The machines have been running happy and rock
>> solid without any deadlocks. I have expanded the testing to 3
>> machines now and no reports of any issues.
>>
>> I guess now Konstantin has to figure out why that commit is causing
>> the deadlock. Lovely! :-) I will get that information as soon as
>> possible. I'm a little behind with normal work load, but I expect to
>> have the data by Tuesday evening or Wednesday.
>>
> Have you been able to pass the debugging info on to Kostik?
>
> It would be really nice to get this fixed for FreeBSD 9.2.
>
> Thanks for your help with this, rick

Sorry Rick, I wasn't able to get you guys that info quickly enough.
I thought I would have enough time before my own wedding and honeymoon
came along, but everything went a little crazy and stressful. I didn't
think it would be this nuts. :-)

I'm caught up with everything now, and from what I can see in the
discussions, we now know what the problem is.

I can report that the machines I have been running without r250907
have had no problems for 27+ days.

If you need me to test any new patches, please let me know. If I should
test with the partial merge of r253927, I'll be happy to do so.

Thanks,

Michael