From: Michael Tratz
To: Konstantin Belousov
Cc: freebsd-stable@freebsd.org, Rick Macklem, Steven Hartland
Subject: Re: NFS deadlock on 9.2-Beta1
Date: Mon, 29 Jul 2013 13:44:39 -0700
In-Reply-To: <20130728062545.GE4972@kib.kiev.ua>
References: <780BC2DB-3BBA-4396-852B-0EBDF30BF985@esosoft.com> <806421474.2797338.1374956449542.JavaMail.root@uoguelph.ca> <20130727205815.GC4972@kib.kiev.ua> <602747E8-0EBE-4BB1-8019-C02C25B75FA1@esosoft.com> <20130728062545.GE4972@kib.kiev.ua>

On Jul 27, 2013, at 11:25 PM, Konstantin Belousov wrote:

> On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
>> Let's assume the pid which started the deadlock is 14001 (it will be a different pid when we get the results, because the machine has been restarted)
>>
>> I type:
>>
>> show proc 14001
>>
>> I get the thread numbers from that output and type:
>>
>> show thread xxxxx
>>
>> for each one.
>>
>> And a trace for each thread with the command?
>>
>> tr xxxx
>>
>> Anything else I should try to get or do? Or is that not the data at all you are looking for?
>>
> Yes, everything else which is listed in the 'debugging deadlocks' page
> must be provided, otherwise the deadlock cannot be tracked.
>
> The investigator should be able to see the whole deadlock chain (loop)
> to make any useful advance.

Ok, I have made some excellent progress in debugging the NFS deadlock.

Rick! You are a genius. :-) You found the right commit: r250907 (dated May 22) is definitely the problem.

Here is how I did the testing: One machine received a kernel before r250907, the second machine received a kernel after r250907. Sure enough, within a few hours the machine with r250907 went into the usual deadlock state. The machine without that commit kept on working fine. Then I went back to the latest revision (r253726), but with r250907 left out. The machines have been running happily and rock solid without any deadlocks. I have expanded the testing to 3 machines now and there are no reports of any issues.

I guess now Konstantin has to figure out why that commit is causing the deadlock. Lovely! :-) I will get the debugging information you asked for as soon as possible. I'm a little behind with my normal workload, but I expect to have the data by Tuesday evening or Wednesday.

Thanks again!!

Michael