Date: Mon, 29 Jul 2013 19:37:12 -0400 (EDT)
From: Rick Macklem
To: Michael Tratz
Cc: Konstantin Belousov, freebsd-stable@freebsd.org, Steven Hartland, re
Subject: Re: NFS deadlock on 9.2-Beta1
Message-ID: <1710471570.3603170.1375141032147.JavaMail.root@uoguelph.ca>
List-Id: Production branch of FreeBSD source code

Michael Tratz wrote:
>
> On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
> wrote:
>
> > On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
> >> Let's assume the pid which started the deadlock is 14001 (it will
> >> be a different pid when we get the results, because the machine
> >> has been restarted)
> >>
> >> I type:
> >>
> >> show proc 14001
> >>
> >> I get the thread numbers from that output and type:
> >>
> >> show thread xxxxx
> >>
> >> for each one.
> >>
> >> And a trace for each thread with the command?
> >>
> >> tr xxxx
> >>
> >> Anything else I should try to get or do? Or is that not the data
> >> at all you are looking for?
> >>
> > Yes, everything else which is listed in the 'debugging deadlocks'
> > page must be provided, otherwise the deadlock cannot be tracked.
> >
> > The investigator should be able to see the whole deadlock chain
> > (loop) to make any useful advance.
>
> Ok, I have made some excellent progress in debugging the NFS
> deadlock.
>
> Rick! You are a genius. :-) You found the right commit: r250907
> (dated May 22) is definitely the problem.
>
Nowhere close, take my word for it ;-) (At least you put a smiley after
it.) (I've never actually even been employed as a software developer,
but that's off topic.)

I just got lucky (basically there wasn't any other commit that seemed
like it might cause this).

But, the good news is that it is partially isolated. Hopefully the
debugging stuff you get for Kostik will allow him (I suspect he is a
genius) to solve the problem. (If I was going to take another "shot in
the dark", I'd guess it's r250027 moving the vn_lock() call. Maybe
calling vm_page_grab() with the shared vnode lock held?)

I've added re@ to the cc list, since I think this might be a show
stopper for 9.2?

Thanks for reporting this and all your help with tracking it down,
rick

> Here is how I did the testing: One machine received a kernel before
> r250907, the second machine received a kernel after r250907.
> Sure enough, within a few hours the machine with r250907 went into
> the usual deadlock state. The machine without that commit kept on
> working fine. Then I went back to the latest revision (r253726), but
> leaving r250907 out. The machines have been running happily and rock
> solid without any deadlocks. I have expanded the testing to 3
> machines now and no reports of any issues.
>
> I guess now Konstantin has to figure out why that commit is causing
> the deadlock. Lovely! :-) I will get that information as soon as
> possible. I'm a little behind with my normal workload, but I expect
> to have the data by Tuesday evening or Wednesday.
>
> Thanks again!!
>
> Michael