From owner-freebsd-stable@FreeBSD.ORG  Sun Aug 25 07:16:40 2013
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 6.5 \(1508\))
Subject: Re: NFS deadlock on 9.2-Beta1
From: Michael Tratz <michael@esosoft.com>
In-Reply-To: <461392652.9990692.1376602743970.JavaMail.root@uoguelph.ca>
Date: Sat, 24 Aug 2013 23:51:45 -0700
Content-Transfer-Encoding: quoted-printable
Message-Id: <40674FAC-33E6-4994-819E-6B8318B9DDB3@esosoft.com>
References: <461392652.9990692.1376602743970.JavaMail.root@uoguelph.ca>
To: Rick Macklem <rmacklem@uoguelph.ca>
X-Mailer: Apple Mail (2.1508)
Cc: freebsd-stable@freebsd.org


On Aug 15, 2013, at 2:39 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:

> Michael Tratz wrote:
>>
>> On Jul 27, 2013, at 11:25 PM, Konstantin Belousov
>> <kostikbel@gmail.com> wrote:
>>
>>> On Sat, Jul 27, 2013 at 03:13:05PM -0700, Michael Tratz wrote:
>>>> Let's assume the pid which started the deadlock is 14001 (it will
>>>> be a different pid when we get the results, because the machine
>>>> has been restarted)
>>>>
>>>> I type:
>>>>
>>>> show proc 14001
>>>>
>>>> I get the thread numbers from that output and type:
>>>>
>>>> show thread xxxxx
>>>>
>>>> for each one.
>>>>
>>>> And a trace for each thread with the command?
>>>>
>>>> tr xxxx
>>>>
>>>> Anything else I should try to get or do? Or is that not the data
>>>> at all you are looking for?
>>>>
>>> Yes, everything else which is listed in the 'debugging deadlocks'
>>> page
>>> must be provided, otherwise the deadlock cannot be tracked.
>>>
>>> The investigator should be able to see the whole deadlock chain
>>> (loop)
>>> to make any useful advance.
>>
>> Ok, I have made some excellent progress in debugging the NFS
>> deadlock.
>>
>> Rick! You are a genius. :-) You found the right commit: r250907
>> (dated May 22) is definitely the problem.
>>
>> Here is how I did the testing: One machine received a kernel before
>> r250907, the second machine received a kernel after r250907. Sure
>> enough within a few hours the machine with r250907 went into the
>> usual deadlock state. The machine without that commit kept on
>> working fine. Then I went back to the latest revision (r253726), but
>> left r250907 out. The machines have been running happily and rock
>> solid without any deadlocks. I have expanded the testing to 3
>> machines now, with no reports of any issues.
>>
>> I guess now Konstantin has to figure out why that commit is causing
>> the deadlock. Lovely! :-) I will get that information as soon as
>> possible. I'm a little behind with normal work load, but I expect to
>> have the data by Tuesday evening or Wednesday.
>>
> Have you been able to pass the debugging info on to Kostik?
>
> It would be really nice to get this fixed for FreeBSD9.2.
>
> Thanks for your help with this, rick

Sorry Rick, I wasn't able to get you guys that info quickly enough. I
thought I would have enough time, before my own wedding and honeymoon
came along, but everything went a little crazy and stressful. I didn't
think it would be this nuts. :-)

I'm caught up with everything now, and from what I can see in the
discussions, we now know what the problem is.

I can report that the machines I have been running without r250907
have had no problems for 27+ days.

If you need me to test any new patches, please let me know. If I should
test with the partial merge of r253927 I'll be happy to do so.

Thanks,

Michael