Date: Mon, 5 Mar 2012 18:50:27 +0000
From: Attilio Rao
To: Arnaud Lacombe
Cc: freebsd-stable
Subject: Re: Complete hang on 9.0-RELEASE

2012/3/5, Arnaud Lacombe:
> Hi,
>
> On Wed, Feb 29, 2012 at 2:31 PM, Arnaud Lacombe wrote:
>> Hi,
>>
>> On Wed, Feb 29, 2012 at 2:22 PM, Attilio Rao wrote:
>>> 2012/2/29, Arnaud Lacombe:
>>>> Hi,
>>>>
>>>> On Wed, Feb 29, 2012 at 1:44 PM, Attilio Rao wrote:
>>>>> 2012/2/29, Arnaud Lacombe:
>>>>>> Hi,
>>>>>>
>>>>>> On Wed, Feb 29, 2012 at 12:59 PM, Arnaud Lacombe wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Mon, Feb 27, 2012 at 12:48 PM, Arnaud Lacombe wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Mon, Feb 27, 2012 at 10:36 AM, Attilio Rao wrote:
>>>>>>>>> 2012/2/27, Arnaud Lacombe:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 14, 2012 at 11:41 AM, Arnaud Lacombe wrote:
>>>>>>>>>>> Hi folks,
>>>>>>>>>>>
>>>>>>>>>>> For the record, I was running some tests yesterday on top of a
>>>>>>>>>>> 9.0-RELEASE amd64 kernel when the box hung. At the time of the
>>>>>>>>>>> hang, the box was running a process with about 2800 threads
>>>>>>>>>>> doing heavy IPC between 1400 writers and 1400 readers. The box
>>>>>>>>>>> was in single-user mode (/bin/sh coming from FreeBSD
>>>>>>>>>>> 7.4-STABLE). Here is the beginning of the dmesg:
>>>>>>>>>>>
>>>>>>>>>> This happened a second time, now with FreeBSD 8.2-RELEASE: a
>>>>>>>>>> complete machine hang. The machine was running about 4000
>>>>>>>>>> threads in a single process; all the other conditions were the
>>>>>>>>>> same.
>>>>>>>>>
>>>>>>>>> Arnaud,
>>>>>>>>> can you please break into your kernel via KDB and collect the
>>>>>>>>> following information from the DDB prompt:
>>>>>>>>> - ps
>>>>>>>>> - alltrace
>>>>>>>>> - show allpcpu
>>>>>>>>> - possibly a coredump, with 'call doadump'
>>>>>>>>>
>>>>>>>> Will do, but I'll need to rebuild a kernel to include DDB.
>>>>>>>>
>>>>>>>>> and in the end provide all of those, along with the kernel binary
>>>>>>>>> and possibly the sources, somewhere?
>>>>>>>>>
>>>>>>>> I'll be testing a bare `release/8.2.0' with the following patch:
>>>>>>>>
>>>>>>>> diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
>>>>>>>> index c3e0095..7bd997f 100644
>>>>>>>> --- a/sys/amd64/conf/GENERIC
>>>>>>>> +++ b/sys/amd64/conf/GENERIC
>>>>>>>> @@ -79,6 +79,10 @@ options INCLUDE_CONFIG_FILE # Include this file in kernel
>>>>>>>>
>>>>>>>> options KDB # Kernel debugger related code
>>>>>>>> options KDB_TRACE # Print a stack trace for a panic
>>>>>>>> +options DDB
>>>>>>>> +options BREAK_TO_DEBUGGER
>>>>>>>> +options ALT_BREAK_TO_DEBUGGER
>>>>>>>>
>>>>>>>> # Make an SMP-capable kernel by default
>>>>>>>> options SMP # Symmetric MultiProcessor Kernel
>>>>>>>>
>>>>>>> OK, it happened again after 2 days; the process was running about
>>>>>>> 3200 threads. I'm trying to break into DDB and will let you know;
>>>>>>> so far I haven't had much success...
>>>>>>>
>>>>>> No luck. Neither BREAK nor ALT_BREAK responds. I will not touch the
>>>>>> system for the next few hours, in case you want me to test something
>>>>>> on it. If 8.2-RELEASE or 9.0-RELEASE is not meant to work reliably
>>>>>> on top of a 7.4-RELEASE userland, I will set up the test again on a
>>>>>> clean 9.0-RELEASE system and retry.
>>>>>
>>>>> We allow the KBI to break when new releases happen, so this may cause
>>>>> breakage for you, even if a deadlock is really not something you want.
>>>>>
>>>>> Can you try enabling SW_WATCHDOG and DEADLKRES, and possibly arming
>>>>> your ichwd? Even if the breakage involves clocks or interrupt
>>>>> sources, there is still a chance they will be able to catch it.
>>>>>
>>>>> However, it seems you are not set up with a proper serial console?
>>>> The serial console is definitely working fine. I can break into DDB
>>>> at will while the test is running. I did not test ALT_BREAK per se,
>>>> but BREAK does work.
>>>
>>> So if you try to break into DDB via a serial break, it doesn't work?
>>> That is definitely very bad...
>>>
>> Just to be sure, I rebooted the system and could break into DDB on
>> the first attempt with ALT_BREAK; BREAK was a bit more reluctant but
>> worked too. So yes, this does not taste good :/
>>
>>> Can you try with the options I mentioned earlier and see if something
>>> changes?
>>>
>> Will do, but I will first attempt to reproduce this on 9.0-RELEASE.
>>
> 9.0-RELEASE (kernel + userland) hung today while running 2000
> threads. Next step is to reproduce it with a watchdog+textdump enabled
> kernel.

And you were still unable to break into DDB, right?

Attilio

-- 
Peace can only be achieved by understanding - A. Einstein
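The diagnostics suggested in this thread could be folded into a single kernel configuration that extends the GENERIC diff quoted above. This is only a sketch, assuming the same amd64 GENERIC base as that diff, and it has not been tried against the hang described here. Note that SW_WATCHDOG does nothing until userland arms it through the watchdog(4) interface (for example with watchdogd(8)), and that ichwd(4) can be loaded as a module with kldload(8) instead of being compiled in.

    # Debugger support, as in the patch quoted above
    options     KDB                     # Kernel debugger related code
    options     KDB_TRACE               # Print a stack trace for a panic
    options     DDB                     # Interactive debugger; textdump(4) also needs it
    options     BREAK_TO_DEBUGGER       # A serial BREAK enters the debugger
    options     ALT_BREAK_TO_DEBUGGER   # A console key sequence enters the debugger

    # Watchdogs suggested for catching the hang
    options     SW_WATCHDOG             # Software watchdog, armed via watchdog(4)
    options     DEADLKRES               # Deadlock resolver: panics on threads blocked too long

The textdump itself is not a kernel option beyond DDB; it is driven from DDB scripts, as described in textdump(4) and ddb(4).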