Date: Mon, 5 Mar 2012 13:39:18 -0500
From: Arnaud Lacombe
To: Attilio Rao
Cc: freebsd-stable
Subject: Re: Complete hang on 9.0-RELEASE

Hi,

On Wed, Feb 29, 2012 at 2:31 PM, Arnaud Lacombe wrote:
> Hi,
>
> On Wed, Feb 29, 2012 at 2:22 PM, Attilio Rao wrote:
>> 2012/2/29, Arnaud Lacombe :
>>> Hi,
>>>
>>> On Wed, Feb 29, 2012 at 1:44 PM, Attilio Rao wrote:
>>>> 2012/2/29, Arnaud Lacombe :
>>>>> Hi,
>>>>>
>>>>> On Wed, Feb 29, 2012 at 12:59 PM, Arnaud Lacombe wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Mon, Feb 27, 2012 at 12:48 PM, Arnaud Lacombe wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Mon, Feb 27, 2012 at 10:36 AM, Attilio Rao wrote:
>>>>>>>> 2012/2/27, Arnaud Lacombe :
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On Tue, Feb 14, 2012 at 11:41 AM, Arnaud Lacombe wrote:
>>>>>>>>>> Hi folks,
>>>>>>>>>>
>>>>>>>>>> For the records, I was running some tests yesterday on top of a
>>>>>>>>>> 9.0-RELEASE, amd64, kernel when the box hanged. At the time of the
>>>>>>>>>> hang, the box was running a process with about 2800 threads with
>>>>>>>>>> heavy IPC between 1400 writers and 1400 readers. The box was in
>>>>>>>>>> single user mode (/bin/sh coming from FreeBSD 7.4-STABLE). Here is
>>>>>>>>>> the beginning of the dmesg:
>>>>>>>>>>
>>>>>>>>> This happened a second time, now with FreeBSD 8.2-RELEASE. Complete
>>>>>>>>> machine hang. The machine was running about 4000 threads in a single
>>>>>>>>> process, all the other condition are the same.
>>>>>>>>
>>>>>>>> Arnaud,
>>>>>>>> can you please break in your kernel via KDB, collect the following
>>>>>>>> informations from the DDB prompt:
>>>>>>>> - ps
>>>>>>>> - alltrace
>>>>>>>> - show allpcpu
>>>>>>>> - possibly get a coredump with 'call doadump'
>>>>>>>>
>>>>>>> Will do, but I'll need to rebuild a kernel to include DDB.
>>>>>>>
>>>>>>>> and in the end provide all those along with kernel binary and possibly
>>>>>>>> sources somewhere?
>>>>>>>>
>>>>>>> I'll be testing a bare `release/8.2.0' with the following patch:
>>>>>>>
>>>>>>> diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
>>>>>>> index c3e0095..7bd997f 100644
>>>>>>> --- a/sys/amd64/conf/GENERIC
>>>>>>> +++ b/sys/amd64/conf/GENERIC
>>>>>>> @@ -79,6 +79,10 @@ options      INCLUDE_CONFIG_FILE     # Include this file in kernel
>>>>>>>
>>>>>>>  options        KDB           # Kernel debugger related code
>>>>>>>  options        KDB_TRACE     # Print a stack trace for a panic
>>>>>>> +options        DDB
>>>>>>> +options        BREAK_TO_DEBUGGER
>>>>>>> +options        ALT_BREAK_TO_DEBUGGER
>>>>>>>
>>>>>>>  # Make an SMP-capable kernel by default
>>>>>>>  options        SMP           # Symmetric MultiProcessor Kernel
>>>>>>>
>>>>>> ok, it happened again after 2 days, the process was running about 3200
>>>>>> threads. I'm trying to break into DDB and let you know, I'm not that
>>>>>> successful for now...
>>>>>>
>>>>> No luck. None of BREAK or ALT_BREAK are responding. I will not touch
>>>>> the system in the next few hours if you want me to test something on
>>>>> it. In the event of 8.2-RELEASE or 9.0-RELEASE are not meant to work
>>>>> reliably on top of a 7.4-RELEASE userland, I will re-setup the test to
>>>>> occurs on a clean 9.0-RELEASE system and re-try.
>>>>
>>>> We allow to break KBI when new releases happens, thus this may cause a
>>>> breakage for you, even if a deadlock is really not something you want.
>>>>
>>>> Can you try enabling SW_WATCHDOG, DEADLKRES and possibly arm your ichwd?
>>>> if the breakage involves clocks or interrupt sources there are still
>>>> chances they will be able to catch it though.
>>>>
>>>> However, it doesn't seem you are setup with a proper serial console?
>>> The serial console is working definitively fine. I can break into DDB
>>> at will when the test is running. I did not test with ALT_BREAK
>>> per-se, but BREAK does work.
>>
>> So if you try to break in DDB via serial break it doesn't work?
>> That is definitively very bad...
>>
> just to be sure, I rebooted the system and I could break into DDB at
> the first attempt with ALT_BREAK, BREAK was a bit more reluctant but
> worked too. So yes, this does not taste good :/
>
>> Can you try with the options I mentioned earlier and see if something changes?
>>
> will do, but I will first attempt to reproduce this on 9.0-RELEASE.
>
9.0-RELEASE (kernel + userland) hung today while running 2000 threads.
Next step is to reproduce it with a watchdog+textdump enabled kernel.

 - Arnaud
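
For reference, a watchdog+textdump kernel along the lines discussed in this thread could be configured roughly as below. This is only a sketch, not the configuration actually used for the test: it assumes the fragment is appended to a copy of sys/amd64/conf/GENERIC, that the board has an Intel ICH watchdog for ichwd(4) to drive, and that watchdogd(8) is started from userland to arm the timers. The TEXTDUMP_PREFER line is my assumption about what "textdump enabled" would mean here.

```
# Debugger support (same options as in the patch quoted earlier in the thread)
options         KDB                     # Kernel debugger related code
options         KDB_TRACE               # Print a stack trace for a panic
options         DDB                     # Interactive kernel debugger
options         BREAK_TO_DEBUGGER       # Serial BREAK enters the debugger
options         ALT_BREAK_TO_DEBUGGER   # Alternate break sequence enters the debugger

# Hang detection, as suggested by Attilio
options         SW_WATCHDOG             # Software watchdog, armed by watchdogd(8)
options         DEADLKRES               # Deadlock resolver thread
device          ichwd                   # Intel ICH hardware watchdog (assumes matching hardware)

# Assumption: prefer a textdump over a full memory dump when dumping
options         TEXTDUMP_PREFER
```

With such a kernel, the hope is that one of the watchdogs fires even when serial BREAK no longer responds, so that the output of ps, alltrace and show allpcpu ends up in the dump.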