From: Attilio Rao
To: Hiroki Sato
Cc: freebsd-stable@freebsd.org, sterling@camdensoftware.com, avg@freebsd.org, Nick Esborn, kostikbel@gmail.com, mdtansca@freebsd.org
Date: Thu, 18 Aug 2011 03:04:32 +0200
Subject: Re: panic: spin lock held too long (RELENG_8 from today)
In-Reply-To: <20110818.091600.831954331552558249.hrs@allbsd.org>
References: <20110818.023832.373949045518579359.hrs@allbsd.org> <20110818.043332.27079545013461535.hrs@allbsd.org> <20110818.091600.831954331552558249.hrs@allbsd.org>

2011/8/18 Hiroki Sato:
> Hiroki Sato wrote
>  in <20110818.043332.27079545013461535.hrs@allbsd.org>:
>
> hr> Attilio Rao wrote
> hr>   in :
> hr>
> hr> at> 2011/8/17 Hiroki Sato:
> hr> at> > Hi,
> hr> at> >
> hr> at> > Mike Tancsa wrote
> hr> at> >  in <4E15A08C.6090407@sentex.net>:
> hr> at> >
> hr> at> > mi> On 7/7/2011 7:32 AM, Mike Tancsa wrote:
> hr> at> > mi> > On 7/7/2011 4:20 AM, Kostik Belousov wrote:
> hr> at> > mi> >>
> hr> at> > mi> >> BTW, we had a similar panic, "spinlock held too long", the spinlock
> hr> at> > mi> >> is the sched lock N, on busy 8-core box recently upgraded to the
> hr> at> > mi> >> stable/8. Unfortunately, machine hung dumping core, so the stack trace
> hr> at> > mi> >> for the owner thread was not available.
> hr> at> > mi> >>
> hr> at> > mi> >> I was unable to make any conclusion from the data that was present.
> hr> at> > mi> >> If the situation is reproducible, you could try to revert r221937. This
> hr> at> > mi> >> is pure speculation, though.
> hr> at> > mi> >
> hr> at> > mi> > Another crash just now after 5hrs uptime. I will try and revert r221937
> hr> at> > mi> > unless there is any extra debugging you want me to add to the kernel
> hr> at> > mi> > instead?
> hr> at> >
> hr> at> >  I am also suffering from a reproducible panic on an 8-STABLE box, an
> hr> at> >  NFS server with heavy I/O load.  I could not get a kernel dump
> hr> at> >  because this panic locked up the machine just after it occurred, but
> hr> at> >  according to the stack trace it was the same as the posted one.
> hr> at> >  Switching to an 8.2R kernel can prevent this panic.
> hr> at> >
> hr> at> >  Any progress on the investigation?
> hr> at>
> hr> at> Hiroki,
> hr> at> how easily can you reproduce it?
> hr>
> hr>  It takes 5-10 hours.  I installed another kernel for debugging just
> hr>  now, so I think I will be able to collect more detailed information in
> hr>  a couple of days.
> hr>
> hr> at> It would be important to have a DDB textdump with this information:
> hr> at> - bt
> hr> at> - ps
> hr> at> - show allpcpu
> hr> at> - alltrace
> hr> at>
> hr> at> Alternatively, a coredump which has the stop cpu patch which Andriy can provide.
> hr>
> hr>  Okay, I will post them once I can get another panic.  Thanks!
>
>  I got the panic with a crash dump this time.  The result of bt, ps,
>  allpcpu, and traces can be found at the following URL:
>
>  http://people.allbsd.org/~hrs/FreeBSD/pool-panic_20110818-1.txt

Actually, I think I see the bug here.

In callout_cpu_switch(), if a low-priority thread is migrating the
callout and gets preempted after the outgoing CPU's queue lock has been
released (and is scheduled again only much later), we get this problem.

A critical section might be enough to fix this bug, but I think this
should really be interrupt safe, so I'd wrap the lock switch in
spinlock_enter()/spinlock_exit().
Fortunately, callout_cpu_switch() should be called rarely, and we
already do expensive locking operations in callout, so we should not
have a problem performance-wise.

Can the people I CC'ed here try the following patch, with all the
initial kernel options that were leading you to the deadlock? (So
please revert, for the moment, any debugging patch/option you added.)

http://www.freebsd.org/~attilio/callout-fixup.diff

Please note that this patch is for STABLE_8. If you can confirm the
good result, I'll commit it to -CURRENT and then backmerge as soon as
possible.

Thanks,
Attilio

-- 
Peace can only be achieved by understanding - A. Einstein
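
For anyone following the thread without the diff at hand, the window
described above can be modelled in userspace C. This is only a rough
sketch under stated assumptions, not the contents of callout-fixup.diff:
every name below (model_cpu_switch(), MIGRATING, the queue structures,
the model_spinlock_*() counters) is made up for illustration, and the
real kernel code may be structured differently. The model just records
whether the moment at which no queue lock is held was reached with
interrupts/preemption still enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCPU      2
#define MIGRATING (-1)  /* hypothetical marker: callout belongs to no CPU queue */

/* Hypothetical stand-ins for kernel state; none of this is the real API. */
struct cpu_queue     { bool locked; };
struct callout_model { int cpu; };

static struct cpu_queue queues[NCPU];
static int  intr_disable_depth;  /* models spinlock_enter() nesting depth */
static bool unsafe_window_seen;  /* set if the handoff ran while preemptible */

static void model_spinlock_enter(void) { intr_disable_depth++; }

static void model_spinlock_exit(void)
{
    assert(intr_disable_depth > 0);
    intr_disable_depth--;
}

static void queue_lock(struct cpu_queue *q)   { assert(!q->locked); q->locked = true; }
static void queue_unlock(struct cpu_queue *q) { assert(q->locked);  q->locked = false; }

/* Reached when the old queue lock is dropped and the new one is not yet held. */
static void handoff_window(void)
{
    if (intr_disable_depth == 0)
        unsafe_window_seen = true;  /* a preemption here would stall waiters */
}

/*
 * Move callout c from the locked queue 'old' to the queue of new_cpu and
 * return that queue, locked. With protect == true the handoff is wrapped
 * in the modelled spinlock_enter()/spinlock_exit() pair.
 */
static struct cpu_queue *
model_cpu_switch(struct callout_model *c, struct cpu_queue *old,
    int new_cpu, bool protect)
{
    struct cpu_queue *new_q = &queues[new_cpu];

    c->cpu = MIGRATING;
    if (protect)
        model_spinlock_enter();
    queue_unlock(old);
    handoff_window();
    queue_lock(new_q);
    if (protect)
        model_spinlock_exit();
    c->cpu = new_cpu;
    return (new_q);
}
```

With protect set, the handoff window is never observed with the counter
at zero; without it, the window is flagged. The counter only stands in
for the idea of disabling interrupts and preemption and says nothing
about how the kernel implements spinlock_enter().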