From owner-freebsd-scsi@FreeBSD.ORG Tue Apr 26 05:52:06 2011 Return-Path: Delivered-To: scsi@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 646F2106566B; Tue, 26 Apr 2011 05:52:06 +0000 (UTC) (envelope-from Andre.Albsmeier@siemens.com) Received: from thoth.sbs.de (thoth.sbs.de [192.35.17.2]) by mx1.freebsd.org (Postfix) with ESMTP id D8A128FC08; Tue, 26 Apr 2011 05:52:05 +0000 (UTC) Received: from mail2.siemens.de (localhost [127.0.0.1]) by thoth.sbs.de (8.13.6/8.13.6) with ESMTP id p3Q5q4L0018787; Tue, 26 Apr 2011 07:52:04 +0200 Received: from curry.mchp.siemens.de (curry.mchp.siemens.de [139.25.40.130]) by mail2.siemens.de (8.13.6/8.13.6) with ESMTP id p3Q5q4Yo005703; Tue, 26 Apr 2011 07:52:04 +0200 Received: (from localhost) by curry.mchp.siemens.de (8.14.4/8.14.4) id p3Q5q4Qs039552; Date: Tue, 26 Apr 2011 07:52:04 +0200 From: Andre Albsmeier To: John Baldwin Message-ID: <20110426055204.GA59975@curry.mchp.siemens.de> References: <201102041444.p14EixJP087709@svn.freebsd.org> <201104190920.25924.jhb@freebsd.org> <20110420053226.GA22854@curry.mchp.siemens.de> <201104251703.18518.jhb@freebsd.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <201104251703.18518.jhb@freebsd.org> X-Echelon: X-Advice: Drop that crappy M$-Outlook, I'm tired of your viruses! User-Agent: Mutt/1.5.20 (2009-06-14) Cc: "svn-src-stable-7@freebsd.org" , "scsi@freebsd.org" Subject: Re: svn commit: r218277 - in stable/7/sys: kern sys X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 26 Apr 2011 05:52:06 -0000 On Mon, 25-Apr-2011 at 23:03:18 +0200, John Baldwin wrote: > On Wednesday, April 20, 2011 1:32:26 am Andre Albsmeier wrote: > > On Tue, 19-Apr-2011 at 15:20:25 +0200, John Baldwin wrote: > > > On Tuesday, April 19, 2011 8:56:48 am Andre Albsmeier wrote: > > > > On Mon, 18-Apr-2011 at 15:18:25 +0200, John Baldwin wrote: > > > > > On Monday, April 18, 2011 7:36:57 am Andre Albsmeier wrote: > > > > > > On Fri, 15-Apr-2011 at 18:35:05 +0200, John Baldwin wrote: > > > > > > > On Friday, April 15, 2011 9:25:25 am Andre Albsmeier wrote: > > > > > > > > On Fri, 04-Feb-2011 at 14:44:59 +0000, John Baldwin wrote: > > > > > > > > > Author: jhb > > > > > > > > > Date: Fri Feb 4 14:44:59 2011 > > > > > > > > > New Revision: 218277 > > > > > > > > > URL: http://svn.freebsd.org/changeset/base/218277 > > > > > > > > > > > > > > > > > > Log: > > > > > > > > > MFC 217075: > > > > > > > > > Retire PCONFIG and leave the priority of thread0 alone when > waiting for > > > > > > > > > interrupt config hooks to execute. > > > > > > > > > > > > > > > > > > To preserve the KBI, I did not renumber priorities but > simply removed > > > > > > > > > PCONFIG. > > > > > > > > > > > > > > > > > > Modified: > > > > > > > > > stable/7/sys/kern/subr_autoconf.c > > > > > > > > > stable/7/sys/sys/priority.h > > > > > > > > > Directory Properties: > > > > > > > > > stable/7/sys/ (props changed) > > > > > > > > > stable/7/sys/cddl/contrib/opensolaris/ (props changed) > > > > > > > > > stable/7/sys/contrib/dev/acpica/ (props changed) > > > > > > > > > stable/7/sys/contrib/pf/ (props changed) > > > > > > > > > > > > > > > > > > Modified: stable/7/sys/kern/subr_autoconf.c > > > > > > > > > > > > > > > > > ============================================================================== > > > > > > > > > --- stable/7/sys/kern/subr_autoconf.c Fri Feb 4 14:44:42 > 2011 > > > > > > > (r218276) > > > > > > > > > +++ stable/7/sys/kern/subr_autoconf.c Fri Feb 4 14:44:59 > 2011 > > > > > > > (r218277) > > > > > > > > > @@ -108,7 +108,7 @@ run_interrupt_driven_config_hooks(dummy) > > > > > > > > > warned = 0; > > > > > > > > > while (!TAILQ_EMPTY(&intr_config_hook_list)) { > > > > > > > > > if (msleep(&intr_config_hook_list, > &intr_config_hook_lock, > > > > > > > > > - PCONFIG, "conifhk", WARNING_INTERVAL_SECS * hz) == > > > > > > > > > + 0, "conifhk", WARNING_INTERVAL_SECS * hz) == > > > > > > > > > EWOULDBLOCK) { > > > > > > > > > mtx_unlock(&intr_config_hook_lock); > > > > > > > > > warned++; > > > > > > > > > > > > > > > > > > > > > > > > This broke several of my machines in a somewhat strange way: > > > > > > > > > > > > > > > > After upgrading them (17) to a recent 7-STABLE (as of > 2011-04-12) > > > > > > > > I noticed that some (4) of them didn't start. All 4 didn't find > > > > > > > > their boot device anymore. What they all got in common is: > > > > > > > > > > > > > > > > - an Adaptec 2940 Ultra SCSI adapter > > > > > > > > - two SCSI harddisks (da0 and da1) of various brands > > > > > > > > - one SCSI CDROM drive (cd0) > > > > > > > > > > > > > > > > To be exact, none of the three devices (da0, da1, cd0) were > > > > > > > > detected at all. Other machines with a similar configuration > > > > > > > > (2940 and da0/da1) but _without_ the CDROM drive didn't have > > > > > > > > any problems. So I simply removed the CDROM drives on the 4 > > > > > > > > machines in question and they all booted again. > > > > > > > > > > > > > > > > Today I decided to dig into this and after reverting(*) the > > > > > > > > above change, they worked with the CDROM again. I have cross- > > > > > > > > checked it 3 times. No idea what's happening here... > > > > > > > > > > > > > > > > -Andre > > > > > > > > > > > > > > > > (*) To be honest, I use this patch so I had to modify only one > file: > > > > > > > > > > > > > > > > --- sys/kern/subr_autoconf.c.ORI 2011-02-05 13:14:11.000000000 > +0100 > > > > > > > > +++ sys/kern/subr_autoconf.c 2011-04-15 14:34:31.000000000 > +0200 > > > > > > > > @@ -108,7 +108,7 @@ > > > > > > > > warned = 0; > > > > > > > > while (!TAILQ_EMPTY(&intr_config_hook_list)) { > > > > > > > > if (msleep(&intr_config_hook_list, &intr_config_hook_lock, > > > > > > > > - 0, "conifhk", WARNING_INTERVAL_SECS * hz) == > > > > > > > > + PRI_MIN_KERN + 32, "conifhk", WARNING_INTERVAL_SECS * > hz) == > > > > > > > > EWOULDBLOCK) { > > > > > > > > mtx_unlock(&intr_config_hook_lock); > > > > > > > > warned++; > > > > > > > > > > > > > > Do you get any warnings about CAM timeouts, etc. when these probe? > A verbose > > > > > > > dmesg might be nice to look at if possible. > > > > > > > > > > > > OK, I have set up a machine for testing. In my other mail > > > > > > I was wrong saying that the pass devices appear when using > > > > > > the problematic kernel... > > > > > > > > > > > > Here are the dmesgs: > > > > > > > > > > > > - dmesg_bad is the original kernel as of Friday > > > > > > - dmesg_ok is the patched kernel (see above) as of Friday > > > > > > - dmesg.diff is the diff between both > > > > > > > > > > > > If you want me to try something just tell me... > > > > > > > > > > Hmmm, what if you make SCSI_DELAY larger? Also, can you let it fail > the > > > > > mount and drop into ddb and then get 'ps' output? > > > > > > > > As soon as I include the debugger into the kernel the problem > > > > is gone. I have double-checked it two times now: With debugger > > > > the drives are detected, without debugger mostly (but not always) > > > > not. > > > > > > > > I currently have it running in an endless rebooting loop hoping, > > > > that it fails eventually... > > > > > > Hummm. This seems like it is a timing related race. :( > > > > Success! Sometimes at night it finally panic'ed even with the > > debugger in the kernel. Here is the output of 'ps' and some other > > commands I remembered (no idea if any of these make sense in this > > context :-)). It is still in this state with the serial console > > attached so just tell me what to type ;-). > > > > > > KDB: enter: manual escape to debugger > > [thread pid 1 tid 100001 ] > > Stopped at kdb_enter_why+0x3b: xorl %eax,%eax > > db> ps > > pid ppid pgrp uid state wmesg wchan cmd > > 35 0 0 0 RL [softdepflush] > > 34 0 0 0 RL [syncer] > > 33 0 0 0 RL [vnlru] > > 32 0 0 0 RL [bufdaemon] > > 31 0 0 0 RL [pagezero] > > 30 0 0 0 RL [idlepoll] > > 29 0 0 0 RL [vmdaemon] > > 28 0 0 0 RL [pagedaemon] > > 27 0 0 0 WL [irq1: atkbd0] > > 26 0 0 0 WL [swi0: uart uart] > > 25 0 0 0 SL - 0xc182a63c [fdc0] > > 24 0 0 0 SL idle 0xc1829600 [aic_recovery0] > > 23 0 0 0 WL [irq11: ahc0] > > 22 0 0 0 SL idle 0xc1829600 [aic_recovery0] > > 21 0 0 0 WL [irq10: fxp0] > > 20 0 0 0 WL [irq9: acpi0 intsmb0] > > 19 0 0 0 SL - 0xc181b800 [kqueue taskq] > > 18 0 0 0 WL [swi6: task queue] > > 17 0 0 0 WL [swi6: Giant taskq] > > --More-- 9 0 0 0 RL [thread > taskq] > > 16 0 0 0 WL [swi5: Fast task queue] > > 15 0 0 0 WL [swi2: cambio] > > 8 0 0 0 SL ccb_scan 0xc0766714 [xpt_thrd] > > 7 0 0 0 SL - 0xc181bd80 [acpi_task_2] > > 6 0 0 0 SL - 0xc181bd80 [acpi_task_1] > > 5 0 0 0 SL - 0xc181bd80 [acpi_task_0] > > 14 0 0 0 SL - 0xc077be54 [yarrow] > > 4 0 0 0 SL - 0xc077942c [g_down] > > 3 0 0 0 SL - 0xc0779428 [g_up] > > 2 0 0 0 SL - 0xc0779420 [g_event] > > 13 0 0 0 WL [swi3: vm] > > 12 0 0 0 LL *Giant 0xc1821dc0 [swi4: clock] > > 11 0 0 0 WL [swi1: net] > > 10 0 0 0 RL [idle] > > 1 0 0 0 RL CPU 0 [swapper] > > 0 0 0 0 SLs sched 0xc07794c0 [swapper] > > Hummm, I don't see any "important" threads in a runnable state that I > would think were denied the ability to run by the priority change. I wonder > what callout swi4 is trying to run (or what other callouts might be > stuck waiting for Giant). I'd be tempted to add some KTR traces or > debugging printfs to try to figure out what sequence of events is > happening when (e.g., when the xpt thread kicks off each scan CCB, when > xpt_rescan_done() is called, and probably some events in the ahc driver). I am happy to try whatever you suggest ;-) But I don't want to steel too much of your time -- it seems that by reverting the initial commit everythings runs fine for me and I can easily live with that... Thanks, -Andre > > -- > John Baldwin -- UNIX is an operating system, OS/2 is half an operating system, Windows is a shell, and DOS is a bootsector virus.