Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 26 Apr 2011 07:52:04 +0200
From:      Andre Albsmeier <Andre.Albsmeier@siemens.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        "svn-src-stable-7@freebsd.org" <svn-src-stable-7@freebsd.org>, "scsi@freebsd.org" <scsi@freebsd.org>
Subject:   Re: svn commit: r218277 - in stable/7/sys: kern sys
Message-ID:  <20110426055204.GA59975@curry.mchp.siemens.de>
In-Reply-To: <201104251703.18518.jhb@freebsd.org>
References:  <201102041444.p14EixJP087709@svn.freebsd.org> <201104190920.25924.jhb@freebsd.org> <20110420053226.GA22854@curry.mchp.siemens.de> <201104251703.18518.jhb@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 25-Apr-2011 at 23:03:18 +0200, John Baldwin wrote:
> On Wednesday, April 20, 2011 1:32:26 am Andre Albsmeier wrote:
> > On Tue, 19-Apr-2011 at 15:20:25 +0200, John Baldwin wrote:
> > > On Tuesday, April 19, 2011 8:56:48 am Andre Albsmeier wrote:
> > > > On Mon, 18-Apr-2011 at 15:18:25 +0200, John Baldwin wrote:
> > > > > On Monday, April 18, 2011 7:36:57 am Andre Albsmeier wrote:
> > > > > > On Fri, 15-Apr-2011 at 18:35:05 +0200, John Baldwin wrote:
> > > > > > > On Friday, April 15, 2011 9:25:25 am Andre Albsmeier wrote:
> > > > > > > > On Fri, 04-Feb-2011 at 14:44:59 +0000, John Baldwin wrote:
> > > > > > > > > Author: jhb
> > > > > > > > > Date: Fri Feb  4 14:44:59 2011
> > > > > > > > > New Revision: 218277
> > > > > > > > > URL: http://svn.freebsd.org/changeset/base/218277
> > > > > > > > > 
> > > > > > > > > Log:
> > > > > > > > >   MFC 217075:
> > > > > > > > >   Retire PCONFIG and leave the priority of thread0 alone when 
> waiting for
> > > > > > > > >   interrupt config hooks to execute.
> > > > > > > > >   
> > > > > > > > >   To preserve the KBI, I did not renumber priorities but 
> simply removed
> > > > > > > > >   PCONFIG.
> > > > > > > > > 
> > > > > > > > > Modified:
> > > > > > > > >   stable/7/sys/kern/subr_autoconf.c
> > > > > > > > >   stable/7/sys/sys/priority.h
> > > > > > > > > Directory Properties:
> > > > > > > > >   stable/7/sys/   (props changed)
> > > > > > > > >   stable/7/sys/cddl/contrib/opensolaris/   (props changed)
> > > > > > > > >   stable/7/sys/contrib/dev/acpica/   (props changed)
> > > > > > > > >   stable/7/sys/contrib/pf/   (props changed)
> > > > > > > > > 
> > > > > > > > > Modified: stable/7/sys/kern/subr_autoconf.c
> > > > > > > > > 
> > > > > > > 
> ==============================================================================
> > > > > > > > > --- stable/7/sys/kern/subr_autoconf.c	Fri Feb  4 14:44:42 
> 2011	
> > > > > > > (r218276)
> > > > > > > > > +++ stable/7/sys/kern/subr_autoconf.c	Fri Feb  4 14:44:59 
> 2011	
> > > > > > > (r218277)
> > > > > > > > > @@ -108,7 +108,7 @@ run_interrupt_driven_config_hooks(dummy)
> > > > > > > > >  	warned = 0;
> > > > > > > > >  	while (!TAILQ_EMPTY(&intr_config_hook_list)) {
> > > > > > > > >  		if (msleep(&intr_config_hook_list, 
> &intr_config_hook_lock,
> > > > > > > > > -		    PCONFIG, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > > > > > > > +		    0, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > > > > > > >  		    EWOULDBLOCK) {
> > > > > > > > >  			mtx_unlock(&intr_config_hook_lock);
> > > > > > > > >  			warned++;
> > > > > > > > 
> > > > > > > > 
> > > > > > > > This broke several of my machines in a somewhat strange way:
> > > > > > > > 
> > > > > > > > After upgrading them (17) to a recent 7-STABLE (as of 
> 2011-04-12)
> > > > > > > > I noticed that some (4) of them didn't start. All 4 didn't find
> > > > > > > > their boot device anymore. What they all got in common is:
> > > > > > > > 
> > > > > > > > - an Adaptec 2940 Ultra SCSI adapter
> > > > > > > > - two SCSI harddisks (da0 and da1) of various brands
> > > > > > > > - one SCSI CDROM drive (cd0)
> > > > > > > > 
> > > > > > > > To be exact, none of the three devices (da0, da1, cd0) were
> > > > > > > > detected at all. Other machines with a similar configuration
> > > > > > > > (2940 and da0/da1) but _without_ the CDROM drive didn't have
> > > > > > > > any problems. So I simply removed the CDROM drives on the 4
> > > > > > > > machines in question and they all booted again.
> > > > > > > > 
> > > > > > > > Today I decided to dig into this and after reverting(*) the
> > > > > > > > above change, they worked with the CDROM again. I have cross-
> > > > > > > > checked it 3 times. No idea what's happening here...
> > > > > > > > 
> > > > > > > > 	-Andre
> > > > > > > > 
> > > > > > > > (*) To be honest, I use this patch so I had to modify only one 
> file:
> > > > > > > > 
> > > > > > > > --- sys/kern/subr_autoconf.c.ORI	2011-02-05 13:14:11.000000000 
> +0100
> > > > > > > > +++ sys/kern/subr_autoconf.c	2011-04-15 14:34:31.000000000 
> +0200
> > > > > > > > @@ -108,7 +108,7 @@
> > > > > > > >  	warned = 0;
> > > > > > > >  	while (!TAILQ_EMPTY(&intr_config_hook_list)) {
> > > > > > > >  		if (msleep(&intr_config_hook_list, &intr_config_hook_lock,
> > > > > > > > -		    0, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > > > > > > +		    PRI_MIN_KERN + 32, "conifhk", WARNING_INTERVAL_SECS * 
> hz) ==
> > > > > > > >  		    EWOULDBLOCK) {
> > > > > > > >  			mtx_unlock(&intr_config_hook_lock);
> > > > > > > >  			warned++;
> > > > > > > 
> > > > > > > Do you get any warnings about CAM timeouts, etc. when these probe?  
> A verbose 
> > > > > > > dmesg might be nice to look at if possible.
> > > > > > 
> > > > > > OK, I have set up a machine for testing. In my other mail
> > > > > > I was wrong saying that the pass devices appear when using
> > > > > > the problematic kernel...
> > > > > > 
> > > > > > Here are the dmesgs:
> > > > > > 
> > > > > > - dmesg_bad is the original kernel as of Friday
> > > > > > - dmesg_ok is the patched kernel (see above) as of Friday
> > > > > > - dmesg.diff is the diff between both
> > > > > > 
> > > > > > If you want me to try something just tell me...
> > > > > 
> > > > > Hmmm, what if you make SCSI_DELAY larger?  Also, can you let it fail 
> the
> > > > > mount and drop into ddb and then get 'ps' output?
> > > > 
> > > > As soon as I include the debugger into the kernel the problem
> > > > is gone. I have double-checked it two times now: With debugger
> > > > the drives are detected, without debugger mostly (but not always)
> > > > not.
> > > > 
> > > > I currently have it running in an endless rebooting loop hoping,
> > > > that it fails eventually...
> > > 
> > > Hummm.  This seems like it is a timing related race. :(
> > 
> > Success! Sometimes at night it finally panic'ed even with the
> > debugger in the kernel. Here is the output of 'ps' and some other
> > commands I remembered (no idea if any of these make sense in this
> > context :-)). It is still in this state with the serial console
> > attached so just tell me what to type ;-).
> > 
> > 
> > KDB: enter: manual escape to debugger
> > [thread pid 1 tid 100001 ]
> > Stopped at      kdb_enter_why+0x3b:     xorl    %eax,%eax
> > db> ps
> >   pid  ppid  pgrp   uid   state   wmesg     wchan    cmd
> >    35     0     0     0  RL                          [softdepflush]
> >    34     0     0     0  RL                          [syncer]
> >    33     0     0     0  RL                          [vnlru]
> >    32     0     0     0  RL                          [bufdaemon]
> >    31     0     0     0  RL                          [pagezero]
> >    30     0     0     0  RL                          [idlepoll]
> >    29     0     0     0  RL                          [vmdaemon]
> >    28     0     0     0  RL                          [pagedaemon]
> >    27     0     0     0  WL                          [irq1: atkbd0]
> >    26     0     0     0  WL                          [swi0: uart uart]
> >    25     0     0     0  SL      -        0xc182a63c [fdc0]
> >    24     0     0     0  SL      idle     0xc1829600 [aic_recovery0]
> >    23     0     0     0  WL                          [irq11: ahc0]
> >    22     0     0     0  SL      idle     0xc1829600 [aic_recovery0]
> >    21     0     0     0  WL                          [irq10: fxp0]
> >    20     0     0     0  WL                          [irq9: acpi0 intsmb0]
> >    19     0     0     0  SL      -        0xc181b800 [kqueue taskq]
> >    18     0     0     0  WL                          [swi6: task queue]
> >    17     0     0     0  WL                          [swi6: Giant taskq]
> > --More--            9     0     0     0  RL                          [thread 
> taskq]
> >    16     0     0     0  WL                          [swi5: Fast task queue]
> >    15     0     0     0  WL                          [swi2: cambio]
> >     8     0     0     0  SL      ccb_scan 0xc0766714 [xpt_thrd]
> >     7     0     0     0  SL      -        0xc181bd80 [acpi_task_2]
> >     6     0     0     0  SL      -        0xc181bd80 [acpi_task_1]
> >     5     0     0     0  SL      -        0xc181bd80 [acpi_task_0]
> >    14     0     0     0  SL      -        0xc077be54 [yarrow]
> >     4     0     0     0  SL      -        0xc077942c [g_down]
> >     3     0     0     0  SL      -        0xc0779428 [g_up]
> >     2     0     0     0  SL      -        0xc0779420 [g_event]
> >    13     0     0     0  WL                          [swi3: vm]
> >    12     0     0     0  LL     *Giant    0xc1821dc0 [swi4: clock]
> >    11     0     0     0  WL                          [swi1: net]
> >    10     0     0     0  RL                          [idle]
> >     1     0     0     0  RL      CPU 0               [swapper]
> >     0     0     0     0  SLs     sched    0xc07794c0 [swapper]
> 
> Hummm, I don't see any "important" threads in a runnable state that I
> would think were denied the ability to run by the priority change.  I wonder
> what callout swi4 is trying to run (or what other callouts might be
> stuck waiting for Giant).  I'd be tempted to add some KTR traces or
> debugging printfs to try to figure out what sequence of events is
> happening when (e.g., when the xpt thread kicks off each scan CCB, when 
> xpt_rescan_done() is called, and probably some events in the ahc driver).

I am happy to try whatever you suggest ;-) But I don't want to
steel too much of your time -- it seems that by reverting the
initial commit everythings runs fine for me and I can easily
live with that...

Thanks,

	-Andre

> 
> -- 
> John Baldwin

-- 
UNIX is an operating system, OS/2 is half an operating system,
Windows is a shell, and DOS is a bootsector virus.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20110426055204.GA59975>