Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 20 Apr 2011 07:32:26 +0200
From:      Andre Albsmeier <Andre.Albsmeier@siemens.com>
To:        John Baldwin <jhb@freebsd.org>
Cc:        "svn-src-stable-7@freebsd.org" <svn-src-stable-7@freebsd.org>, "scsi@freebsd.org" <scsi@freebsd.org>
Subject:   Re: svn commit: r218277 - in stable/7/sys: kern sys
Message-ID:  <20110420053226.GA22854@curry.mchp.siemens.de>
In-Reply-To: <201104190920.25924.jhb@freebsd.org>
References:  <201102041444.p14EixJP087709@svn.freebsd.org> <201104180918.26054.jhb@freebsd.org> <20110419125648.GA17780@curry.mchp.siemens.de> <201104190920.25924.jhb@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, 19-Apr-2011 at 15:20:25 +0200, John Baldwin wrote:
> On Tuesday, April 19, 2011 8:56:48 am Andre Albsmeier wrote:
> > On Mon, 18-Apr-2011 at 15:18:25 +0200, John Baldwin wrote:
> > > On Monday, April 18, 2011 7:36:57 am Andre Albsmeier wrote:
> > > > On Fri, 15-Apr-2011 at 18:35:05 +0200, John Baldwin wrote:
> > > > > On Friday, April 15, 2011 9:25:25 am Andre Albsmeier wrote:
> > > > > > On Fri, 04-Feb-2011 at 14:44:59 +0000, John Baldwin wrote:
> > > > > > > Author: jhb
> > > > > > > Date: Fri Feb  4 14:44:59 2011
> > > > > > > New Revision: 218277
> > > > > > > URL: http://svn.freebsd.org/changeset/base/218277
> > > > > > > 
> > > > > > > Log:
> > > > > > >   MFC 217075:
> > > > > > >   Retire PCONFIG and leave the priority of thread0 alone when waiting for
> > > > > > >   interrupt config hooks to execute.
> > > > > > >   
> > > > > > >   To preserve the KBI, I did not renumber priorities but simply removed
> > > > > > >   PCONFIG.
> > > > > > > 
> > > > > > > Modified:
> > > > > > >   stable/7/sys/kern/subr_autoconf.c
> > > > > > >   stable/7/sys/sys/priority.h
> > > > > > > Directory Properties:
> > > > > > >   stable/7/sys/   (props changed)
> > > > > > >   stable/7/sys/cddl/contrib/opensolaris/   (props changed)
> > > > > > >   stable/7/sys/contrib/dev/acpica/   (props changed)
> > > > > > >   stable/7/sys/contrib/pf/   (props changed)
> > > > > > > 
> > > > > > > Modified: stable/7/sys/kern/subr_autoconf.c
> > > > > > > 
> > > > > ==============================================================================
> > > > > > > --- stable/7/sys/kern/subr_autoconf.c	Fri Feb  4 14:44:42 2011	
> > > > > (r218276)
> > > > > > > +++ stable/7/sys/kern/subr_autoconf.c	Fri Feb  4 14:44:59 2011	
> > > > > (r218277)
> > > > > > > @@ -108,7 +108,7 @@ run_interrupt_driven_config_hooks(dummy)
> > > > > > >  	warned = 0;
> > > > > > >  	while (!TAILQ_EMPTY(&intr_config_hook_list)) {
> > > > > > >  		if (msleep(&intr_config_hook_list, &intr_config_hook_lock,
> > > > > > > -		    PCONFIG, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > > > > > +		    0, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > > > > >  		    EWOULDBLOCK) {
> > > > > > >  			mtx_unlock(&intr_config_hook_lock);
> > > > > > >  			warned++;
> > > > > > 
> > > > > > 
> > > > > > This broke several of my machines in a somewhat strange way:
> > > > > > 
> > > > > > After upgrading them (17) to a recent 7-STABLE (as of 2011-04-12)
> > > > > > I noticed that some (4) of them didn't start. All 4 didn't find
> > > > > > their boot device anymore. What they all got in common is:
> > > > > > 
> > > > > > - an Adaptec 2940 Ultra SCSI adapter
> > > > > > - two SCSI harddisks (da0 and da1) of various brands
> > > > > > - one SCSI CDROM drive (cd0)
> > > > > > 
> > > > > > To be exact, none of the three devices (da0, da1, cd0) were
> > > > > > detected at all. Other machines with a similar configuration
> > > > > > (2940 and da0/da1) but _without_ the CDROM drive didn't have
> > > > > > any problems. So I simply removed the CDROM drives on the 4
> > > > > > machines in question and they all booted again.
> > > > > > 
> > > > > > Today I decided to dig into this and after reverting(*) the
> > > > > > above change, they worked with the CDROM again. I have cross-
> > > > > > checked it 3 times. No idea what's happening here...
> > > > > > 
> > > > > > 	-Andre
> > > > > > 
> > > > > > (*) To be honest, I use this patch so I had to modify only one file:
> > > > > > 
> > > > > > --- sys/kern/subr_autoconf.c.ORI	2011-02-05 13:14:11.000000000 +0100
> > > > > > +++ sys/kern/subr_autoconf.c	2011-04-15 14:34:31.000000000 +0200
> > > > > > @@ -108,7 +108,7 @@
> > > > > >  	warned = 0;
> > > > > >  	while (!TAILQ_EMPTY(&intr_config_hook_list)) {
> > > > > >  		if (msleep(&intr_config_hook_list, &intr_config_hook_lock,
> > > > > > -		    0, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > > > > +		    PRI_MIN_KERN + 32, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > > > >  		    EWOULDBLOCK) {
> > > > > >  			mtx_unlock(&intr_config_hook_lock);
> > > > > >  			warned++;
> > > > > 
> > > > > Do you get any warnings about CAM timeouts, etc. when these probe?  A verbose 
> > > > > dmesg might be nice to look at if possible.
> > > > 
> > > > OK, I have set up a machine for testing. In my other mail
> > > > I was wrong saying that the pass devices appear when using
> > > > the problematic kernel...
> > > > 
> > > > Here are the dmesgs:
> > > > 
> > > > - dmesg_bad is the original kernel as of Friday
> > > > - dmesg_ok is the patched kernel (see above) as of Friday
> > > > - dmesg.diff is the diff between both
> > > > 
> > > > If you want me to try something just tell me...
> > > 
> > > Hmmm, what if you make SCSI_DELAY larger?  Also, can you let it fail the
> > > mount and drop into ddb and then get 'ps' output?
> > 
> > As soon as I include the debugger into the kernel the problem
> > is gone. I have double-checked it two times now: With debugger
> > the drives are detected, without debugger mostly (but not always)
> > not.
> > 
> > I currently have it running in an endless rebooting loop hoping,
> > that it fails eventually...
> 
> Hummm.  This seems like it is a timing related race. :(

Success! Sometimes at night it finally panic'ed even with the
debugger in the kernel. Here is the output of 'ps' and some other
commands I remembered (no idea if any of these make sense in this
context :-)). It is still in this state with the serial console
attached so just tell me what to type ;-).


KDB: enter: manual escape to debugger
[thread pid 1 tid 100001 ]
Stopped at      kdb_enter_why+0x3b:     xorl    %eax,%eax
db> ps
  pid  ppid  pgrp   uid   state   wmesg     wchan    cmd
   35     0     0     0  RL                          [softdepflush]
   34     0     0     0  RL                          [syncer]
   33     0     0     0  RL                          [vnlru]
   32     0     0     0  RL                          [bufdaemon]
   31     0     0     0  RL                          [pagezero]
   30     0     0     0  RL                          [idlepoll]
   29     0     0     0  RL                          [vmdaemon]
   28     0     0     0  RL                          [pagedaemon]
   27     0     0     0  WL                          [irq1: atkbd0]
   26     0     0     0  WL                          [swi0: uart uart]
   25     0     0     0  SL      -        0xc182a63c [fdc0]
   24     0     0     0  SL      idle     0xc1829600 [aic_recovery0]
   23     0     0     0  WL                          [irq11: ahc0]
   22     0     0     0  SL      idle     0xc1829600 [aic_recovery0]
   21     0     0     0  WL                          [irq10: fxp0]
   20     0     0     0  WL                          [irq9: acpi0 intsmb0]
   19     0     0     0  SL      -        0xc181b800 [kqueue taskq]
   18     0     0     0  WL                          [swi6: task queue]
   17     0     0     0  WL                          [swi6: Giant taskq]
--More--            9     0     0     0  RL                          [thread taskq]
   16     0     0     0  WL                          [swi5: Fast task queue]
   15     0     0     0  WL                          [swi2: cambio]
    8     0     0     0  SL      ccb_scan 0xc0766714 [xpt_thrd]
    7     0     0     0  SL      -        0xc181bd80 [acpi_task_2]
    6     0     0     0  SL      -        0xc181bd80 [acpi_task_1]
    5     0     0     0  SL      -        0xc181bd80 [acpi_task_0]
   14     0     0     0  SL      -        0xc077be54 [yarrow]
    4     0     0     0  SL      -        0xc077942c [g_down]
    3     0     0     0  SL      -        0xc0779428 [g_up]
    2     0     0     0  SL      -        0xc0779420 [g_event]
   13     0     0     0  WL                          [swi3: vm]
   12     0     0     0  LL     *Giant    0xc1821dc0 [swi4: clock]
   11     0     0     0  WL                          [swi1: net]
   10     0     0     0  RL                          [idle]
    1     0     0     0  RL      CPU 0               [swapper]
    0     0     0     0  SLs     sched    0xc07794c0 [swapper]
db> show threads
  100035 (0xc1a474c0)  fork_trampoline() at fork_trampoline
  100034 (0xc19a8000)  fork_trampoline() at fork_trampoline
  100033 (0xc19a8260)  fork_trampoline() at fork_trampoline
  100032 (0xc19a84c0)  fork_trampoline() at fork_trampoline
  100031 (0xc19a8720)  fork_trampoline() at fork_trampoline
  100030 (0xc19a8980)  fork_trampoline() at fork_trampoline
  100029 (0xc19a8be0)  fork_trampoline() at fork_trampoline
  100028 (0xc19a9000)  fork_trampoline() at fork_trampoline
  100027 (0xc19a9260)  fork_trampoline() at fork_trampoline
  100026 (0xc19a94c0)  fork_trampoline() at fork_trampoline
  100025 (0xc19a9720)  sched_switch(c19a9720,0,1,7a3b3f3c,11,...) at sched_switch+0xa0
  100024 (0xc1855720)  sched_switch(c1855720,0,1,2419c5c9,11,...) at sched_switch+0xa0
  100023 (0xc1855980)  sched_switch(c1855980,0,1,b2704d4a,11,...) at sched_switch+0xa0
  100022 (0xc1855be0)  sched_switch(c1855be0,0,1,2419b12b,11,...) at sched_switch+0xa0
  100021 (0xc18ad000)  fork_trampoline() at fork_trampoline
  100020 (0xc18ad260)  fork_trampoline() at fork_trampoline
  100019 (0xc18ad4c0)  sched_switch(c18ad4c0,0,1,241f3621,11,...) at sched_switch+0xa0
  100018 (0xc18ad720)  fork_trampoline() at fork_trampoline
  100017 (0xc18ad980)  fork_trampoline() at fork_trampoline
  100016 (0xc18adbe0)  sched_switch(c18adbe0,0,4,b27f7ade,11,...) at sched_switch+0xa0
--More--          100015 (0xc183e260)  fork_trampoline() at fork_trampoline
  100014 (0xc183e4c0)  sched_switch(c183e4c0,0,1,b270879a,11,...) at sched_switch+0xa0
  100013 (0xc183e720)  sched_switch(c183e720,0,1,24199731,11,...) at sched_switch+0xa0
  100012 (0xc183e980)  sched_switch(c183e980,0,1,241f0bf7,11,...) at sched_switch+0xa0
  100011 (0xc183ebe0)  sched_switch(c183ebe0,0,1,241ef695,11,...) at sched_switch+0xa0
  100010 (0xc1855000)  sched_switch(c1855000,0,1,241ee335,11,...) at sched_switch+0xa0
  100009 (0xc1855260)  sched_switch(c1855260,0,1,b01ec712,11,...) at sched_switch+0xa0
  100008 (0xc18554c0)  sched_switch(c18554c0,0,1,241960e3,11,...) at sched_switch+0xa0
  100007 (0xc183d000)  sched_switch(c183d000,0,1,241943af,11,...) at sched_switch+0xa0
  100006 (0xc183d260)  sched_switch(c183d260,0,1,b01ecf55,11,...) at sched_switch+0xa0
  100005 (0xc183d4c0)  fork_trampoline() at fork_trampoline
  100004 (0xc183d720)  sched_switch(c183d720,0,1,b357cafa,11,...) at sched_switch+0xa0
  100003 (0xc183d980)  fork_trampoline() at fork_trampoline
  100002 (0xc183dbe0)  sched_switch(c183dbe0,0,6,b2703e6d,11,...) at sched_switch+0xa0
  100001 (0xc183e000)  kdb_enter_why(c06feb01,c0708b8f,ffffffff,c17e7b6c,c06bf6f1,...) at kdb_enter_why+0x3b
  100000 (0xc07797a0)  sched_switch(c07797a0,0,1,b28c7b89,11,...) at sched_switch+0xa0
db> show thread
Thread 100001 at 0xc183e000:
 proc (pid 1): 0xc183c000
 flags: 0x10005  pflags: 0
 state: RUNNING (CPU 0)
 priority: 52
db> show geom
class: FD (0xc0758720)
  geom: fd0 (0xc19e9880), rank=1
    provider: fd0 (0xc19e9800), access=r0w0e0
      consumer: 0xc19e4280 (fd0), access=r0w0e0

class: DEV (0xc073bca0)
  geom: fd0 (0xc19e9680), rank=2
    consumer: 0xc19e4280 (fd0), access=r0w0e0

class: PART (0xc073c4a0)

class: VFS (0xc073c3a0)

class: MBR (0xc073c320)

class: MBREXT (0xc073c2c0)

class: BSD (0xc073bbc0)

class: MD (0xc0737400)

class: SWAP (0xc0754e20)

class: DISK (0xc073bda0)

db> trace
Tracing pid 1 tid 100001 td 0xc183e000
kdb_enter_why(c06feb01,c0708b8f,ffffffff,c17e7b6c,c06bf6f1,...) at kdb_enter_why+0x3b
scgetc(c07866e0,1,c07790c4,c07825e0,c17e7bd4,...) at scgetc+0x47d
sc_cngetc(c0739ea0,0,c17e7ba0,c059568a,c17e7bc0,...) at sc_cngetc+0xe1
cncheckc(c17e7bc0,c05ddd55,0,0,c17e7c53,...) at cncheckc+0x58
cngetc(0,0,c17e7c53,c17e7bd4,c1a411b0,...) at cngetc+0x1a
gets(c17e7bd4,80,1,0,0,...) at gets+0x25
vfs_mountroot_ask(c1a411b0,c0709785,c0715a97,1,c05c3910,...) at vfs_mountroot_ask+0x72
vfs_mountroot(0,0,0,0,0,...) at vfs_mountroot+0x39c
start_init(0,c17e7d38,0,0,0,...) at start_init+0x3c
fork_exit(c0504f70,0,c17e7d38) at fork_exit+0x94
fork_trampoline() at fork_trampoline+0x8
--- trap 0, eip = 0, esp = 0xc17e7d70, ebp = 0 ---
db> show proc 8
Process 8 (xpt_thrd) at 0xc183c2dc:
 state: NORMAL
 uid: 0  gids: 0
 parent: pid 0 at 0xc07794c0
 ABI: null
 threads: 1
100013                   D       ccb_scan 0xc0766714 [xpt_thrd]
db> show proc 23
Process 23 (irq11: ahc0) at 0xc1996894:
 state: NORMAL
 uid: 0  gids: 0
 parent: pid 0 at 0xc07794c0
 ABI: null
 threads: 1
100023                   I                           [irq11: ahc0]


Thanks,

	-Andre



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20110420053226.GA22854>