Date: Wed, 20 Apr 2011 07:32:26 +0200 From: Andre Albsmeier <Andre.Albsmeier@siemens.com> To: John Baldwin <jhb@freebsd.org> Cc: "svn-src-stable-7@freebsd.org" <svn-src-stable-7@freebsd.org>, "scsi@freebsd.org" <scsi@freebsd.org> Subject: Re: svn commit: r218277 - in stable/7/sys: kern sys Message-ID: <20110420053226.GA22854@curry.mchp.siemens.de> In-Reply-To: <201104190920.25924.jhb@freebsd.org> References: <201102041444.p14EixJP087709@svn.freebsd.org> <201104180918.26054.jhb@freebsd.org> <20110419125648.GA17780@curry.mchp.siemens.de> <201104190920.25924.jhb@freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, 19-Apr-2011 at 15:20:25 +0200, John Baldwin wrote: > On Tuesday, April 19, 2011 8:56:48 am Andre Albsmeier wrote: > > On Mon, 18-Apr-2011 at 15:18:25 +0200, John Baldwin wrote: > > > On Monday, April 18, 2011 7:36:57 am Andre Albsmeier wrote: > > > > On Fri, 15-Apr-2011 at 18:35:05 +0200, John Baldwin wrote: > > > > > On Friday, April 15, 2011 9:25:25 am Andre Albsmeier wrote: > > > > > > On Fri, 04-Feb-2011 at 14:44:59 +0000, John Baldwin wrote: > > > > > > > Author: jhb > > > > > > > Date: Fri Feb 4 14:44:59 2011 > > > > > > > New Revision: 218277 > > > > > > > URL: http://svn.freebsd.org/changeset/base/218277 > > > > > > > > > > > > > > Log: > > > > > > > MFC 217075: > > > > > > > Retire PCONFIG and leave the priority of thread0 alone when waiting for > > > > > > > interrupt config hooks to execute. > > > > > > > > > > > > > > To preserve the KBI, I did not renumber priorities but simply removed > > > > > > > PCONFIG. > > > > > > > > > > > > > > Modified: > > > > > > > stable/7/sys/kern/subr_autoconf.c > > > > > > > stable/7/sys/sys/priority.h > > > > > > > Directory Properties: > > > > > > > stable/7/sys/ (props changed) > > > > > > > stable/7/sys/cddl/contrib/opensolaris/ (props changed) > > > > > > > stable/7/sys/contrib/dev/acpica/ (props changed) > > > > > > > stable/7/sys/contrib/pf/ (props changed) > > > > > > > > > > > > > > Modified: stable/7/sys/kern/subr_autoconf.c > > > > > > > > > > > > ============================================================================== > > > > > > > --- stable/7/sys/kern/subr_autoconf.c Fri Feb 4 14:44:42 2011 > > > > > (r218276) > > > > > > > +++ stable/7/sys/kern/subr_autoconf.c Fri Feb 4 14:44:59 2011 > > > > > (r218277) > > > > > > > @@ -108,7 +108,7 @@ run_interrupt_driven_config_hooks(dummy) > > > > > > > warned = 0; > > > > > > > while (!TAILQ_EMPTY(&intr_config_hook_list)) { > > > > > > > if (msleep(&intr_config_hook_list, &intr_config_hook_lock, > > > > > > > - PCONFIG, "conifhk", WARNING_INTERVAL_SECS * hz) == > > > > > > > + 0, "conifhk", WARNING_INTERVAL_SECS * hz) == > > > > > > > EWOULDBLOCK) { > > > > > > > mtx_unlock(&intr_config_hook_lock); > > > > > > > warned++; > > > > > > > > > > > > > > > > > > This broke several of my machines in a somewhat strange way: > > > > > > > > > > > > After upgrading them (17) to a recent 7-STABLE (as of 2011-04-12) > > > > > > I noticed that some (4) of them didn't start. All 4 didn't find > > > > > > their boot device anymore. What they all got in common is: > > > > > > > > > > > > - an Adaptec 2940 Ultra SCSI adapter > > > > > > - two SCSI harddisks (da0 and da1) of various brands > > > > > > - one SCSI CDROM drive (cd0) > > > > > > > > > > > > To be exact, none of the three devices (da0, da1, cd0) were > > > > > > detected at all. Other machines with a similar configuration > > > > > > (2940 and da0/da1) but _without_ the CDROM drive didn't have > > > > > > any problems. So I simply removed the CDROM drives on the 4 > > > > > > machines in question and they all booted again. > > > > > > > > > > > > Today I decided to dig into this and after reverting(*) the > > > > > > above change, they worked with the CDROM again. I have cross- > > > > > > checked it 3 times. No idea what's happening here... > > > > > > > > > > > > -Andre > > > > > > > > > > > > (*) To be honest, I use this patch so I had to modify only one file: > > > > > > > > > > > > --- sys/kern/subr_autoconf.c.ORI 2011-02-05 13:14:11.000000000 +0100 > > > > > > +++ sys/kern/subr_autoconf.c 2011-04-15 14:34:31.000000000 +0200 > > > > > > @@ -108,7 +108,7 @@ > > > > > > warned = 0; > > > > > > while (!TAILQ_EMPTY(&intr_config_hook_list)) { > > > > > > if (msleep(&intr_config_hook_list, &intr_config_hook_lock, > > > > > > - 0, "conifhk", WARNING_INTERVAL_SECS * hz) == > > > > > > + PRI_MIN_KERN + 32, "conifhk", WARNING_INTERVAL_SECS * hz) == > > > > > > EWOULDBLOCK) { > > > > > > mtx_unlock(&intr_config_hook_lock); > > > > > > warned++; > > > > > > > > > > Do you get any warnings about CAM timeouts, etc. when these probe? A verbose > > > > > dmesg might be nice to look at if possible. > > > > > > > > OK, I have set up a machine for testing. In my other mail > > > > I was wrong saying that the pass devices appear when using > > > > the problematic kernel... > > > > > > > > Here are the dmesgs: > > > > > > > > - dmesg_bad is the original kernel as of Friday > > > > - dmesg_ok is the patched kernel (see above) as of Friday > > > > - dmesg.diff is the diff between both > > > > > > > > If you want me to try something just tell me... > > > > > > Hmmm, what if you make SCSI_DELAY larger? Also, can you let it fail the > > > mount and drop into ddb and then get 'ps' output? > > > > As soon as I include the debugger into the kernel the problem > > is gone. I have double-checked it two times now: With debugger > > the drives are detected, without debugger mostly (but not always) > > not. > > > > I currently have it running in an endless rebooting loop hoping, > > that it fails eventually... > > Hummm. This seems like it is a timing related race. :( Success! Sometimes at night it finally panic'ed even with the debugger in the kernel. Here is the output of 'ps' and some other commands I remembered (no idea if any of these make sense in this context :-)). It is still in this state with the serial console attached so just tell me what to type ;-). KDB: enter: manual escape to debugger [thread pid 1 tid 100001 ] Stopped at kdb_enter_why+0x3b: xorl %eax,%eax db> ps pid ppid pgrp uid state wmesg wchan cmd 35 0 0 0 RL [softdepflush] 34 0 0 0 RL [syncer] 33 0 0 0 RL [vnlru] 32 0 0 0 RL [bufdaemon] 31 0 0 0 RL [pagezero] 30 0 0 0 RL [idlepoll] 29 0 0 0 RL [vmdaemon] 28 0 0 0 RL [pagedaemon] 27 0 0 0 WL [irq1: atkbd0] 26 0 0 0 WL [swi0: uart uart] 25 0 0 0 SL - 0xc182a63c [fdc0] 24 0 0 0 SL idle 0xc1829600 [aic_recovery0] 23 0 0 0 WL [irq11: ahc0] 22 0 0 0 SL idle 0xc1829600 [aic_recovery0] 21 0 0 0 WL [irq10: fxp0] 20 0 0 0 WL [irq9: acpi0 intsmb0] 19 0 0 0 SL - 0xc181b800 [kqueue taskq] 18 0 0 0 WL [swi6: task queue] 17 0 0 0 WL [swi6: Giant taskq] --More-- 9 0 0 0 RL [thread taskq] 16 0 0 0 WL [swi5: Fast task queue] 15 0 0 0 WL [swi2: cambio] 8 0 0 0 SL ccb_scan 0xc0766714 [xpt_thrd] 7 0 0 0 SL - 0xc181bd80 [acpi_task_2] 6 0 0 0 SL - 0xc181bd80 [acpi_task_1] 5 0 0 0 SL - 0xc181bd80 [acpi_task_0] 14 0 0 0 SL - 0xc077be54 [yarrow] 4 0 0 0 SL - 0xc077942c [g_down] 3 0 0 0 SL - 0xc0779428 [g_up] 2 0 0 0 SL - 0xc0779420 [g_event] 13 0 0 0 WL [swi3: vm] 12 0 0 0 LL *Giant 0xc1821dc0 [swi4: clock] 11 0 0 0 WL [swi1: net] 10 0 0 0 RL [idle] 1 0 0 0 RL CPU 0 [swapper] 0 0 0 0 SLs sched 0xc07794c0 [swapper] db> show threads 100035 (0xc1a474c0) fork_trampoline() at fork_trampoline 100034 (0xc19a8000) fork_trampoline() at fork_trampoline 100033 (0xc19a8260) fork_trampoline() at fork_trampoline 100032 (0xc19a84c0) fork_trampoline() at fork_trampoline 100031 (0xc19a8720) fork_trampoline() at fork_trampoline 100030 (0xc19a8980) fork_trampoline() at fork_trampoline 100029 (0xc19a8be0) fork_trampoline() at fork_trampoline 100028 (0xc19a9000) fork_trampoline() at fork_trampoline 100027 (0xc19a9260) fork_trampoline() at fork_trampoline 100026 (0xc19a94c0) fork_trampoline() at fork_trampoline 100025 (0xc19a9720) sched_switch(c19a9720,0,1,7a3b3f3c,11,...) at sched_switch+0xa0 100024 (0xc1855720) sched_switch(c1855720,0,1,2419c5c9,11,...) at sched_switch+0xa0 100023 (0xc1855980) sched_switch(c1855980,0,1,b2704d4a,11,...) at sched_switch+0xa0 100022 (0xc1855be0) sched_switch(c1855be0,0,1,2419b12b,11,...) at sched_switch+0xa0 100021 (0xc18ad000) fork_trampoline() at fork_trampoline 100020 (0xc18ad260) fork_trampoline() at fork_trampoline 100019 (0xc18ad4c0) sched_switch(c18ad4c0,0,1,241f3621,11,...) at sched_switch+0xa0 100018 (0xc18ad720) fork_trampoline() at fork_trampoline 100017 (0xc18ad980) fork_trampoline() at fork_trampoline 100016 (0xc18adbe0) sched_switch(c18adbe0,0,4,b27f7ade,11,...) at sched_switch+0xa0 --More-- 100015 (0xc183e260) fork_trampoline() at fork_trampoline 100014 (0xc183e4c0) sched_switch(c183e4c0,0,1,b270879a,11,...) at sched_switch+0xa0 100013 (0xc183e720) sched_switch(c183e720,0,1,24199731,11,...) at sched_switch+0xa0 100012 (0xc183e980) sched_switch(c183e980,0,1,241f0bf7,11,...) at sched_switch+0xa0 100011 (0xc183ebe0) sched_switch(c183ebe0,0,1,241ef695,11,...) at sched_switch+0xa0 100010 (0xc1855000) sched_switch(c1855000,0,1,241ee335,11,...) at sched_switch+0xa0 100009 (0xc1855260) sched_switch(c1855260,0,1,b01ec712,11,...) at sched_switch+0xa0 100008 (0xc18554c0) sched_switch(c18554c0,0,1,241960e3,11,...) at sched_switch+0xa0 100007 (0xc183d000) sched_switch(c183d000,0,1,241943af,11,...) at sched_switch+0xa0 100006 (0xc183d260) sched_switch(c183d260,0,1,b01ecf55,11,...) at sched_switch+0xa0 100005 (0xc183d4c0) fork_trampoline() at fork_trampoline 100004 (0xc183d720) sched_switch(c183d720,0,1,b357cafa,11,...) at sched_switch+0xa0 100003 (0xc183d980) fork_trampoline() at fork_trampoline 100002 (0xc183dbe0) sched_switch(c183dbe0,0,6,b2703e6d,11,...) at sched_switch+0xa0 100001 (0xc183e000) kdb_enter_why(c06feb01,c0708b8f,ffffffff,c17e7b6c,c06bf6f1,...) at kdb_enter_why+0x3b 100000 (0xc07797a0) sched_switch(c07797a0,0,1,b28c7b89,11,...) at sched_switch+0xa0 db> show thread Thread 100001 at 0xc183e000: proc (pid 1): 0xc183c000 flags: 0x10005 pflags: 0 state: RUNNING (CPU 0) priority: 52 db> show geom class: FD (0xc0758720) geom: fd0 (0xc19e9880), rank=1 provider: fd0 (0xc19e9800), access=r0w0e0 consumer: 0xc19e4280 (fd0), access=r0w0e0 class: DEV (0xc073bca0) geom: fd0 (0xc19e9680), rank=2 consumer: 0xc19e4280 (fd0), access=r0w0e0 class: PART (0xc073c4a0) class: VFS (0xc073c3a0) class: MBR (0xc073c320) class: MBREXT (0xc073c2c0) class: BSD (0xc073bbc0) class: MD (0xc0737400) class: SWAP (0xc0754e20) class: DISK (0xc073bda0) db> trace Tracing pid 1 tid 100001 td 0xc183e000 kdb_enter_why(c06feb01,c0708b8f,ffffffff,c17e7b6c,c06bf6f1,...) at kdb_enter_why+0x3b scgetc(c07866e0,1,c07790c4,c07825e0,c17e7bd4,...) at scgetc+0x47d sc_cngetc(c0739ea0,0,c17e7ba0,c059568a,c17e7bc0,...) at sc_cngetc+0xe1 cncheckc(c17e7bc0,c05ddd55,0,0,c17e7c53,...) at cncheckc+0x58 cngetc(0,0,c17e7c53,c17e7bd4,c1a411b0,...) at cngetc+0x1a gets(c17e7bd4,80,1,0,0,...) at gets+0x25 vfs_mountroot_ask(c1a411b0,c0709785,c0715a97,1,c05c3910,...) at vfs_mountroot_ask+0x72 vfs_mountroot(0,0,0,0,0,...) at vfs_mountroot+0x39c start_init(0,c17e7d38,0,0,0,...) at start_init+0x3c fork_exit(c0504f70,0,c17e7d38) at fork_exit+0x94 fork_trampoline() at fork_trampoline+0x8 --- trap 0, eip = 0, esp = 0xc17e7d70, ebp = 0 --- db> show proc 8 Process 8 (xpt_thrd) at 0xc183c2dc: state: NORMAL uid: 0 gids: 0 parent: pid 0 at 0xc07794c0 ABI: null threads: 1 100013 D ccb_scan 0xc0766714 [xpt_thrd] db> show proc 23 Process 23 (irq11: ahc0) at 0xc1996894: state: NORMAL uid: 0 gids: 0 parent: pid 0 at 0xc07794c0 ABI: null threads: 1 100023 I [irq11: ahc0] Thanks, -Andre
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20110420053226.GA22854>