From owner-freebsd-scsi@FreeBSD.ORG  Mon Apr 18 17:33:35 2011
Return-Path: <owner-freebsd-scsi@FreeBSD.ORG>
Delivered-To: scsi@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4B2FA1065674
	for <scsi@freebsd.org>; Mon, 18 Apr 2011 17:33:35 +0000 (UTC)
	(envelope-from Andre.Albsmeier@siemens.com)
Received: from goliath.siemens.de (goliath.siemens.de [192.35.17.28])
	by mx1.freebsd.org (Postfix) with ESMTP id C37148FC13
	for <scsi@freebsd.org>; Mon, 18 Apr 2011 17:33:34 +0000 (UTC)
Received: from mail3.siemens.de (localhost [127.0.0.1])
	by goliath.siemens.de (8.13.6/8.13.6) with ESMTP id p3IHKWqG001226;
	Mon, 18 Apr 2011 19:20:32 +0200
Received: from curry.mchp.siemens.de (curry.mchp.siemens.de [139.25.40.130])
	by mail3.siemens.de (8.13.6/8.13.6) with ESMTP id p3IHKW5N000886;
	Mon, 18 Apr 2011 19:20:32 +0200
Received: (from localhost)
	by curry.mchp.siemens.de (8.14.4/8.14.4) id p3IHKWUc017794;
Date: Mon, 18 Apr 2011 19:20:32 +0200
From: Andre Albsmeier <Andre.Albsmeier@siemens.com>
To: John Baldwin <jhb@freebsd.org>
Message-ID: <20110418172032.GA8849@curry.mchp.siemens.de>
References: <201102041444.p14EixJP087709@svn.freebsd.org>
	<201104151235.05114.jhb@freebsd.org>
	<20110418113657.GA6080@curry.mchp.siemens.de>
	<201104180918.26054.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <201104180918.26054.jhb@freebsd.org>
X-Echelon: <censored>
X-Advice: Drop that crappy M$-Outlook, I'm tired of your viruses!
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: "svn-src-stable-7@freebsd.org" <svn-src-stable-7@freebsd.org>,
	"scsi@freebsd.org" <scsi@freebsd.org>
Subject: Re: svn commit: r218277 - in stable/7/sys: kern sys
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
	<mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Apr 2011 17:33:35 -0000

On Mon, 18-Apr-2011 at 15:18:25 +0200, John Baldwin wrote:
> On Monday, April 18, 2011 7:36:57 am Andre Albsmeier wrote:
> > On Fri, 15-Apr-2011 at 18:35:05 +0200, John Baldwin wrote:
> > > On Friday, April 15, 2011 9:25:25 am Andre Albsmeier wrote:
> > > > On Fri, 04-Feb-2011 at 14:44:59 +0000, John Baldwin wrote:
> > > > > Author: jhb
> > > > > Date: Fri Feb  4 14:44:59 2011
> > > > > New Revision: 218277
> > > > > URL: http://svn.freebsd.org/changeset/base/218277
> > > > > 
> > > > > Log:
> > > > >   MFC 217075:
> > > > >   Retire PCONFIG and leave the priority of thread0 alone when waiting for
> > > > >   interrupt config hooks to execute.
> > > > >   
> > > > >   To preserve the KBI, I did not renumber priorities but simply removed
> > > > >   PCONFIG.
> > > > > 
> > > > > Modified:
> > > > >   stable/7/sys/kern/subr_autoconf.c
> > > > >   stable/7/sys/sys/priority.h
> > > > > Directory Properties:
> > > > >   stable/7/sys/   (props changed)
> > > > >   stable/7/sys/cddl/contrib/opensolaris/   (props changed)
> > > > >   stable/7/sys/contrib/dev/acpica/   (props changed)
> > > > >   stable/7/sys/contrib/pf/   (props changed)
> > > > > 
> > > > > Modified: stable/7/sys/kern/subr_autoconf.c
> > > > > ==============================================================================
> > > > > --- stable/7/sys/kern/subr_autoconf.c	Fri Feb  4 14:44:42 2011	(r218276)
> > > > > +++ stable/7/sys/kern/subr_autoconf.c	Fri Feb  4 14:44:59 2011	(r218277)
> > > > > @@ -108,7 +108,7 @@ run_interrupt_driven_config_hooks(dummy)
> > > > >  	warned = 0;
> > > > >  	while (!TAILQ_EMPTY(&intr_config_hook_list)) {
> > > > >  		if (msleep(&intr_config_hook_list, &intr_config_hook_lock,
> > > > > -		    PCONFIG, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > > > +		    0, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > > >  		    EWOULDBLOCK) {
> > > > >  			mtx_unlock(&intr_config_hook_lock);
> > > > >  			warned++;
> > > > 
> > > > 
> > > > This broke several of my machines in a somewhat strange way:
> > > > 
> > > > After upgrading them (17) to a recent 7-STABLE (as of 2011-04-12)
> > > > I noticed that some (4) of them didn't start. All 4 didn't find
> > > > their boot device anymore. What they all got in common is:
> > > > 
> > > > - an Adaptec 2940 Ultra SCSI adapter
> > > > - two SCSI harddisks (da0 and da1) of various brands
> > > > - one SCSI CDROM drive (cd0)
> > > > 
> > > > To be exact, none of the three devices (da0, da1, cd0) were
> > > > detected at all. Other machines with a similar configuration
> > > > (2940 and da0/da1) but _without_ the CDROM drive didn't have
> > > > any problems. So I simply removed the CDROM drives on the 4
> > > > machines in question and they all booted again.
> > > > 
> > > > Today I decided to dig into this and after reverting(*) the
> > > > above change, they worked with the CDROM again. I have cross-
> > > > checked it 3 times. No idea what's happening here...
> > > > 
> > > > 	-Andre
> > > > 
> > > > (*) To be honest, I use this patch so I had to modify only one file:
> > > > 
> > > > --- sys/kern/subr_autoconf.c.ORI	2011-02-05 13:14:11.000000000 +0100
> > > > +++ sys/kern/subr_autoconf.c	2011-04-15 14:34:31.000000000 +0200
> > > > @@ -108,7 +108,7 @@
> > > >  	warned = 0;
> > > >  	while (!TAILQ_EMPTY(&intr_config_hook_list)) {
> > > >  		if (msleep(&intr_config_hook_list, &intr_config_hook_lock,
> > > > -		    0, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > > +		    PRI_MIN_KERN + 32, "conifhk", WARNING_INTERVAL_SECS * hz) ==
> > > >  		    EWOULDBLOCK) {
> > > >  			mtx_unlock(&intr_config_hook_lock);
> > > >  			warned++;
> > > 
> > > Do you get any warnings about CAM timeouts, etc. when these probe?  A verbose 
> > > dmesg might be nice to look at if possible.
> > 
> > OK, I have set up a machine for testing. In my other mail
> > I was wrong saying that the pass devices appear when using
> > the problematic kernel...
> > 
> > Here are the dmesgs:
> > 
> > - dmesg_bad is the original kernel as of Friday
> > - dmesg_ok is the patched kernel (see above) as of Friday
> > - dmesg.diff is the diff between both
> > 
> > If you want me to try something just tell me...
> 
> Hmmm, what if you make SCSI_DELAY larger?  Also, can you let it fail the

I tried this already. Normally, I use 500 (which works on all machines)
and retried with 5000. No change...

> mount and drop into ddb and then get 'ps' output?

I will try it tomorrow.

> 
> I think the CAM boot probe is broken a bit.  xpt_rescan_done() always calls
> xpt_release_boot(), but we don't hold the boot for each bus added while
> buses_config_done is 0, so it seems CAM only waits for at least one bus to
> rescan before it lets the boot continue?  This seems wrong (i.e. one would
> think it would let all the busses added before this point scan before
> continuing).

Hmm, I have only one SCSI bus in that machine. Of my 17 machines, 15 are
SCSI-based. Only 4 had this problem. One of them has two busses,
the others only one.
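
To make the suspected hold/release mismatch concrete, here is a toy model
in plain C. It is only a sketch of the counting you describe, not the CAM
code: boot_gate, bus_add(), bus_rescan_done() and the other names are made
up for illustration, and the real xpt_release_boot() logic may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in CAM. */
struct boot_gate {
	int holds;	/* outstanding holds keeping the boot from continuing */
	int buses_busy;	/* buses whose rescan has not finished yet */
};

static void
gate_hold(struct boot_gate *g)
{
	g->holds++;
}

static void
gate_release(struct boot_gate *g)
{
	assert(g->holds > 0);
	g->holds--;
}

/* Register a bus; optionally take one hold per bus (the assumed fix). */
static void
bus_add(struct boot_gate *g, bool hold_per_bus)
{
	g->buses_busy++;
	if (hold_per_bus)
		gate_hold(g);
}

/*
 * A bus finished its rescan. The toy guards against underflow; this
 * guard is an assumption of the model, not something taken from CAM.
 */
static void
bus_rescan_done(struct boot_gate *g)
{
	g->buses_busy--;
	if (g->holds > 0)
		gate_release(g);
}

static bool
boot_may_continue(const struct boot_gate *g)
{
	return (g->holds == 0);
}

/* How many buses are still scanning when the boot is first let go? */
static int
busy_when_boot_continues(int nbuses, bool hold_per_bus)
{
	struct boot_gate g = { 0, 0 };
	int i;

	if (!hold_per_bus)
		gate_hold(&g);	/* one hold for the whole probe */
	for (i = 0; i < nbuses; i++)
		bus_add(&g, hold_per_bus);
	for (i = 0; i < nbuses; i++) {
		bus_rescan_done(&g);
		if (boot_may_continue(&g))
			return (g.buses_busy);
	}
	return (g.buses_busy);
}
```

In this model busy_when_boot_continues(2, false) leaves one bus still
scanning when the boot continues, while busy_when_boot_continues(2, true)
leaves none. With a single bus both variants behave the same, so if the
model is anywhere near reality, this mismatch alone would not explain the
boxes with only one bus.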

> 
> However, in your dmesg, it starts to print out an announcement for a pass
> device before it starts mounting root, so it seems that xpt is finishing too
> early somehow.

Yes, I saw this as well. And in the "good" dmesg there are these
two lines which look a bit screwed up:

GEOM: nda0 at ahc0 bus 0 target 0 lun 0
...
ew disk da1

Don't know if this indicates some problem...

As I said, when I first ran into this problem last week, I _THINK_ I saw
the pass devices appear on at least one broken box. But I won't swear
to this. Another thing I remember is that at least one problematic
box booted successfully on the second try.

Thanks,

	-Andre

> 
> -- 
> John Baldwin

-- 
Linux: socialism that doesn't work