Date:      Tue, 29 Jun 1999 17:00:50 -0600 (MDT)
From:      "Kenneth D. Merry" <ken@plutotech.com>
To:        jgreco@ns.sol.net (Joe Greco)
Cc:        scsi@FreeBSD.ORG
Subject:   Re: FreeBSD panics with Mylex DAC960SX
Message-ID:  <199906292300.RAA29666@panzer.kdm.org>
In-Reply-To: <199906291850.NAA83997@aurora.sol.net> from Joe Greco at "Jun 29, 1999 01:50:13 pm"

Joe Greco wrote...
> Hello,
> 
> First, cool stuff in 3.X!  Hats off to you guys.
> 
> I have one minor issue that I am hoping is a simple fix.
> 
> I'm using Mylex DAC960SX SCSI-to-SCSI RAID controllers on an ASUS P2B-DS
> motherboard, off of the onboard SCSI controller.  This is a neat gadget
> that makes a bunch of drives look like a single SCSI target.
> 
> Now...  here's the problem.  The unit takes a while to start up (~60s)
> from power on, and until it reports "STARTUP COMPLETE", FreeBSD blows
> chunks when trying to access it.
> 
> In particular, when the Mylex freaks out and thinks half its disks are
> dead (duh forgot to power them on), the startup sequence never completes,
> and FreeBSD will sit there doing boot-panic-boot-panic-etc.  This is not
> very gracious, and is a bit irritating since the serial console I need to
> talk to the Mylex is on the box...
> 
> So, my _real_ issue is the following panic:

[ ... ]

> da1 at ahc0 bus 0 target 1 lun 0
> da1: <MYLEX DAC960SX138928B5 4332> Fixed Direct Access SCSI-2 device 
> da1: 40.0MB/s transfers (20.0MHz, offset 16, 16bit), Tagged Queueing Enabled
> da1: A
> de0: autosense failed: cable problem?
> swapon: adding /dev/da0s1b as swap device
> Automatic reboot in progress...
> /dev/rda0s1a: FILESYSTEM CLEAN^M; SKIPPING CHECK
> S
> ^M/dev/rda0s1a: 
> clean, 138968 frFee (296 frags, 1a7334 blocks, 0.2t% fragmentation)a
> l trap 18: integer divide fault while in kernel mode
> mp_lock = 01000002; cpuid = 1; lapic.id = 00000000
> instruction pointer     = 0x8:0xf014a681
> stack pointer           = 0x10:0xfa66b9d8
> frame pointer           = 0x10:0xfa66ba00
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, def32 1, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 18 (fsck)
> interrupt mask          =  <- SMP: XXX
> trap number             = 18
> panic: integer divide fault
> mp_lock = 01000002; cpuid = 1; lapic.id = 00000000
> boot() called on cpu#1
> 
> syncing disks... done
> (da1:ahc0:0:1:0): SYNCHRONIZE CACHE. CDB: 35 0 0 0 0 0 0 0 0 0 
> (da1:ahc0:0:1:0): NOT READY
> Automatic reboot in 15 seconds - press a key on the console to abort
> Rebooting...
> cpu_reset called on cpu#1
> cpu_reset: Stopping other CPUs
> cpu_reset: Restarting BSP
> cpu_reset_proxy: Grabbed mp lock for BSP
> cpu_reset_proxy: Stopped CPU 1
> 
> I apologize for not reproducing this on a 3.2R box but I assure you that
> it also panics in fsck on 3.2R in what appears to be an identical manner.
> The panic does seem to be caused by fsck - I can enter single user mode
> just fine.
> 
> My guess is that the integer divide fault results from the device reporting
> a size of zero (strictly a guess though!).  Normally, size is reported as
> 
> da1: <MYLEX DAC960SX138928B5 4332> Fixed Direct Access SCSI-2 device 
> da1: 40.0MB/s transfers (20.0MHz, offset 16, 16bit), Tagged Queueing Enabled
> da1: 138928MB (284524544 512 byte sectors: 255H 63S/T 17710C)
> 
> but during all of these crash-boots, the third line is
> 
> da1: <MYLEX DAC960SX138928B5 4332> Fixed Direct Access SCSI-2 device 
> da1: 40.0MB/s transfers (20.0MHz, offset 16, 16bit), Tagged Queueing Enabled
> da1: A

That should probably read "Attempt to query device size failed ...."

You may be losing characters over the serial console or something.

> If I can provide further information to assist in tracking down this bug,
> please let me know.

My first guess is that it's happening during the open() routine, for some
reason.  That's why fsck seems to cause the problem.

You're probably right about the device returning a size of zero.  It isn't
immediately clear to me why the open routine would cause a panic, *unless*
the Mylex unit returns good status for the read capacity command, but
returns a capacity of 0.
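
To illustrate the suspected failure mode: any arithmetic that divides by a device-supplied block count or block size will trap if the device hands back zero. The Python sketch below is only an analogy and is not the da driver's code; the names size_in_mb and blocks_per_mb_unguarded are hypothetical.

```python
MB = 1024 * 1024


def blocks_per_mb_unguarded(block_size):
    """Divide by a device-supplied value with no sanity check.

    With block_size == 0 this raises ZeroDivisionError, the user-space
    analogue of an integer divide fault in the kernel.
    """
    return MB // block_size


def size_in_mb(blocks, block_size):
    """Hypothetical guarded helper: reject zero values before doing math."""
    if blocks == 0 or block_size == 0:
        raise ValueError("device reported zero capacity or block size")
    return blocks * block_size // MB
```

The point of the guarded version is simply that a "successful" read capacity carrying zeros should be treated as a failure before anything divides by it.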

It would be helpful to get a stack trace from the machine, if you can.
Building the kernel with "options DDB" will at least let us get a stack
trace from the debugger.

> Also, I was wondering more generally about what the proper way to deal with
> a device such as this is.  Assuming FreeBSD didn't actually crash when
> trying to access the device, it is still possible to attempt booting when
> the DAC controller is not ready, which will result - presumably - in fsck
> exiting and complaining about that filesystem.  What is the "correct" way
> to wait for something like this to become ready?  Is there a "correct" way,
> even?

Well, it really depends on how the device behaves.  Here's what happens
after the initial probe phase:

- the da driver sends a read capacity to the disk, with a retry count of 4
  and a timeout of 5 seconds.

	1.  The read capacity succeeds, and the probe continues normally.
	2.  The read capacity fails, and one of two things happens:

		1.  If the error has an associated error recovery action,
		    we may send a start unit to the disk, or one TUR every
		    half second for a minute.  Then we retry the original
		    command.
		2.  If the error has no associated error recovery action,
		    we just retry it until the retry count is exhausted.
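
The flow above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the CAM error recovery code: the callables send, has_recovery_action and test_unit_ready are hypothetical stand-ins, a retry count of 4 is assumed to mean one initial attempt plus four retries, and the start unit branch is left out for brevity.

```python
import time


def wait_until_ready(test_unit_ready, interval=0.5, limit=60.0,
                     sleep=time.sleep):
    """Poll test_unit_ready() once per interval, for at most limit seconds."""
    for _ in range(int(limit / interval)):
        if test_unit_ready():
            return True
        sleep(interval)
    return False


def read_capacity_with_retries(send, has_recovery_action, test_unit_ready,
                               retries=4, sleep=time.sleep):
    """Issue the command via send(), which returns (ok, value_or_error).

    If the error has a recovery action, poll the unit before retrying.
    Otherwise just retry until the retry count is exhausted.
    """
    for _ in range(retries + 1):
        ok, value = send()
        if ok:
            return value
        if has_recovery_action(value):
            wait_until_ready(test_unit_ready, sleep=sleep)
    raise OSError("Attempt to query device size failed: %s" % value)
```

The sleep function is passed in so the polling can be exercised without actually waiting a minute. With an error that has no recovery action, the loop burns through its attempts almost immediately, which would match a unit that needs about 60 seconds to come up still failing the probe.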

My guess is that the error returned by the Mylex unit may not be an
error with an associated recovery action.  So we just retry it four times
and then report the "Attempt to query device size failed ..." where ... is
the error.

Unfortunately, you're not getting the error printout, probably because of
serial console weirdness.  Could you try booting with -v?  That will cause
the full sense information for the error to get printed out, and maybe
we'll have a better chance of figuring out what the error is.

Also, once you boot up in single user mode, you might try the following
camcontrol command:

camcontrol cmd -n da -u 1 -v -c "25 0 0 0 0 0 0 0 0 0" -i 8 "i4 i4"

That will issue a read capacity command to da1, and print out the address
of the last block on the disk (one less than the total number of blocks)
and the block size.  The -v will tell camcontrol to print out sense
information.
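
For reference, the 8 bytes requested with -i 8 and decoded by "i4 i4" are two big-endian 32-bit integers. A minimal Python sketch of the same decoding, assuming the raw response bytes are already in hand (parse_read_capacity10 is a hypothetical helper name, not part of camcontrol):

```python
import struct


def parse_read_capacity10(data):
    """Decode an 8-byte READ CAPACITY(10) response.

    The first field is the address of the last block, so the block
    count is that value plus one. The second field is the block size
    in bytes. Returns (block_count, block_size).
    """
    if len(data) != 8:
        raise ValueError("expected 8 bytes, got %d" % len(data))
    last_lba, block_size = struct.unpack(">II", data)
    return last_lba + 1, block_size
```

For the healthy array in the quoted boot messages, response bytes of 10 f5 7f ff 00 00 02 00 would decode to 284524544 blocks of 512 bytes. An all-zero response decodes to a block size of 0, which is the case suspected above.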

Ken
-- 
Kenneth Merry
ken@plutotech.com

