From owner-freebsd-scsi Fri Jul 2 10:32:31 1999
Delivered-To: freebsd-scsi@freebsd.org
Received: from panzer.kdm.org (panzer.kdm.org [216.160.178.169])
	by hub.freebsd.org (Postfix) with ESMTP id 4DA6D1562C
	for ; Fri, 2 Jul 1999 10:32:24 -0700 (PDT)
	(envelope-from ken@panzer.kdm.org)
Received: (from ken@localhost)
	by panzer.kdm.org (8.9.3/8.9.1) id LAA51596;
	Fri, 2 Jul 1999 11:31:00 -0600 (MDT)
	(envelope-from ken)
Message-Id: <199907021731.LAA51596@panzer.kdm.org>
Subject: Re: FreeBSD panics with Mylex DAC960SX
In-Reply-To: <199907021646.LAA77311@aurora.sol.net>
	from Joe Greco at "Jul 2, 1999 11:46:41 am"
To: jgreco@ns.sol.net (Joe Greco)
Date: Fri, 2 Jul 1999 11:31:00 -0600 (MDT)
Cc: scsi@freebsd.org
From: "Kenneth D. Merry"
X-Mailer: ELM [version 2.4ME+ PL54 (25)]
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=ELM930936660-51339-0_
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-scsi@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

--ELM930936660-51339-0_
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

Joe Greco wrote...
> > Thanks for all the work on this! I talked to Justin for a minute, and I
> > think we've figured out what the problem is.
> >
> > It's a little more complicated than this, but the simple explanation is
> > that we aren't doing the right thing when a command comes back with just a
> > sense key and no ASC or ASCQ. It's hard to believe we haven't run into
> > this before, but I think that's the problem.
> >
> > Try applying the attached patch to scsi_all.c. It isn't the final patch
> > for this problem, the solution is probably a little more complicated than
> > this. But hopefully this will let us know whether the problem is what we
> > think it is.
>
[ ... ]
> (da0:ahc0:0:0:0): printed announcement
> o da0s1a
> (da0:ahc0:0:0:0): read capacity returned 0
> (da1:ahc0:0:1:0): READ CAPACITY. CDB: 25 0 0 0 0 0 0 0 0 0
> (da1:ahc0:0:1:0): NOT READY
> (da1:ahc0:0:1:0): fatal error, failed to attach to device
> (da1:ahc0:0:1:0): lost device
> (da1:ahc0:0:1:0): about to print announcement
> (da1:ahc0:0:1:0): printed announcement
> (da1:ahc0:0:1:0): removing device entry
> (da0:ahc0:0:0:0): READ CAPACITY. CDB: 25 0 0 0 0 0 0 0 0 0
> (da0:ahc0:0:0:0): NOT READY
> (da0:ahc0:0:0:0): address = 8496883, length = 512
> Enter full pathname of shell or RETURN for /bin/sh:
> erase ^H, kill ^U, intr ^C
> /sbin/camcontrol cmd -n da -u 1 -v -c 25 0 0 0 0 0 0 0 0 0 -i 8 i4 i4
> camcontrol: cam_lookup_pass: CAMGETPASSTHRU ioctl failed
> cam_lookup_pass: No such file or directory
> cam_lo(okup_pdass: either the apass driver isn'0t in your kernel:
> cam_lookup_pasas: or da1 doesn'ht exist
> end of ccamcontrol
[ ... ]
> I take it the "fatal error, failed to attach to device" is what you were
> trying for?

Yes and no. I forgot that you were running 3.1-era code. I made some
changes before 3.2 went out (I can't remember when, but you could probably
just look at the cvs logs for scsi_da.c and see) to make the da and cd
drivers attach in almost every case (the exception being when the drive
returns a "logical unit not supported" error).

So, what I would expect to happen with 3.2 or 4.0 code would be that the da
driver would attach, but you wouldn't be able to open it to fsck the drive
until the drive is ready.

> > You should be able to boot okay with this patch, although you probably
> > won't be able to fsck or mount the Mylex array until it's ready to run.
>
> I would expect that. If you have an elegant (or correct) solution to deal
> with this - for me, at least, preferably from userland - I'm all ears.
>
> What I'm thinking is just sitting there querying the thing every few
> seconds until it reports a size, then have CAM re-query the device. Does
> that seem reasonable?
>
> I'm actually not interested in a solution more complex than that because
> in the event the RAID fails, it does the same sort of thing, and I'd like
> to be able to get into the machine from remote and talk to the Mylex.

Well, I would suggest either upgrading the machine to a newer -stable code
base or applying the attached patch to scsi_da.c. That will make the da
driver attach even when your Mylex unit isn't ready.

Then, you might be able to do something like this in /etc/rc:

	camcontrol tur -n da -u 1 >/dev/null 2>&1
	while [ "$?" != 0 ]
	do
		sleep 1
		camcontrol tur -n da -u 1 >/dev/null 2>&1
	done

	...do fsck, etc...

You might also want to put a maximum count in there or something so that if
the Mylex doesn't become ready after a given period of time, you stop
waiting for it.

What the above script fragment is taking advantage of is that 'camcontrol
tur' will only exit with a 0 status if the device is ready.

We could also probably do something within the CAM error recovery code
along the same lines, but I think it would probably end up being a kludge,
and possibly not correct, at least within the current error recovery
framework.

If the above camcontrol trick works for you, that's probably the best
solution, since it gives you an easy way to tweak or disable the test if
you want.

Anyway, lemme know whether it works for you, or if you've got more
questions or whatever..
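
For what it's worth, the "maximum count" idea might look something like the
sketch below. It is untested against a real Mylex; the function name
wait_for_ready, the try count and the delay are made up for illustration,
and it assumes a POSIX-style /bin/sh with $(( )) arithmetic. The probe
command is passed in as arguments so that the camcontrol invocation stays
easy to tweak or swap out.

```shell
# Hypothetical helper (not part of any FreeBSD rc script): poll a readiness
# probe until it exits 0, giving up after a fixed number of attempts.
# Usage: wait_for_ready MAX_TRIES DELAY_SECONDS PROBE_COMMAND [ARGS...]
wait_for_ready() {
    max_tries=$1
    delay=$2
    shift 2
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # The probe is expected to exit 0 only when the device is ready,
        # which is the behaviour relied on for 'camcontrol tur' above.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries + 1))
        # Only sleep if another attempt is still coming.
        if [ "$tries" -lt "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    # Ran out of attempts; the caller decides whether to skip fsck.
    return 1
}
```

In /etc/rc this could be called as
"wait_for_ready 120 1 camcontrol tur -n da -u 1" before the fsck step, with
the fsck and mount of the array skipped if it returns non-zero, so the
machine still comes up far enough to be reached remotely.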

Ken
-- 
Kenneth Merry
ken@plutotech.com

--ELM930936660-51339-0_
Content-Type: text/plain; charset=US-ASCII
Content-Disposition: attachment; filename=scsi_da.c.1.22
Content-Description: scsi_da.c.1.22
Content-Transfer-Encoding: 7bit

Index: scsi_da.c
===================================================================
RCS file: /usr/local/cvs/src/sys/cam/scsi/scsi_da.c,v
retrieving revision 1.21
retrieving revision 1.22
diff -c -r1.21 -r1.22
*** scsi_da.c	1999/03/05 23:20:20	1.21
--- scsi_da.c	1999/05/06 20:16:04	1.22
***************
*** 25,31 ****
   * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
   * SUCH DAMAGE.
   *
!  *      $Id: scsi_da.c,v 1.21 1999/03/05 23:20:20 gibbs Exp $
   */
  
  #include "opt_hw_wdog.h"
--- 25,31 ----
   * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
   * SUCH DAMAGE.
   *
!  *      $Id: scsi_da.c,v 1.22 1999/05/06 20:16:04 ken Exp $
   */
  
  #include "opt_hw_wdog.h"
***************
*** 1381,1393 ****
  						&asc, &ascq);
  				}
  				/*
! 				 * With removable media devices, we expect
! 				 * 0x3a (Medium not present) errors, since not
! 				 * everyone leaves a disk in the drive. If
! 				 * the error is anything else, though, we
! 				 * shouldn't attach.
  				 */
! 				if ((have_sense) && (asc == 0x3a)
  				 && (error_code == SSD_CURRENT_ERROR))
  					snprintf(announce_buf,
  					    sizeof(announce_buf),
--- 1381,1392 ----
  						&asc, &ascq);
  				}
  				/*
! 				 * Attach to anything that claims to be a
! 				 * direct access or optical disk device,
! 				 * as long as it doesn't return a "Logical
! 				 * unit not supported" (0x25) error.
  				 */
! 				if ((have_sense) && (asc != 0x25)
  				 && (error_code == SSD_CURRENT_ERROR))
  					snprintf(announce_buf,
  					    sizeof(announce_buf),

--ELM930936660-51339-0_--

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-scsi" in the body of the message