From owner-freebsd-stable Wed Dec 1 10:08:02 1999
Delivered-To: freebsd-stable@freebsd.org
Received: from panzer.kdm.org (panzer.kdm.org [216.160.178.169])
    by hub.freebsd.org (Postfix) with ESMTP id 4A0571506F
    for ; Wed, 1 Dec 1999 10:07:55 -0800 (PST)
    (envelope-from ken@panzer.kdm.org)
Received: (from ken@localhost)
    by panzer.kdm.org (8.9.3/8.9.1) id LAA43219;
    Wed, 1 Dec 1999 11:06:35 -0700 (MST)
    (envelope-from ken)
Message-Id: <199912011806.LAA43219@panzer.kdm.org>
Subject: Re: vinum experiences.
In-Reply-To: <14405.8810.777783.992833@trooper.velocet.net> from David Gilbert at "Dec 1, 1999 08:28:10 am"
To: dgilbert@velocet.ca (David Gilbert)
Date: Wed, 1 Dec 1999 11:06:35 -0700 (MST)
Cc: stable@FreeBSD.ORG
From: "Kenneth D. Merry"
X-Mailer: ELM [version 2.4ME+ PL54 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

[ If you want to comment on SCSI issues, I would suggest mailing the -scsi
  list, since you'll get a wider audience of people who know about SCSI. ]

David Gilbert wrote...
> While I'm still chasing the memory corruption bug in vinum, I have a
> couple of observations.
>
> 1. Removing a device (at least, with the ahc controller) locks the bus
> even though I have a RAID hot-swap ready chassy (that properly
> isolates the bus between commands). In my test, I had a completely
> quiet SCSI bus when I removed one of the drives. I then wrote to the
> RAID array.
> I got:
>
> Nov 30 18:31:51 raid1 /kernel: (da8:ahc1:0:11:0): Invalidating pack
> Nov 30 18:31:51 raid1 /kernel: raid.p0.s6: fatal read I/O error
> Nov 30 18:31:51 raid1 /kernel: vinum: raid.p0.s6 is crashed by force
> Nov 30 18:31:52 raid1 /kernel: vinum: raid.p0 is degraded
> Nov 30 18:31:52 raid1 /kernel: d7: fatal drive I/O error
> Nov 30 18:31:52 raid1 /kernel: vinum: drive d7 is down
> Nov 30 18:31:52 raid1 /kernel: raid.p0.s6: fatal write I/O error
> Nov 30 18:31:52 raid1 /kernel: vinum: raid.p0.s6 is stale by force
> Nov 30 18:31:52 raid1 /kernel: d7: fatal drive I/O error
> Nov 30 18:31:52 raid1 /kernel: biodone: buffer already done

That looks like it may be a vinum issue. You shouldn't be getting buffers
done twice, as that error message indicates.

Have you talked to Greg at all about this? If you're chasing down bugs in
Vinum, it would make sense to contact the author and work with him to
either find the problem, or trace it to some other part of the system.

> Nov 30 18:31:52 raid1 /kernel: (da8:ahc1:0:11:0): Synchronize cache failed, status == 0x4a, scsi status == 0x0
> Nov 30 18:33:16 raid1 /kernel: (da8:ahc1:0:11:0): lost device
> Nov 30 18:33:16 raid1 /kernel: (da8:ahc1:0:11:0): removing device entry
>
> ... I got more than one of the Synchronize cache failed. the "lost
> device" was when I "camcontrol rescan 1" ... I did do a "camcontrol
> reset 1", but it didn't affect things.

All of that is normal. The synchronize cache failed since there was no
device there to talk to. You probably got more than one of those because
it was retried.

> The net result is that SCSI bus 1 was wedged after this. I would
> conjecture that removing a device (and running with this device
> removed is precisely what the chassy was designed to do) should not
> wedge things.

How do you know the bus was wedged? Could you issue SCSI commands with
camcontrol? e.g.:

camcontrol tur da10 -v

Will issue a test unit ready to da10. If it responds, the bus isn't
wedged.
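
If you want to try that against several of the remaining devices in one go,
something along these lines might do as a rough sketch. The function name
probe_bus and the device names passed to it are hypothetical, and it assumes
that camcontrol exits non-zero when the test unit ready fails, which is worth
confirming on your system first:

```sh
#!/bin/sh
# Sketch only: send a TEST UNIT READY to each named device and report
# which ones answer. probe_bus is a hypothetical helper name.
# Assumption: "camcontrol tur" exits non-zero when the command fails.
probe_bus() {
    responded=0
    for dev in "$@"; do
        # Output is discarded here; rerun by hand with -v to see sense data.
        if camcontrol tur "$dev" -v >/dev/null 2>&1; then
            echo "$dev: responded"
            responded=$((responded + 1))
        else
            echo "$dev: no response"
        fi
    done
    # Succeed if at least one device answered, i.e. the bus still
    # carries commands.
    [ "$responded" -gt 0 ]
}
```

Running "probe_bus da9 da10 && echo 'bus is not wedged'" (substituting whatever
devices are still on that bus) would then tell you whether anything answers.
Note that a unit that reports "not ready" has still answered on the bus, so a
"no response" line here deserves a second look by hand before you conclude
anything from it.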

> In fact, since the camcontrol rescan 1 was successful, I suggest that
> it was cam, not the ahc driver that was somehow wedged.

I don't think it's clear at all what wedged. The fact that you were able
to rescan the bus indicates that the CAM side of things is probably working
properly. One of the things that a rescan does is send a SCSI inquiry
command to every possible target ID on the bus. You can't do that if the
bus is wedged.

Ken
--
Kenneth Merry
ken@kdm.org

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message