From owner-freebsd-scsi@FreeBSD.ORG Mon Jan 8 11:08:51 2007
Return-Path:
X-Original-To: freebsd-scsi@FreeBSD.org
Delivered-To: freebsd-scsi@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
        by hub.freebsd.org (Postfix) with ESMTP id 625C316A505
        for ; Mon, 8 Jan 2007 11:08:51 +0000 (UTC)
        (envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org [69.147.83.40])
        by mx1.freebsd.org (Postfix) with ESMTP id 4F6D313C45A
        for ; Mon, 8 Jan 2007 11:08:51 +0000 (UTC)
        (envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (linimon@localhost [127.0.0.1])
        by freefall.freebsd.org (8.13.4/8.13.4) with ESMTP id l08B8ptY016610
        for ; Mon, 8 Jan 2007 11:08:51 GMT
        (envelope-from owner-bugmaster@FreeBSD.org)
Received: (from linimon@localhost)
        by freefall.freebsd.org (8.13.4/8.13.4/Submit) id l08B8nrw016606
        for freebsd-scsi@FreeBSD.org; Mon, 8 Jan 2007 11:08:49 GMT
        (envelope-from owner-bugmaster@FreeBSD.org)
Date: Mon, 8 Jan 2007 11:08:49 GMT
Message-Id: <200701081108.l08B8nrw016606@freefall.freebsd.org>
X-Authentication-Warning: freefall.freebsd.org: linimon set sender to
        owner-bugmaster@FreeBSD.org using -f
From: FreeBSD bugmaster
To: freebsd-scsi@FreeBSD.org
Cc:
Subject: Current problem reports assigned to you
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: SCSI subsystem
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 08 Jan 2007 11:08:51 -0000

Current FreeBSD problem reports

Critical problems

Serious problems

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/27059   scsi       [sym] SCSI subsystem hangs under heavy load on (Server
o kern/39388   scsi       ncr/sym drivers fail with 53c810 and more than 256MB m
o kern/40895   scsi       wierd kernel / device driver bug
o kern/52638   scsi       [panic] SCSI U320 on SMP server won't run faster than
s kern/57398   scsi       [mly] Current fails to install on mly(4) based RAID di
o kern/60598   scsi       wire down of scsi devices conflicts with config
o kern/60641   scsi       [sym] Sporadic SCSI bus resets with 53C810 under load
s kern/61165   scsi       [panic] kernel page fault after calling cam_send_ccb
o kern/74627   scsi       [ahc] [hang] Adaptec 2940U2W Can't boot 5.3
o kern/81887   scsi       [aac] Adaptec SCSI 2130S aac0: GetDeviceProbeInfo comm
o kern/90282   scsi       [sym] SCSI bus resets cause loss of ch device
o kern/92798   scsi       [ahc] SCSI problem with timeouts
o kern/93128   scsi       [sym] FreeBSD 6.1 BETA 1 has problems with Symbios/LSI
o kern/94838   scsi       Kernel panic while mounting SD card with lock switch o
o kern/99954   scsi       [ahc] reading from DVD failes on 6.x (regression)

15 problems total.

Non-critical problems

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o kern/23314   scsi       aic driver fails to detect Adaptec 1520B unless PnP is
o kern/35234   scsi       World access to /dev/pass? (for scanner) requires acce
o kern/38828   scsi       [feature request] DPT PM2012B/90 doesn't work
o kern/44587   scsi       dev/dpt/dpt.h is missing defines required for DPT_HAND
o kern/76178   scsi       [ahd] Problem with ahd and large SCSI Raid system
o kern/96133   scsi       [scsi] [patch] add scsi quirk for joyfly 128mb flash u
o kern/103702  scsi       [cam] [patch] ChipsBnk: Unsupported USB memory stick

7 problems total.

From owner-freebsd-scsi@FreeBSD.ORG Tue Jan 9 07:32:04 2007
X-Mailer: exmh version 2.7.2 01/07/2005 with nmh-1.2
To: freebsd-scsi@FreeBSD.org
Date: Tue, 09 Jan 2007 09:06:46 +0200
From: Danny Braniss
Cc: freebsd-hackers@freebsd.org
Subject: iSCSI disconnects dilema

Hi,
While I think I have almost solved the problem of network disconnects,
it dawned on me that there is a major problem:
When a 'local' disk crashes, the kernel will probably hang/panic/crash.
If I don't try to recover, then there is no change in the above scenario.
If I try to recover, then the client does not know that it should
umount/fsck/mount.
While all this seems familiar, removing a floppy/disk-on-key while it's
mounted, we could always say "you shouldn't have done that!", with
a network connection, it can happen very often - rebooting the target, a
network hiccup, etc.

So, any ideas?

danny

From owner-freebsd-scsi@FreeBSD.ORG Tue Jan 9 14:53:25 2007
From: John Nielsen
To: freebsd-hackers@freebsd.org
Date: Tue, 9 Jan 2007 09:31:28 -0500
User-Agent: KMail/1.9.5
Message-Id: <200701090931.28786.lists@jnielsen.net>
Cc: freebsd-scsi@freebsd.org
Subject: Re: iSCSI disconnects dilema

On Tuesday 09 January 2007 02:06, Danny Braniss wrote:
> Hi,
> While I think I have almost solved the problem of network disconnects,
> it dawned on me that there is a major problem:
> When a 'local' disk crashes, the kernel will probably hang/panic/crash.
> If I don't try to recover, then there is no change in the above scenario.
> If I try to recover, then the client does not know that it should
> umount/fsck/mount.
> While all this seems familiar, removing a floppy/disk-on-key while it's
> mounted, we could always say "you shouldn't have done that!", with
> a network connection, it can happen very often - rebooting the target, a
> network hiccup, etc.
>
> So, any ideas?

I think that an iSCSI network disconnect (if handled properly) is more like a
bad/flaky set of sectors and/or extremely high latency than a total disk
crash. The initiator should stall as long as it can while trying to reconnect
the session, and then send "hardware" timeout errors up the stack. The rest
of the OS should handle those the same as it would any other timeout
errors--retry a certain number of times and then fail. I don't know how
graceful the failure case is (perhaps not very), but it's an honest
approximation.

The above approach is IMO more than adequate for network interruptions
lasting a few seconds (or a bit more). I'm not sure there's anything you can
realistically do more than that. Administrators who intentionally reboot a
nonredundant iSCSI target while it has active sessions are asking for
trouble, and if the reboot is accidental they should do one or more of
a) know to run fsck manually, b) get a better UPS, c) get a more
stable/redundant iSCSI target device.

Disclaimer: I know next to nothing about kernel programming, device driver
development, or scsi in general. I've just been playing with and thinking
about iSCSI on FreeBSD a fair amount lately.

Thanks for your continued work on this.

JN

From owner-freebsd-scsi@FreeBSD.ORG Tue Jan 9 16:38:52 2007
Date: Tue, 9 Jan 2007 17:16:29 +0100 (CET)
Message-Id: <200701091616.l09GGTJu020581@lurza.secnetix.de>
From: Oliver Fromme
To: freebsd-hackers@FreeBSD.ORG, freebsd-scsi@FreeBSD.ORG, danny@cs.huji.ac.il
User-Agent: tin/1.8.2-20060425 ("Shillay") (UNIX) (FreeBSD/4.11-STABLE (i386))
Subject: Re: iSCSI disconnects dilema
Reply-To: freebsd-hackers@FreeBSD.ORG, freebsd-scsi@FreeBSD.ORG,
        danny@cs.huji.ac.il

Danny Braniss wrote:
 > While I think I have almost solved the problem of network disconnects,
 > it dawned on me that there is a major problem:
 > When a 'local' disk crashes, the kernel will probably hang/panic/crash.
 > If I don't try to recover, then there is no change in the above scenario.
 > If I try to recover, then the client does not know that it should
 > umount/fsck/mount.
 > While all this seems familiar, removing a floppy/disk-on-key while it's
 > mounted, we could always say "you shouldn't have done that!", with
 > a network connection, it can happen very often - rebooting the target, a
 > network hiccup, etc.

The IEEE1394 code (firewire) contains a hack so you can
remove a _mounted_ drive (yes, pull the plug!) and later
reconnect it and continue to use the filesystem. I think
processes that try to access the file system while the
drive is unavailable are blocked ("D" state a.k.a.
"diskwait"). The purpose of that feature is that you can
change the topology (e.g. remove a device that's not at
the end of the bus) without having to unmount all other
devices.

Well, it's just a hack, and I don't know if something
similar is applicable to the iSCSI situation. But I
thought it wouldn't hurt to mention it anyhow.

Best regards
   Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Services with a focus on FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"If you think C++ is not overly complicated, just what is a protected
abstract virtual base pure virtual private destructor, and when was the
last time you needed one?"
        -- Tom Cargil, C++ Journal

From owner-freebsd-scsi@FreeBSD.ORG Tue Jan 9 17:05:19 2007
From: John Nielsen
To: freebsd-hackers@freebsd.org
Date: Tue, 9 Jan 2007 12:02:31 -0500
User-Agent: KMail/1.9.5
Message-Id: <200701091202.32226.lists@jnielsen.net>
Cc: freebsd-scsi@freebsd.org, Dan Nelson
Subject: Re: iSCSI disconnects dilema

Forwarding a relevant comment from a parallel discussion on -questions.

----------  Forwarded Message  ----------

Subject: Re: iSCSI
Date: Tuesday 09 January 2007 11:35
From: Dan Nelson
To: DAve
Cc: Free BSD Questions list

In the last episode (Jan 09), DAve said:
> The developers response, for those who are interested.
>
> hi Dave,
> the initiator for iSCSI will hit stable/current real soon now.
> that was the good news, now for the down side:
> what was missing all along was recovery from network disconnects, so
> while I think I have it almost worked out, I've come across a major
> flaw in the iscsi design:
> when the target crashes, and comes back, there is no way
> to tell the client to run an fsck. This is not a problem if the
> client is mounting the iscsi partition read only.
>
> danny

Why should the client need to do an fsck? From its point of view it
should just look like the target had the iSCSI equivalent of a bus
reset. It should resend any queued requests and continue.

On Tuesday 09 January 2007 02:06, Danny Braniss wrote:
> Hi,
> While I think I have almost solved the problem of network disconnects,
> it dawned on me that there is a major problem:
> When a 'local' disk crashes, the kernel will probably hang/panic/crash.
> If I don't try to recover, then there is no change in the above scenario.
> If I try to recover, then the client does not know that it should
> umount/fsck/mount.
> While all this seems familiar, removing a floppy/disk-on-key while it's
> mounted, we could always say "you shouldn't have done that!", with
> a network connection, it can happen very often - rebooting the target, a
> network hiccup, etc.

From owner-freebsd-scsi@FreeBSD.ORG Tue Jan 9 21:08:26 2007
To: freebsd-scsi@freebsd.org
From: Ivan Voras
Date: Tue, 09 Jan 2007 21:04:28 +0100
References: <200701090931.28786.lists@jnielsen.net>
User-Agent: Thunderbird 1.5.0.9 (Windows/20061207)
In-Reply-To: <200701090931.28786.lists@jnielsen.net>
Cc: freebsd-hackers@freebsd.org
Subject: Re: iSCSI disconnects dilema

John Nielsen wrote:

> I don't know how
> graceful the failure case is (perhaps not very)...

Not at all - removing a mounted USB device panics the kernel.

From owner-freebsd-scsi@FreeBSD.ORG Wed Jan 10 13:42:16 2007
Message-ID: <45A4E3AD.1040600@ctgameinfo.com>
Date: Wed, 10 Jan 2007 05:01:33 -0800
From: Cstdenis
User-Agent: Thunderbird 1.5.0.9 (Windows/20061207)
To: freebsd-scsi@freebsd.org
Subject: Bug in aac?

I am running 6.1-p11 with an Adaptec SAS RAID 4800SAS running a mirror of
2 15k rpm SCSI drives.

Under heavy IO load (It's a database server) I get the following
accompanied by serious system lag:

Jan 9 20:52:42 ayu kernel: aac0: COMMAND 0xc8f56f80 TIMEOUT AFTER 34 SECONDS
Jan 9 20:52:42 ayu kernel: aac0: COMMAND 0xc8f54f00 TIMEOUT AFTER 34 SECONDS
Jan 9 20:52:42 ayu kernel: aac0: COMMAND 0xc8f56b40 TIMEOUT AFTER 34 SECONDS
Jan 9 20:52:42 ayu kernel: aac0: COMMAND 0xc
Jan 9 20:52:43 ayu kernel: 8f56740 TIMEOUT AFTER 34 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f57640 TIMEOUT AFTER 34 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f58440 TIMEOUT AFTER 34 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f57bc0 TIMEOUT AFTER 34 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f59b40 TIMEOUT AFTER 35 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f57c80 TIMEOUT AFTER 35 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f59600 TIMEOUT AFTER 35 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f5a0c0 TIMEOUT AFTER 35 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f58c00 TIMEOUT AFTER 35 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f55f40 TIMEOUT AFTER 35 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f580c0 TIMEOUT AFTER 35 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f53280 TIMEOUT AFTER 35 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f58bc0 TIMEOUT AFTER 35 SECONDS
Jan 9 20:52:43 ayu kernel: aac0: COMMAND 0xc8f59940 TIMEOUT AFTER 35 SECONDS

(hex after command and number of seconds varies)

Excerpts from dmesg
-------------------
FreeBSD 6.1-RELEASE-p11 #0: Wed Jan 3 19:06:12 CST 2007
    root@ayu.ctgameinfo.com:/usr/obj/usr/src/sys/AYU
Timecounter "i8254" frequency 1193182 Hz
quality 0
CPU: Intel(R) Xeon(R) CPU 3060 @ 2.40GHz (2394.01-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x6f6  Stepping = 6
  Features=0xbfebfbff
  Features2=0xe3bd,CX16,,>
  AMD Features=0x20100000
  AMD Features2=0x1
  Cores per package: 2
real memory  = 3622699008 (3454 MB)
avail memory = 3545722880 (3381 MB)
aac0: mem 0xd8400000-0xd85fffff,0xd8200000-0xd83fffff,0xe0000000-0xe7ffffff irq 26 at device 14.0 on pci11
aac0: New comm. interface enabled
aac0: Adaptec Raid Controller 2.0.0-1
aacd0: on aac0
aacd0: 69988MB (143335424 sectors)

The problem happens a few times a day, each time lasting only a matter of
minutes.

I searched the mailing lists for others having this problem, but all I
found were older ones from early 5.x that are supposed to be fixed now.

From owner-freebsd-scsi@FreeBSD.ORG Thu Jan 11 16:27:46 2007
Message-ID: <45A66576.9010106@samsco.org>
Date: Thu, 11 Jan 2007 08:27:34 -0800
From: Scott Long
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.2pre)
        Gecko/20061227 SeaMonkey/1.1
To: Cstdenis
References: <45A4E3AD.1040600@ctgameinfo.com>
In-Reply-To: <45A4E3AD.1040600@ctgameinfo.com>
Cc: freebsd-scsi@freebsd.org
Subject: Re: Bug in aac?

Cstdenis wrote:
> I am running 6.1-p11 with an Adaptec SAS RAID 4800SAS running a mirror of
> 2 15k rpm SCSI drives.
>
> Under heavy IO load (It's a database server) I get the following
> accompanied by serious system lag:
>

The system recovers after this? Strange. What the messages mean is
that I/O has been sent to the controller, and the controller has not
responded in a reasonable period of time. Usually this is a sign that
the controller has died and will not recover. So if it is recovering
then either there is a firmware bug that is making the controller pause
for a long period of time, or there is some sort of yet-undiscovered
driver bug. Of course you should make sure that you're running the
latest firmware from Adaptec. These cards are new and SAS in general
is relatively new, so bugs are not unlikely. One question, though, how
are you running SCSI drives on a SAS controller? Are you going through
some sort of converter?

Scott

From owner-freebsd-scsi@FreeBSD.ORG Thu Jan 11 18:10:09 2007
Message-ID: <45A67D56.1080706@ctgameinfo.com>
Date: Thu, 11 Jan 2007 10:09:26 -0800
From: Cstdenis
User-Agent: Thunderbird 1.5.0.9 (Windows/20061207)
To: Scott Long
References: <45A4E3AD.1040600@ctgameinfo.com> <45A66576.9010106@samsco.org>
In-Reply-To: <45A66576.9010106@samsco.org>
Cc: freebsd-scsi@freebsd.org
Subject: Re: Bug in aac?

Yes, the system does recover. I have not been actively using the system
when this happens so I'm not sure how long it takes, but it looks like a
few to several minutes.

It's a dedicated server at a hosting company -- I don't have physical
access to the hardware so I don't know the exact details. The info I gave
was a combination of dmesg and what I ordered.

Here is what the web control panel says I have:

Motherboard        SuperMicro PDSMI+ Intel Pentium DualCore SingleProc Sata [1Proc]
Processor          Intel Xeon 3060-Dual Core [2.4GHz]
Drive Controller   Adaptec 4800SAS SA-SCSI RAID-1 Controller
Hard Drive 1       Fujitsu MAX3073 SAS 3073 [73GB]
Hard Drive 2       Fujitsu MAX3073 SAS 3073 [73GB]

I will try requesting a firmware upgrade. If that doesn't work, is there
more information I can provide to help get the bug fixed?

I tried compiling AAC_DEBUG=3 into the kernel but it made the system
unusable with the constant flow of debug data. I worry that AAC_DEBUG=1
will also be too much for the system to be usable, but I'm not sure.

Scott Long wrote:
> Cstdenis wrote:
>
>> I am running 6.1-p11 with an Adaptec SAS RAID 4800SAS running a mirror of
>> 2 15k rpm SCSI drives.
>>
>> Under heavy IO load (It's a database server) I get the following
>> accompanied by serious system lag:
>>
>>
>
> The system recovers after this? Strange. What the messages mean is
> that I/O has been sent to the controller, and the controller has not
> responded in a reasonable period of time. Usually this is a sign that
> the controller has died and will not recover. So if it is recovering
> then either there is a firmware bug that is making the controller pause
> for a long period of time, or there is some sort of yet-undiscovered
> driver bug. Of course you should make sure that you're running the
> latest firmware from Adaptec. These cards are new and SAS in general
> is relatively new, so bugs are not unlikely. One question, though, how
> are you running SCSI drives on a SAS controller? Are you going through
> some sort of converter?
>
> Scott
>

From owner-freebsd-scsi@FreeBSD.ORG Fri Jan 12 19:25:19 2007
Date: Fri, 12 Jan 2007 20:02:49 +0100
From: Pawel Jakub Dawidek
To: Danny Braniss
Message-ID: <20070112190249.GB90718@garage.freebsd.pl>
User-Agent: mutt-ng/devel-r804 (FreeBSD)
Cc: freebsd-scsi@FreeBSD.org, freebsd-hackers@freebsd.org
Subject: Re: iSCSI disconnects dilema

On Tue, Jan 09, 2007 at 09:06:46AM +0200, Danny Braniss wrote:
> Hi,
> While I think I have almost solved the problem of network disconnects,
> it dawned on me that there is a major problem:
> When a 'local' disk crashes, the kernel will probably hang/panic/crash.
> If I don't try to recover, then there is no change in the above scenario.
> If I try to recover, then the client does not know that it should
> umount/fsck/mount.
> While all this seems familiar, removing a floppy/disk-on-key while it's
> mounted, we could always say "you shouldn't have done that!", with
> a network connection, it can happen very often - rebooting the target, a
> network hiccup, etc.
>
> So, any ideas?

In my opinion it should be done this way:

You have a queue of I/O requests. You send them to the other end and wait
for confirmation. Until confirmation is received, you keep the requests
queued. If the other end dies, you try to reconnect; until some timeout
expires, the processes which sent those requests will just wait. If you
reconnect successfully, you resend the unconfirmed requests; if you can't
reconnect, you just pass the errors up.

This is what I did in ggate and it seems to work.

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd@FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!

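A minimal sketch of the queue-and-resend scheme described in the message
above, written in Python for brevity. It is not the ggate or iSCSI initiator
code. The ResendQueue class, its method names, and the transport object with
send(tag, data) and reconnect() methods that raise ConnectionError are all
hypothetical, chosen only to illustrate the idea.

```python
import time


class ResendQueue:
    """Hypothetical sketch: hold each request until the peer confirms it."""

    def __init__(self, transport, reconnect_timeout=30.0, retry_delay=1.0):
        # Assumed transport interface: send(tag, data) and reconnect(),
        # both raising ConnectionError while the link is down.
        self.transport = transport
        self.reconnect_timeout = reconnect_timeout
        self.retry_delay = retry_delay
        self.pending = {}      # tag -> data, in submission order
        self.next_tag = 0

    def submit(self, data):
        """Queue a request, then send it; the caller blocks during recovery."""
        tag = self.next_tag
        self.next_tag += 1
        self.pending[tag] = data   # queued before sending, so it can be resent
        try:
            self.transport.send(tag, data)
        except ConnectionError:
            self._recover()
        return tag

    def confirm(self, tag):
        """The peer confirmed this request, so it can leave the queue."""
        self.pending.pop(tag, None)

    def _recover(self):
        """Reconnect and resend unconfirmed requests, or fail them all."""
        deadline = time.monotonic() + self.reconnect_timeout
        while True:
            try:
                self.transport.reconnect()
                for tag, data in list(self.pending.items()):
                    self.transport.send(tag, data)
                return
            except ConnectionError:
                if time.monotonic() >= deadline:
                    failed = list(self.pending)
                    self.pending.clear()
                    # Pass the error up, as a disk timeout would be.
                    raise OSError(f"reconnect timed out; failed requests {failed}")
                time.sleep(self.retry_delay)
```

Note that a request leaves the queue only when confirm() is called, so this
sketch resends anything unconfirmed but cannot detect a request that was
confirmed and then lost on the other end.
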
From owner-freebsd-scsi@FreeBSD.ORG Fri Jan 12 19:31:06 2007
X-Mailer: exmh version 2.7.2 01/07/2005 with nmh-1.2
To: Pawel Jakub Dawidek
In-reply-to: Your message of Fri, 12 Jan 2007 20:02:49 +0100.
Date: Fri, 12 Jan 2007 21:31:04 +0200
From: Danny Braniss
Cc: freebsd-scsi@FreeBSD.org, freebsd-hackers@freebsd.org
Subject: Re: iSCSI disconnects dilema

> On Tue, Jan 09, 2007 at 09:06:46AM +0200, Danny Braniss wrote:
> > Hi,
> > While I think I have almost solved the problem of network disconnects,
> > it dawned on me that there is a major problem:
> > When a 'local' disk crashes, the kernel will probably hang/panic/crash.
> > if i don't try to recover, then there is no change in the above scenario.
> > if i try to recover, then the client does not know that it should
> > umount/fsck/mount.
> > While all this seems familiar, removing a floppy/disk-on-key while it's
> > mounted, we could always say "you shouldn't have done that!", with
> > a network connection, it can happen very often - rebooting the target, a
> > network hickup, etc.
> >
> > So, any ideas?
>
> In my opinion it should be done this way:
>
> You have a queue of I/O requests. You send the to the other end and wait
> for confirmation. Until confirmation is received, you keep the requests
> queued. If the other end dies, you try to reconnect (until some timeout
> expires, the processes which send those requests will just wait), if you
> reconnect successfully, you resend not-confirmed requests, if you won't
> be able to reconnect, you just pass the errors up.
>
> This is what I did in ggate and it seems to work.

That is basically what I'm doing - unacked requests get requeued.
The problem I fear (and maybe I'm paranoid :-):
assume the following scenario: the client (initiator) sends a write command,
the target acks it, and then it crashes. If the write was never completed,
the initiator goes on as if nothing ever happened.

danny From owner-freebsd-scsi@FreeBSD.ORG Fri Jan 12 20:14:12 2007 Return-Path: X-Original-To: freebsd-scsi@freebsd.org Delivered-To: freebsd-scsi@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 09EC516A415; Fri, 12 Jan 2007 20:14:12 +0000 (UTC) (envelope-from wb@freebie.xs4all.nl) Received: from smtp-vbr16.xs4all.nl (smtp-vbr16.xs4all.nl [194.109.24.36]) by mx1.freebsd.org (Postfix) with ESMTP id 9654413C474; Fri, 12 Jan 2007 20:14:11 +0000 (UTC) (envelope-from wb@freebie.xs4all.nl) Received: from freebie.xs4all.nl (freebie.xs4all.nl [213.84.32.253]) by smtp-vbr16.xs4all.nl (8.13.8/8.13.8) with ESMTP id l0CJtow6022924; Fri, 12 Jan 2007 20:55:51 +0100 (CET) (envelope-from wb@freebie.xs4all.nl) Received: from freebie.xs4all.nl (localhost [127.0.0.1]) by freebie.xs4all.nl (8.13.8/8.13.3) with ESMTP id l0CJtoGl077324; Fri, 12 Jan 2007 20:55:50 +0100 (CET) (envelope-from wb@freebie.xs4all.nl) Received: (from wb@localhost) by freebie.xs4all.nl (8.13.8/8.13.6/Submit) id l0CJto9I077323; Fri, 12 Jan 2007 20:55:50 +0100 (CET) (envelope-from wb) Date: Fri, 12 Jan 2007 20:55:50 +0100 From: Wilko Bulte To: Danny Braniss Message-ID: <20070112195549.GA77181@freebie.xs4all.nl> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.11 X-Virus-Scanned: by XS4ALL Virus Scanner Cc: freebsd-scsi@freebsd.org, Pawel Jakub Dawidek , freebsd-hackers@freebsd.org Subject: Re: iSCSI disconnects dilema X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 12 Jan 2007 20:14:12 -0000 On Fri, Jan 12, 2007 at 09:31:04PM +0200, Danny Braniss wrote.. 
> > > > --s/l3CgOIzMHHjg/5 > > Content-Type: text/plain; charset=iso-8859-2 > > Content-Disposition: inline > > Content-Transfer-Encoding: quoted-printable > > > > On Tue, Jan 09, 2007 at 09:06:46AM +0200, Danny Braniss wrote: > > > Hi, > > > While I think I have almost solved the problem of network disconnects, > > > It downed on me a major problem: > > > When a 'local' disk crashes, the kernel will probably hang/panic/crash. > > > if i don't try to recover, then there is no change in the above scenario. > > > if i try to recover, then the client does not know that it should > > > umount/fsck/mount. > > > While all this seems familiar, removing a floppy/disk-on-key while it's > > > mounted, we could always say "you shouldn't have done that!", with > > > a network connection, it can happen very often - rebooting the target, a > > > network hickup, etc. > > >=20 > > > So, any ideas? > > > > In my opinion it should be done this way: > > > > You have a queue of I/O requests. You send the to the other end and wait > > for confirmation. Until confirmation is received, you keep the requests > > queued. If the other end dies, you try to reconnect (until some timeout > > expires, the processes which send those requests will just wait), if you > > reconnect successfully, you resend not-confirmed requests, if you won't > > be able to reconnect, you just pass the errors up. > > > > This is what I did in ggate and it seems to work. > > That is basically what i'm doing - unacked request get requed. > the problem I fear (and maybe I'm paranoid :-): Paranoia is a Good Thing(TM) in data storage land :-) > assume the following scenario, the client(initiator) sends a write command, > the target acks it, then it crashes, if the write was never completed, > the initiator goes on as nothing ever happened. Yes, but what can the initiator do about that? I mean, it does not have any visibility of what the target has (or has not) done with the data. 
This is roughly the same as a RAID box accepting a write into a writeback cache
and ACK-ing to the host. You can only assume that the RAID box's cache
will get flushed to the spindles properly. All the usual horror scenarios
with a broken battery backup of the cache, a power failure, etc. apply here.

Wilko

-- 
Wilko Bulte wilko@FreeBSD.org

From owner-freebsd-scsi@FreeBSD.ORG Fri Jan 12 20:59:50 2007 Return-Path: X-Original-To: freebsd-scsi@freebsd.org Delivered-To: freebsd-scsi@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id D7FBB16A416 for ; Fri, 12 Jan 2007 20:59:50 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57]) by mx1.freebsd.org (Postfix) with ESMTP id 8FD9713C43E for ; Fri, 12 Jan 2007 20:59:50 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from phobos.samsco.home (phobos.samsco.home [192.168.254.11]) (authenticated bits=0) by pooker.samsco.org (8.13.4/8.13.4) with ESMTP id l0CKxJb0031877; Fri, 12 Jan 2007 13:59:24 -0700 (MST) (envelope-from scottl@samsco.org) Message-ID: <45A7F6A4.4030707@samsco.org> Date: Fri, 12 Jan 2007 13:59:16 -0700 From: Scott Long User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.2pre) Gecko/20061227 SeaMonkey/1.1 MIME-Version: 1.0 To: Wilko Bulte References: <20070112195549.GA77181@freebie.xs4all.nl> In-Reply-To: <20070112195549.GA77181@freebie.xs4all.nl> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (pooker.samsco.org [168.103.85.57]); Fri, 12 Jan 2007 13:59:24 -0700 (MST) X-Spam-Status: No, score=-1.4 required=3.8 tests=ALL_TRUSTED autolearn=failed version=3.1.1 X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on pooker.samsco.org Cc: freebsd-scsi@freebsd.org, Pawel Jakub Dawidek , freebsd-hackers@freebsd.org Subject: Re: iSCSI disconnects
dilema X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 12 Jan 2007 20:59:51 -0000 Wilko Bulte wrote: > On Fri, Jan 12, 2007 at 09:31:04PM +0200, Danny Braniss wrote.. >>> --s/l3CgOIzMHHjg/5 >>> Content-Type: text/plain; charset=iso-8859-2 >>> Content-Disposition: inline >>> Content-Transfer-Encoding: quoted-printable >>> >>> On Tue, Jan 09, 2007 at 09:06:46AM +0200, Danny Braniss wrote: >>>> Hi, >>>> While I think I have almost solved the problem of network disconnects, >>>> It downed on me a major problem: >>>> When a 'local' disk crashes, the kernel will probably hang/panic/crash. >>>> if i don't try to recover, then there is no change in the above scenario. >>>> if i try to recover, then the client does not know that it should >>>> umount/fsck/mount. >>>> While all this seems familiar, removing a floppy/disk-on-key while it's >>>> mounted, we could always say "you shouldn't have done that!", with >>>> a network connection, it can happen very often - rebooting the target, a >>>> network hickup, etc. >>>> =20 >>>> So, any ideas? >>> In my opinion it should be done this way: >>> >>> You have a queue of I/O requests. You send the to the other end and wait >>> for confirmation. Until confirmation is received, you keep the requests >>> queued. If the other end dies, you try to reconnect (until some timeout >>> expires, the processes which send those requests will just wait), if you >>> reconnect successfully, you resend not-confirmed requests, if you won't >>> be able to reconnect, you just pass the errors up. >>> >>> This is what I did in ggate and it seems to work. >> That is basically what i'm doing - unacked request get requed. 
>> the problem I fear (and maybe I'm paranoid :-):
>
> Paranoia is a Good Thing(TM) in data storage land :-)
>
>> assume the following scenario, the client(initiator) sends a write command,
>> the target acks it, then it crashes, if the write was never completed,
>> the initiator goes on as nothing ever happened.
>
> Yes, but what can the initiator do about that? I mean, it does not have any
> visibility of what the target has (or has not) done with the data. '
>
> This is roughly the same as a RAID box accepting a write into a writeback cache
> and ACK-ing to the host. You can only assume that the RAID box' cache
> will get flushed to the spindles properly. All the usual horror scenarios
> with a broken battery backup of the cache and a powerfailure etc apply here.
>
> Wilko
>

I forget, does iSCSI have a concept of a flush_cache command, or the
equivalent of what parallel SCSI does with ordered tags? If so, then
that's how your app or OS knows that the transaction got committed to
stable storage. It's been long assumed in the external storage world
that you are at the mercy of the external storage cache, so the problem
that Danny is referring to is nothing new. The real question is how
to implement the equivalent mechanism that iSCSI provides in a way that
the OS/app can make use of it. For example, CAM issues an ordered tag
periodically to flush the disk cache to stable storage. Most storage
drivers, including CAM, will issue some sort of a flush_cache command to
the controller and media during system shutdown.

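For reference, the SCSI command usually meant by "flush_cache" is SYNCHRONIZE CACHE (10), operation code 0x35. The snippet below is a minimal sketch of how its 10-byte CDB is laid out; it is written in Python purely for illustration and is not taken from CAM or from any iSCSI driver. The function name and its parameters are made up for the example.

```python
import struct

SYNCHRONIZE_CACHE_10 = 0x35  # SCSI operation code for SYNCHRONIZE CACHE (10)
IMMED = 0x02                 # byte 1, bit 1: return status before the flush has finished


def synchronize_cache_10(lba=0, num_blocks=0, immed=False):
    """Build the 10-byte CDB (hypothetical helper).

    num_blocks == 0 asks the device to flush everything from lba
    to the end of the medium.
    """
    flags = IMMED if immed else 0
    # Layout, big-endian: opcode, flags, 4-byte LBA, group number,
    # 2-byte number of blocks, control byte.
    return struct.pack(">BBIBHB", SYNCHRONIZE_CACHE_10, flags, lba, 0, num_blocks, 0)
```

Leaving IMMED clear is what makes the command useful as a commit point: good status then means the flush has completed, assuming the target is honest about it.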
Scott From owner-freebsd-scsi@FreeBSD.ORG Sat Jan 13 10:13:57 2007 Return-Path: X-Original-To: freebsd-scsi@freebsd.org Delivered-To: freebsd-scsi@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id EA30E16A412; Sat, 13 Jan 2007 10:13:57 +0000 (UTC) (envelope-from danny@cs.huji.ac.il) Received: from cs1.cs.huji.ac.il (cs1.cs.huji.ac.il [132.65.16.10]) by mx1.freebsd.org (Postfix) with ESMTP id 782FE13C4DB; Sat, 13 Jan 2007 10:13:57 +0000 (UTC) (envelope-from danny@cs.huji.ac.il) Received: from pampa.cs.huji.ac.il ([132.65.80.32]) by cs1.cs.huji.ac.il with esmtp id 1H5ftb-000Okd-FO; Sat, 13 Jan 2007 12:13:55 +0200 X-Mailer: exmh version 2.7.2 01/07/2005 with nmh-1.2 To: Scott Long In-reply-to: <45A7F6A4.4030707@samsco.org> References: <20070112195549.GA77181@freebie.xs4all.nl> <45A7F6A4.4030707@samsco.org> Comments: In-reply-to Scott Long message dated "Fri, 12 Jan 2007 13:59:16 -0700." Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Sat, 13 Jan 2007 12:13:55 +0200 From: Danny Braniss Message-ID: Cc: Wilko Bulte , Pawel Jakub Dawidek , freebsd-hackers@freebsd.org, freebsd-scsi@freebsd.org Subject: Re: iSCSI disconnects dilema X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Jan 2007 10:13:58 -0000 > Wilko Bulte wrote: > > On Fri, Jan 12, 2007 at 09:31:04PM +0200, Danny Braniss wrote.. > >>> --s/l3CgOIzMHHjg/5 > >>> Content-Type: text/plain; charset=iso-8859-2 > >>> Content-Disposition: inline > >>> Content-Transfer-Encoding: quoted-printable > >>> > >>> On Tue, Jan 09, 2007 at 09:06:46AM +0200, Danny Braniss wrote: > >>>> Hi, > >>>> While I think I have almost solved the problem of network disconnects, > >>>> It downed on me a major problem: > >>>> When a 'local' disk crashes, the kernel will probably hang/panic/crash. 
> >>>> if i don't try to recover, then there is no change in the above scenario. > >>>> if i try to recover, then the client does not know that it should > >>>> umount/fsck/mount. > >>>> While all this seems familiar, removing a floppy/disk-on-key while it's > >>>> mounted, we could always say "you shouldn't have done that!", with > >>>> a network connection, it can happen very often - rebooting the target, a > >>>> network hickup, etc. > >>>> =20 > >>>> So, any ideas? > >>> In my opinion it should be done this way: > >>> > >>> You have a queue of I/O requests. You send the to the other end and wait > >>> for confirmation. Until confirmation is received, you keep the requests > >>> queued. If the other end dies, you try to reconnect (until some timeout > >>> expires, the processes which send those requests will just wait), if you > >>> reconnect successfully, you resend not-confirmed requests, if you won't > >>> be able to reconnect, you just pass the errors up. > >>> > >>> This is what I did in ggate and it seems to work. > >> That is basically what i'm doing - unacked request get requed. > >> the problem I fear (and maybe I'm paranoid :-): > > > > Paranoia is a Good Thing(TM) in data storage land :-) > > > >> assume the following scenario, the client(initiator) sends a write command, > >> the target acks it, then it crashes, if the write was never completed, > >> the initiator goes on as nothing ever happened. > > > > Yes, but what can the initiator do about that? I mean, it does not have any > > visibility of what the target has (or has not) done with the data. ' > > > > This is roughly the same as a RAID box accepting a write into a writeback cache > > and ACK-ing to the host. You can only assume that the RAID box' cache > > will get flushed to the spindles properly. All the usual horror scenarios > > with a broken battery backup of the cache and a powerfailure etc apply here. 
> >
> > Wilko
> >
>
> I forget, does iSCSI have a concept of a flush_cache command, or the
> equivalent of what parallel SCSI does with ordered tags?

Not really, or I can't find it. iSCSI is mainly an envelope for
SCSI commands, so whatever CAM does, it will pass it on.
There are some management commands, so the target can tell the initiator
that it's going down, for example (and what should the driver
do in such a case in FreeBSD?)

> If so, then
> that's how your app or OS knows that the transaction got committed to
> stable storage. It's been long assumed in the external storage world
> that you are at the mercy of the external storage cache, so the problem
> that Danny is referring to is nothing new. The real question is how
> to implement the equivalent mechanism that iSCSI provides in a way that
> the OS/app can make use of it. For example, CAM issues an ordered tag
> periodically to flush the disk cache to stable storage.

Nice (or wishful thinking :-); the SCSI part of iSCSI is/can be
software/virtual.

> Most storage
> drivers, including CAM, will issue some sort of a flush_cache command to
> the controller and media during system shutdown.

This took me a long time to fix! The userland program got killed at shutdown,
the link was lost, and so there was no way to flush buffers. Fixed by calling
fget(...) too.

I guess I can summarize (and use the 3 monkey law :-):
	1- assume the target is 'well behaved' and will flush its cache.
	2- there is, currently, no way to tell the OS that not all
	   is as expected.
	3- keep quiet and hope for the best.

danny From owner-freebsd-scsi@FreeBSD.ORG Sat Jan 13 17:42:56 2007 Return-Path: X-Original-To: freebsd-scsi@freebsd.org Delivered-To: freebsd-scsi@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 993FD16A412; Sat, 13 Jan 2007 17:42:56 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from pooker.samsco.org (pooker.samsco.org [168.103.85.57]) by mx1.freebsd.org (Postfix) with ESMTP id 200CC13C459; Sat, 13 Jan 2007 17:42:56 +0000 (UTC) (envelope-from scottl@samsco.org) Received: from phobos.samsco.home (phobos.samsco.home [192.168.254.11]) (authenticated bits=0) by pooker.samsco.org (8.13.4/8.13.4) with ESMTP id l0DHgSel038730; Sat, 13 Jan 2007 10:42:33 -0700 (MST) (envelope-from scottl@samsco.org) Message-ID: <45A91A02.906@samsco.org> Date: Sat, 13 Jan 2007 10:42:26 -0700 From: Scott Long User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.2pre) Gecko/20061227 SeaMonkey/1.1 MIME-Version: 1.0 To: Danny Braniss References: <20070112195549.GA77181@freebie.xs4all.nl> <45A7F6A4.4030707@samsco.org> In-Reply-To: X-Enigmail-Version: 0.94.1.2 X-Enigmail-Version: 0.94.1.2 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH authentication, not delayed by milter-greylist-2.0.2 (pooker.samsco.org [168.103.85.57]); Sat, 13 Jan 2007 10:42:33 -0700 (MST) X-Spam-Status: No, score=-1.4 required=3.8 tests=ALL_TRUSTED autolearn=failed version=3.1.1 X-Spam-Checker-Version: SpamAssassin 3.1.1 (2006-03-10) on pooker.samsco.org Cc: Wilko Bulte , Pawel Jakub Dawidek , freebsd-hackers@freebsd.org, freebsd-scsi@freebsd.org Subject: Re: iSCSI disconnects dilema X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Jan 2007 17:42:56 -0000 Danny Braniss wrote: >> I forget, does iSCSI have 
a concept of a flush_cache command, or the
>> equivalent of what parallel SCSI does with ordered tags?
>
> not realy - or I can't find it. iSCSI is mainly and envelope for
> scsi commands, so whatever the CAM does, it will pass it on.
> There are some managemenet commands, so the target can tell the initiator
> that it's going down for example (and what should the driver
> do in such a case in freebsd?)
>

If the periph is open (i.e. mounted), I'd just ignore this and have the
stack go through a normal retry timeout cycle to see if the device comes
back. If it's closed, then I'd remove the periph. Knowing if it's
opened or closed is likely hard to do from the iSCSI driver, which is
one reason why iSCSI knowledge needs to eventually be moved upwards in
CAM.

>> If so, then
>> that's how your app or OS knows that the transaction got committed to
>> stable storage. It's been long assumed in the external storage world
>> that you are at the mercy of the external storage cache, so the problem
>> that Danny is referring to is nothing new. The real question is how
>> to implement the equivalent mechanism that iSCSI provides in a way that
>> the OS/app can make use of it. For example, CAM issues an ordered tag
>> periodically to flush the disk cache to stable storage.
> nice, (or wishful thinking :-), the scsi part of iSCSI is/can be
> software/virtual.
>

If the target device returns a successful completion from a command, the
initiator must assume that it's not lying. You could do a flush/sync
cache command after every I/O, but then you'd have a completely
unacceptable level of performance. But again, this is not a new problem
specific to iSCSI. It's long been a design consideration of external
storage, and is why external storage 1) carries a high price tag to
accompany good engineering and testing, and 2) comes with some form of
battery backup, to prevent data loss in case of power loss.

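The trade-off between flushing after every I/O and flushing periodically can be made concrete with a toy model. The class below is a hypothetical Python sketch that only counts commands; it does not measure performance and does not model any real driver. `flush_every` and the other names are assumptions for the example.

```python
class WriteLog:
    """Toy model: count flushes and track acknowledged writes not yet flushed."""

    def __init__(self, flush_every):
        self.flush_every = flush_every  # 1 means "flush after every write"
        self.unflushed = 0              # writes acknowledged but possibly still in cache
        self.flushes = 0                # flush/sync cache commands issued so far
        self.max_exposed = 0            # worst-case number of writes at risk at once

    def write(self):
        self.unflushed += 1
        self.max_exposed = max(self.max_exposed, self.unflushed)
        if self.unflushed >= self.flush_every:
            self.flush()

    def flush(self):
        self.flushes += 1
        self.unflushed = 0


def run(flush_every, writes=1000):
    log = WriteLog(flush_every)
    for _ in range(writes):
        log.write()
    return log.flushes, log.max_exposed
```

For 1000 writes, `run(1)` issues 1000 flushes with at most one write exposed, while `run(64)` issues 15 flushes but can have up to 64 acknowledged writes sitting in the target's cache. The second policy only shrinks the window; it still relies on the target not lying about completion.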
>> Most storage >> drivers, including CAM, will issue some sort of a flush_cache command to >> the controller and media during system shutdown. > > this took me a long time to fix! the userland program got killed at shutdown, > the link was lost, and so there was no way to flush buffers, fixed by calling > fget(...) too. > > I guess I can summarize: (and use the 3 monkey law :-) > 1- assume the target is 'well behaved' and will flush cache. > 2- there is - currently - no way to tell the OS that not all > seems to be as expected. > 3- keep quiet and hope for the best. > danny > > So you had a scenario where a program was doing I/O right up to system (initiator) shutdown, and some of those I/O's got lost in the process? I guess I don't understand why the OS didn't flush all outstanding I/O buffers after terminating the program and before finishing the shutdown. Maybe you are doing something illegal in your driver, or maybe you need to implement a kernel shutdown hook that will allow you to block the shutdown until everything is flushed. Scott From owner-freebsd-scsi@FreeBSD.ORG Sat Jan 13 18:53:54 2007 Return-Path: X-Original-To: freebsd-scsi@freebsd.org Delivered-To: freebsd-scsi@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 9DC5116A403; Sat, 13 Jan 2007 18:53:54 +0000 (UTC) (envelope-from gibbs@scsiguy.com) Received: from ns1.scsiguy.com (mail.scsiguy.com [70.89.174.89]) by mx1.freebsd.org (Postfix) with ESMTP id 56B0A13C428; Sat, 13 Jan 2007 18:53:54 +0000 (UTC) (envelope-from gibbs@scsiguy.com) Received: from [70.89.174.89] (www.scsiguy.com [70.89.174.89]) by ns1.scsiguy.com (8.13.8/8.13.8) with ESMTP id l0DII5h0015549; Sat, 13 Jan 2007 11:18:05 -0700 (MST) (envelope-from gibbs@scsiguy.com) Message-ID: <45A9225D.4080907@scsiguy.com> Date: Sat, 13 Jan 2007 11:18:05 -0700 From: "Justin T. 
Gibbs" User-Agent: Thunderbird 1.5.0.8 (X11/20061214) MIME-Version: 1.0 To: mjacob@freebsd.org References: <20070104225519.Q92958@ns1.feral.com> <459E8AE7.90104@samsco.org> <20070105093930.Y34456@ns1.feral.com> <459E97E6.4000603@samsco.org> <459E989C.2020602@samsco.org> <20070105103431.A34456@ns1.feral.com> <20070105104021.D34456@ns1.feral.com> In-Reply-To: <20070105104021.D34456@ns1.feral.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: freebsd-scsi@freebsd.org Subject: Re: CAM rescanner thread? X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 13 Jan 2007 18:53:54 -0000

> Actually, no. Now that I think about it and look at the code in
> cam_xpt.c, AC_FOUND_DEVICE seems to have a different semantic. It
> seems to be an announcement to all periph's who care *after* the
> device has been probed and configured.

Yes.

> If you look at xpt_async itself, it walks existing target and device
> entries delivering the async_code. Even if the path for the async
> event is a wildcard, it still needs a cam_ed to deliver something
> to.

There are wildcard cam_ed's in the tree that allow callbacks to be
registered for events that happen at any level - even a fully wildcarded
path.

> The broadcast async stuff appears like it is *thinking* about having
> this done. In fact, code in da (daasync) seems to want to do this-
> but it requires initial inquiry data (via a ccb_getdev argument)
> which really makes me scratch my head a bit.

AC_FOUND_DEVICE should only be issued once the transport layer believes
that a device is configured sufficiently to be used.

> This is a prime example of how not having a mind-meld with Ken or
> Justin really hurts.
> We can ask them what they were thinking about
> this, and it'll probably make sense, but because this isn't all
> very highly documented the architecture is often what you *guess*
> it is :-).

When the CAM code for FreeBSD was originally written, CAM-3 was in
development but not quite out yet. The draft documents contain some
fledgling support for dynamic configuration and binding operations
that disassociate physical from logical addressing. You can still get
CAM-3 here:

http://www.t10.org/ftp/t10/drafts/cam3/cam3r03.pdf

Its discovery and bind CCB types may be a good starting point for
addressing these issues.

With the discovery process moved to a thread and some augmentation to
XPT_SCAN_*, we should be good enough for now. The only tricky part
about using CCBs to initiate scanning is that they potentially require
the allocation of memory from interrupt context. It would be nice to
provide a service to all SIMs that can perform dynamic discovery such
that they have a high probability of obtaining their CCB in these
situations.

--
Justin
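One way to read the "service to all SIMs" idea above is a small reserve of CCBs allocated ahead of time, so that a scan request raised from interrupt context never has to call the allocator. The following is a hypothetical Python sketch of that pattern only; `CcbReserve` and its methods are invented for illustration and do not correspond to any existing CAM interface.

```python
class CcbReserve:
    """Hypothetical pool of preallocated CCB-like objects for callers that cannot sleep."""

    def __init__(self, size):
        # Filled up front, from a context where allocation is allowed to block.
        self.free = [{"in_use": False} for _ in range(size)]

    def get_nowait(self):
        """Called from interrupt-like context: never allocates, may return None."""
        if not self.free:
            return None          # caller has to defer or drop the scan request
        ccb = self.free.pop()
        ccb["in_use"] = True
        return ccb

    def release(self, ccb):
        """Hand the CCB back once the scan request has completed."""
        ccb["in_use"] = False
        self.free.append(ccb)
```

The reserve gives a high probability of success rather than a guarantee: if more rescans are outstanding than the pool holds, `get_nowait()` returns `None` and the caller still needs a fallback.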