From: Scott Long
Date: Tue, 08 May 2007 17:08:01 -0600
To: Barrett Lyon
Cc: adam radford, freebsd-current@freebsd.org
Subject: Re: Functional RAID controller?
Message-ID: <464102D1.2000706@samsco.org>
In-Reply-To: <602A8820-F05C-457A-A20A-E258BD0FEDC5@blyon.com>

Barrett Lyon wrote:
>> If you have "a good idea what's wrong with the twa driver", would you
>> mind sharing a stack trace or other information? So far I have only
>> been told that "system hangs when I do heavy I/O". This is _not_
>> reproducible here. Have you run memtest86 on the machine? Have you
>> run a PCI analyzer on your machine to see who is on the PCI bus
>> before/during the hang?
>
> We have done everything, including asking to bring the machines that
> are crashing to AMCC's offices, which are down the street. I have not
> been doing the technical debugging, but a few members of AMCC's staff
> have been trying to help. We've been running memtest, etc. When the
> machines hang there are no debugging options; the system is completely
> frozen without any details pointing to why. It's not clear from that
> condition whether the problem is due to an unacknowledged interrupt or
> a mutex deadlock of some sort. We are assuming that in this case it is
> due to the driver trying to do work assuming the interrupt is valid and
> getting stuck, or returning early before the interrupt is acknowledged,
> causing it to trigger over and over and over.
>
> If you want to see it reproduced, we are more than happy to provide you
> two machines that both have this condition.
>
>> You claim the hang doesn't happen on the 6.2 series twa driver. The
>> driver changes between the 6.x and 7.x twa driver are _very_ minimal:
>> some simple time keeping changes, and some XPT_* path inquiry handling
>> changes.
>
> Under 6.x the systems as built are completely stable.
>
>> I am really surprised that you are trying to design servers around the
>> FreeBSD unstable kernel.
>
> There are other reasons for this which I don't want to discuss here,
> but the other components we are using work very well within 7.0 and we
> have a lot of performance gains that make it worth using a development
> kernel. The 10GbE drivers like mxge are having a lot of development
> work done in HEAD, and as a result 6.x is getting left behind on some
> of the work we are doing. At the very least, I want to make sure I
> deploy hardware that will function beyond 6.x.
>
> -Barrett

The biggest difference between 7-CURRENT and 6-STABLE right now in this
space is the MPSAFE work in CAM. It should have been a complete NO-OP
for the 3ware driver, but it's always possible that either I overlooked
something, or the driver was already doing something unsafe that went
unnoticed before and is now being caught.

I'll look at this tonight, as well as look at committing the update
that Adam mentioned (sorry Adam!). My 3ware hardware inventory is very
limited, so if I can't spot the problem by code inspection then I'll
need to work with you and Adam to help narrow it down.

Scott