From: Scott Long
Date: Thu, 16 Sep 2004 14:17:56 -0600
To: Kevin Oberman
Cc: Mike Jakubik, DanGer, current@freebsd.org, Søren Schmidt
Subject: Re: ad0: TIMEOUT - READ_DMA retrying (2 retries left) LBA=207594611
Message-ID: <4149F4F4.1020007@samsco.org>
In-Reply-To: <20040916183641.3A0FB5D04@ptavv.es.net>

Kevin Oberman wrote:
>>Date: Wed, 15 Sep 2004 15:05:34 -0600
>>From: Scott Long
>>Sender: owner-freebsd-current@freebsd.org
>>
>>Søren Schmidt wrote:
>>
>>>Mike Jakubik wrote:
>>>
>>>>Søren Schmidt said:
>>>>
>>>>>You are having massive ICRC problems which are different and most likely
>>>>>due to bad cables/connectors or cables that are turned around (blue
>>>>>connector at controller, black/grey at devices), or it can be a
>>>>>weak/overloaded PSU.
>>>>>
>>>>
>>>>This is a different error message from what everyone else, including
>>>>me is reporting. What about the errors we are getting?
>>>
>>>I have no idea, I can't reproduce the problem at all. However I suspect
>>>something else is blocking interrupt delivery but it's just a hunch...
>>>
>>>-Søren
>>>
>>
>>I'm finding it hard to imagine a scenario where a timeout could fire but
>>not a hardware interrupt. Nothing usually shares the interrupt vectors
>>with ATA, so it's pretty unlikely that the ata ithread is being blocked
>>by anything but itself.
>
> This sounds reasonable, but I can make the problem start/stop by
> starting/stopping the network card. No problems in single-user. Then I
> 'ifconfig xl0 192.116.1.1' and immediately start getting the errors. I
> also get watchdog timeouts on xl0. 'ifconfig xl0 down' stops the errors.
> xl0 is on IRQ10, ata1 is on IRQ15. I have a K6 processor in an ASUS P5A
> with neither SMP nor APIC. (I am running ACPI, not that there is much to
> it on this system.)
>
> While I don't entirely discount the possibility that this is in ata, it
> seems odd that I get no errors even doing a buildworld as long as the
> network is off.
>
> This started pretty recently, but changes have been made in the period
> of suspicion to the scheduler, ACPI, and ata, so it's still fuzzy. My
> system gets the errors consistently enough that I will try to narrow
> down what patch caused the problem. (Wish it was a bit faster to build
> kernels, though!) I have a feeling in the pit of my stomach that it's
> going to show up at with a scheduler patch MT5, but I hope I'm wrong! I
> think I'd prefer an ATA problem to a scheduler issue. (Of course, Søren
> probably has a differing opinion on this.)

ATA commands are completed either in the bio_taskqueue or in a normal
taskqueue. The bio_taskqueue runs in the g_up kthread, while normal
taskqueues run in an swi kthread that multiplexes all of the registered
tasks. Network drivers that are registered with IFF_NEEDSGIANT use a
taskqueue to help decouple the locking, and it could be that they are
stalling other tasks from running. This doesn't seem to be the case
with xl(4). However, the normal path for completing commands is the
bio_taskqueue, which should have no interaction at all with the network
side. So either something else in the network stack is using a
taskqueue and using it pretty inefficiently, or preemption in general
is causing g_up to not run as often as it should.

I think that the untimeout of each command should be done in the
interrupt handler and not in the taskqueue/bio_taskqueue. Tasks don't
get lost out of either and will eventually run (or you'll have much
bigger problems if they don't) no matter what, so it's misleading to
say that a command timed out when really the hardware responded but the
system didn't get to the taskqueue fast enough.

In general I don't like taskqueues anyways because they are
non-deterministic; they really are not good for anything that is
time-critical. Once we move to a scheme where each device instance has
its own ithread (i.e. no more sharing), there won't be a need for
taskqueues except to handle unusual/expensive and non-time-critical
events.

Scott
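
To illustrate the untimeout argument above, here is a toy Python model of
a single command. It is only a sketch, not kernel code and not how ata(4)
is written: the function simulate() and all of its parameters are
hypothetical names invented for this example. It assumes a command issued
at t=0, a hardware interrupt at irq_time, a deferred completion task that
runs at task_time, and a per-command timeout armed for time `timeout`. The
only thing that varies is where the timeout gets disarmed.

```python
def simulate(irq_time, task_time, timeout, cancel_in_interrupt):
    """Replay one command issued at t=0 and return its event log.

    All names here are hypothetical; this models the idea only.
    irq_time / task_time may be None if that event never happens.
    cancel_in_interrupt selects where the timeout is disarmed:
      True  -> in the interrupt handler
      False -> only in the deferred completion task
    """
    events = [
        (irq_time, "interrupt"),
        (task_time, "task"),
        (timeout, "timeout"),
    ]
    # Drop events that never occur, then process in time order.
    events = sorted((e for e in events if e[0] is not None),
                    key=lambda e: e[0])

    armed = True  # the per-command timeout starts out armed
    log = []
    for when, kind in events:
        if kind == "interrupt":
            log.append((when, "hardware completed"))
            if cancel_in_interrupt:
                armed = False
        elif kind == "task":
            log.append((when, "completion task ran"))
            armed = False
        elif kind == "timeout" and armed:
            log.append((when, "TIMEOUT reported"))
    return log


# Hardware answers at t=1, but the completion task is starved until t=7,
# past the t=5 timeout.
for policy in (False, True):
    print("cancel in interrupt handler:", policy)
    for when, what in simulate(1, 7, 5, cancel_in_interrupt=policy):
        print("  t=%d  %s" % (when, what))
```

With the timeout disarmed only in the deferred task, the model reports a
timeout even though the hardware answered long before it. Disarming in
the interrupt handler leaves a timeout report only for the case where the
hardware really never responded.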