From: Chris Gilbert
To: freebsd-current@freebsd.org
Date: Sat, 13 Aug 2005 23:21:36 +0200
Subject: Re: Panic during install on Sparc64 - Only with large HDD
Message-Id: <200508132321.37654.Chris@LainOS.org>
Reply-To: Chris@LainOS.org

Well, I've continued looking into this problem as I really _really_ want
to see it fixed for 6.0-RELEASE. I did some general device stress-testing
to make sure that it was directly triggerable and reproducible, and not
just an intermittent failure.

I have successfully created, and installed FreeBSD on (without any errors):

    /dev/ad0a
    /dev/ad0b
    /dev/ad0c
    /dev/ad0d
    /dev/ad0e
    /dev/ad0f

Even though the newfs on it failed, creating the slice itself worked for
my large partition (/dev/ad0g). Therefore, I can dd data to it, but I
can't write a UFS filesystem to it in order to mount it.

I then went about writing data to this partition for long periods of time
to try and hit the problem:

    # time dd if=/dev/urandom of=/dev/ad0g
    143337401+0 records in
    143337401+0 records out
    73388749312 bytes transferred in 89392.318911 secs (820974 bytes/sec)
    614.444u 41826.640s 24:49:52.35 47.4% 244+1708k 0+0io 0pf+0w

After this ran without a single error for about 20 hours, I stopped it and
started trying to hit the block that triggered the issue manually. After a
few hours of "doubling and halving" the seek offset, I finally managed to
find the block:

    # dd count=1 obs=1024 seek=93321655 if=/dev/urandom of=/dev/ad0g
    1+0 records in
    0+1 records out
    512 bytes transferred in 0.001470 secs (348278 bytes/sec)

This one was successful...
but the very next one:

    # dd count=1 obs=1024 seek=93321656 if=/dev/urandom of=/dev/ad0g
    ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=268435456
    ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=268435456
    ad0: FAILURE - WRITE_DMA timed out LBA=268435456
    dd: /dev/ad0g: Input/output error
    1+0 records in
    0+0 records out
    0 bytes transferred in 16.453833 secs (0 bytes/sec)

And incrementing this by one block shows:

    # dd count=1 obs=1024 seek=93321657 if=/dev/urandom of=/dev/ad0g
    ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=268435458
    ad0: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=268435458
    ad0: FAILURE - WRITE_DMA timed out LBA=268435458
    dd: /dev/ad0g: Input/output error
    1+0 records in
    0+0 records out
    0 bytes transferred in 16.452722 secs (0 bytes/sec)

This makes perfect sense, because the output block size I gave dd is 1024
bytes and the default block size is 512 bytes. Therefore, incrementing the
seek by a single 1024-byte block lands 2 sectors further along in the LBA:

    ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=268435456
    (then...)
    ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=268435458

Bingo! We've finally found the wall!

I'm going to look further into the IDE chipset (atapci0: ) tonight, both
its whitepapers (to see if it has some sort of quirk or limitation around
this area) and its FreeBSD driver, to see if something funky is going on.

As I said before, if anyone is interested in helping me resolve this I
would appreciate it greatly. This is a bug which has haunted me and
several others since FreeBSD 5.2-RC2 and it needs to be fixed.

--
Thanks,
Chris (Lance) Gilbert
Ph: +45 33 73 29 31 (UTC +0100)
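
The seek-to-LBA arithmetic in the message can be checked with a short sketch. It assumes 512-byte sectors and that dd's seek= value counts obs-sized (1024-byte) blocks from the start of /dev/ad0g. The partition's starting sector is not given in the message, so it is inferred here from the first failing data point. The names (seek_to_lba, partition_start, LBA28_LIMIT) are illustrative only. One thing the numbers show: 268435456 is exactly 2^28, the number of sectors addressable with 28-bit LBA. Whether that boundary is the actual cause of the timeouts is not established in the message.

```python
SECTOR_SIZE = 512        # bytes per disk sector (assumed; matches dd's default block size)
OBS = 1024               # output block size given to dd (obs=1024)
LBA28_LIMIT = 2 ** 28    # sectors addressable with a 28-bit LBA

def seek_to_lba(seek_blocks, partition_start):
    """Translate a dd seek= value (in OBS-sized blocks) to an absolute disk LBA."""
    return partition_start + seek_blocks * OBS // SECTOR_SIZE

# Assumption: the partition start is inferred from one reported failure
# (seek=93321656 -> LBA=268435456); it is not stated in the message.
FAILING_SEEK = 93321656
FAILING_LBA = 268435456
partition_start = FAILING_LBA - FAILING_SEEK * OBS // SECTOR_SIZE

for seek in (93321655, 93321656, 93321657):
    lba = seek_to_lba(seek, partition_start)
    side = "at or past 2^28" if lba >= LBA28_LIMIT else "below 2^28"
    print(f"seek={seek} -> LBA={lba} ({side})")
```

With the inferred offset, seek=93321657 maps to LBA 268435458, matching the second error report, and the last successful write (seek=93321655) maps to LBA 268435454, just below 2^28.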