Message-Id: <200506281258.j5SCw46l010235@dungeon.home>
From: Stephen McKay
To: Peter Jeremy
Cc: stable@freebsd.org, Stephen McKay
Date: Tue, 28 Jun 2005 22:58:04 +1000
Subject: Re: Data corruption in cd9660 on FreeBSD 4.11?
References: <200506241231.j5OCV6jp047730@dungeon.home> <20050624213433.GA50157@cirb503493.alcatel.com.au>
In-Reply-To: <20050624213433.GA50157@cirb503493.alcatel.com.au> from Peter Jeremy at "Sat, 25 Jun 2005 07:34:33 +1000"
List-Id: Production branch of FreeBSD source code

I haven't finished all the suggested tests, but since I'm taking so long
to do so, I thought I should send what I have so far.

On Saturday, 25th June 2005, Peter Jeremy wrote:

>On Fri, 2005-Jun-24 22:31:06 +1000, Stephen McKay wrote:
>>I'm experiencing data corruption when reading CDs and DVDs on FreeBSD 4.11.
>...
>>So, can anyone suggest any more tests I could try? Or is there a kind of
>>hardware fault that could cause this substitution of whole blocks read from
>>CDs without causing any other problems?
>
>You might like to post the relevant sections of a verbose boot - the
>ATA and CD probes.

I've appended it to this message, so that the flow is not ruined. Note
that I am not currently using ATAPI-CAM for my tests. I am using
/dev/acd0a and /dev/acd1a to mount the CDs in the DVD-ROM and DVD-R
respectively. Also, the "non-ATA66 cable" thing is true; it is a plain
ATA33 cable.

>Are you running the CD/DVD drives in PIO or UDMA modes?

I normally run both DVD drives at UDMA33. My test runs normally fail
every 2nd or 3rd run. I've seen it do 5 OK runs in a row once though,
so I don't yet have a very good test.

I tested with PIO4 and ran 12 consecutive test runs without error. It
was a little slower at 150 seconds per run instead of the normal 135,
possibly because 75% to 80% of the CPU was dedicated to interrupt
handling (doing PIO, I assume). It seems that either DMA or ATA
interrupts (or maybe both) are required to cause the problem.

Also, I tried some tests with the noclusterr mount option on the CD.
The test ran much slower (approx 232 seconds instead of 135) but I also
saw no failures (with only 6 test runs though, as I was pressed for
time). The noclusterr option is interesting because it defeats read
clustering, resulting in the ATA driver issuing only 2K reads instead
of up to 64K at a time. I assume that the 64K reads would require
scatter-gather DMA, so maybe this is relevant to the problem.

Oddly, I noticed that a fixed value of 65534 is found in atapi-all.c as
a request size limit. No, not 65536 = 2^16, but 2 bytes less. Puzzling.
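
To make the sizes above concrete, here is a rough sketch of the arithmetic.
It is not taken from the driver. It assumes 4096-byte pages, 2048-byte CD
sectors, and the worst case where no two pages of the buffer are physically
contiguous; the function name sg_segments is made up for illustration. The
only thing it says about 65534 is that it equals 0xFFFE, the largest even
value that fits in 16 bits. Whether that is why atapi-all.c uses it is a
guess on my part.

```python
PAGE_SIZE = 4096          # assumed page size (i386)
SECTOR_SIZE = 2048        # CD data sector; the read size under noclusterr
CLUSTER_MAX = 64 * 1024   # largest clustered read mentioned above
ATAPI_LIMIT = 65534       # request size limit seen in atapi-all.c


def sg_segments(offset_in_page, length, page_size=PAGE_SIZE):
    """Split a buffer into per-page segment lengths.

    Worst-case assumption: every page is physically discontiguous, so
    each page touched by the buffer needs its own DMA segment.
    """
    segments = []
    while length > 0:
        chunk = min(length, page_size - offset_in_page)
        segments.append(chunk)
        length -= chunk
        offset_in_page = 0  # every later segment starts on a page boundary
    return segments


if __name__ == "__main__":
    print("2K read, aligned:    ", len(sg_segments(0, SECTOR_SIZE)), "segment(s)")
    print("64K read, aligned:   ", len(sg_segments(0, CLUSTER_MAX)), "segment(s)")
    print("64K read, offset 512:", len(sg_segments(512, CLUSTER_MAX)), "segment(s)")
    print("limit in hex:        ", hex(ATAPI_LIMIT))
    print("whole sectors under the limit:", ATAPI_LIMIT // SECTOR_SIZE)
```

Under these assumptions a sector-aligned 2K read fits in a single segment,
while a 64K read needs 16 segments, or 17 if the buffer does not start on a
page boundary. That is the difference between the noclusterr runs and the
normal ones as far as the DMA setup is concerned.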

>Have you tried anything other than ISO9660 filesystems on a physical CD?

I have not tried anything but cd9660 file systems on CDs and DVDs. I
will see if I can build a UFS file system to test with, when I get a
chance.

>What happens if you just dd the CD-ROM?

When I dd the CD-ROM it seems to work correctly. I have done this only
infrequently, however, so I may just have been lucky not to see a
failure.

I've now done 6 consecutive dd reads of my test CD-ROM in UDMA33 mode
with no errors. It only takes 125 seconds, so it's a bit faster than
comparing directory trees. Six tests isn't many, so I'll do more later,
this time with other system activity.

>What happens if you use a vnode
>mount (see vnconfig(8)) of an ISO filesystem sitting in a UFS filesystem?

I'll test this when I get a chance.

>Anything unusual in your kernel config file?

Nothing too unusual. I'm running a uniprocessor kernel with HTT
disabled. I skimmed through my config and this is the only interesting
thing:

    HZ=500

I don't think that's too dangerous. On the other hand, it does increase
the rate of interrupts, and if there's a race somewhere, it may make it
worse.

>Have you tried building a kernel with WITNESS and/or DIAGNOSTIC?

I'm now running with INVARIANTS, INVARIANT_SUPPORT, and DIAGNOSTIC on
4.11. No change in the failure rate and no significant slowdown either.

>Any chance of you repeating the tests with a 5.x system? Maybe
>on a spare small partition or using a 5.4-RELEASE disk1 as a live
>filesystem.

I was experimenting with current in late April, so I installed that
drive for testing. So far, I have not been able to reproduce the
failure on April's current, though I've only had time for a quick run
of 6 repetitions. Current takes the same time (135 seconds, on average)
to read and compare the CD. That seems good, considering all the
debugging is still enabled. I'm pretty sure that ATA MK III is in this
kernel.
Sadly, it panics immediately if I run "atacontrol mode 1" so I'm just
assuming it is running in DMA mode by the speed of it. (And I have
hw.ata.atapi_dma=1 in /boot/loader.conf.)

That's where I'm up to so far in stress testing. Right now I'm trying
to understand some unusual looking code in ata_dmasetupd_cb() in 4.11's
ata-dma.c. The attached comment is "A maximum segment size was
specified for bus_dma_tag_create, but some busdma code does not seem to
honor this, so fix up if needed." The "fix-up" code seems to be gone in
current, so it looks suspicious to me. When I work out what it does,
I'll report back.

Stephen.

------------------------------------------------------------------

Verbose boot of 4.11-p10 (the ata related parts, at least):

atapci0: port 0xfc00-0xfc0f,0-0x3,0-0x7,0-0x3,0-0x7 irq 0 at device 31.1 on pci0
ata0: iobase=0x01f0 altiobase=0x03f6 bmaddr=0xfc00
ata0: mask=03 ostat0=50 ostat2=00
ata0-master: ATAPI 00 00
ata0-slave: ATAPI 00 00
ata0: mask=03 stat0=50 stat1=00
ata0-master: ATA 01 a5
ata0: devices=01
ata0: at 0x1f0 irq 14 on atapci0
ata1: iobase=0x0170 altiobase=0x0376 bmaddr=0xfc08
ata1: mask=03 ostat0=50 ostat2=50
ata1-slave: ATAPI 14 eb
ata1-master: ATAPI 14 eb
ata1: mask=03 stat0=00 stat1=00
ata1: devices=0c
ata1: at 0x170 irq 15 on atapci0
pci0: (vendor=0x8086, dev=0x24d3) at 31.3 irq 10
ata-: ata0 exists, using next available unit number
ata-: ata1 exists, using next available unit number
Trying Read_Port at 203
Trying Read_Port at 243
Trying Read_Port at 283
Trying Read_Port at 2c3
Trying Read_Port at 303
Trying Read_Port at 343
Trying Read_Port at 383
Trying Read_Port at 3c3
isa_probe_children: disabling PnP devices
isa_probe_children: probing non-PnP devices
orm0: