Date: Sun, 17 Feb 2008 07:22:43 -0800 From: Jeremy Chadwick <koitsu@freebsd.org> To: "Carlos A. M. dos Santos" <unixmania@gmail.com> Cc: freebsd-stable@freebsd.org Subject: Re: ATA -- erratic behaviour when removing disk Message-ID: <20080217152243.GA51742@eos.sc1.parodius.com> In-Reply-To: <e71790db0802170603h50b211d7w8261f02ea558fa6c@mail.gmail.com> References: <20080216210736.GA17517@eos.sc1.parodius.com> <e71790db0802161508p7004144dy817e59d7e454cc43@mail.gmail.com> <20080217061905.GA35005@eos.sc1.parodius.com> <e71790db0802170603h50b211d7w8261f02ea558fa6c@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, Feb 17, 2008 at 11:03:31AM -0300, Carlos A. M. dos Santos wrote: > On Feb 17, 2008 3:19 AM, Jeremy Chadwick <koitsu@freebsd.org> wrote: > > On Sat, Feb 16, 2008 at 09:08:38PM -0200, Carlos A. M. dos Santos wrote: > > > On Feb 16, 2008 7:07 PM, Jeremy Chadwick <koitsu@freebsd.org> wrote: > > > > Is anyone aware of the situation where FreeBSD behaves erratically when > > > > a disk is physically removed without "atacontrol detach ataX" being run > > > > prior to removal (at least on RELENG_7)? > > > > > > Yes, I have seen this since 4.5, IIRC. > > > > Wonderful. > > > > > > Also FWIW: I also tested all this for comparison on Ubuntu Linux earlier > > > > this morning. I was able to yank the disk in the middle of an I/O > > > > operation, resulting in an immediate I/O error from dd. I took no > > > > precautions prior to yanking the disk. Upon reinsertion, the system > > > > found the disk and I could continue I/O operations on it as if it had > > > > never been removed. Only reason I'm pointing this out is that it > > > > confirms the issue isn't hardware or with vendor implementation, but > > > > rather specific to the OS. > > > > > > Congratulations to the Linux folks. Or not, since this looks like a > > > very risky behavior. Who warrants you that the *same* disk was plugged > > > back? Blindly continuing to write could easily corrupt the contents of > > > the second drive. > > > > I'm not sure I understand. There were no filesystems on the drive, and > > nothing mounted prior to removal: just like what I did with FreeBSD. > > The procedure: > > > > * Boot Ubuntu CD, get a shell > > * dd if=/dev/sdb of=/dev/null bs=8k > > * In the middle of I/O, yank the disk > > * dd exits with "I/O error" > > * System continued to be responsive; no ATA subsystem oddities > > * Reinserted disk; kernel saw the disk without any issue > > * dd if=/dev/sdb of=/dev/null bs=8k > > * I/O still operating as before; no system "oddities" > > I'm still not convinced that this is a valid use case. The fact of the matter is this: If while running FreeBSD, a disk (with no mounted filesystems) is lost or removed WITHOUT using "atacontrol detach" first, the entire ATA subsystem goes into a state of confusion, and remains broken until a reboot. Even a full reinit of the ATA channel does not solve the problem. If while running FreeBSD, someone uses "atacontrol detach channel" to disconnect an ATA channel (and thus the disk) PRIOR to the disk being lost or the disk being removed, FreeBSD behaves properly. Upon re- insertion of the disk, and "atacontrol attach", the disk is found on the controller. The reason I performed the same test on Linux is because I wanted to rule out "hardware issues" or "vendor implementation issues". By using another operating system, I've managed to confirm that there is indeed a bug within the ATA subsystem on FreeBSD. > > If you'd like, I can try inserting a completely different disk (both in > > size and vendor), but I really don't think anything odd will happen. If > > there were filesystems mounted or other whatnots, yes, I could see how > > there might be concern. [...] > > This is exctly what I was talking about when I said "risky". But this is a moot point anyways, because FreeBSD kernel panics when any storage medium is removed while a filesystem is still mounted. It doesn't matter if it's ATA, SCSI, USB umass, firewire, whatever. It panics. This is apparently by design, and thus I can't argue against it. But at least when there's a panic, the risk is significantly lower, since the system is dead in the water from that point on. > > [...] I can try that as well if you're interested. I > > am a bit curious to see what Linux does if I pull a disk that has > > mounted filesystems which are being accessed at the time. > > Yes, do it please. I must admit [precious brat mode on] that I'm very > curious about what would happen. :-) You got it: * Netbooted Ubuntu Server, spawned a shell * fdisk /dev/sda * Created a primary partition, all cylinders, type 0x83 (Linux) ~ # mke2fs /dev/sda1 {snip lots of output} ~ # mount /dev/sda1 /mnt ~ # dd if=/dev/urandom of=/mnt/testfile bs=8M count=1 1+0 records in 1+0 records out ~ # ls -l /mnt/testfile -rw-r--r-- 1 root root 8388608 Feb 17 15:09 /mnt/testfile ~ # md5sum /mnt/testfile 9745a2c5455d23be8bae300b7323ed49 /mnt/testfile ~ # while cp -f /mnt/testfile /mnt/testfile.copy; do sleep 0; done * Yanked disk cp: Write Error: Input/output error ~ # ls -l /mnt/testfile* ls: /mnt/testfile: Input/output error ls: /mnt/testfile.copy: Input/output error ~ # df -k Filesystem 1k-blocks Used Available Use% Mounted on tmpfs 1037424 40 1037384 0% /dev tmpfs 1037424 40 1037384 0% /.dev /dev/sda1 192292124 69004 182455200 0% /mnt ~ # dmesg {see output below my signature; just trying to keep this part clear} * Reattached disk; disk is now /dev/sdc, which apparently can happen based on what I've read about libata {see output below my signature} ~ # mkdir /mnt2 ~ # mount /dev/sdc1 /mnt2 [ 691.740671] EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended ~ # df -k Filesystem 1k-blocks Used Available Use% Mounted on tmpfs 1037424 40 1037384 0% /dev tmpfs 1037424 40 1037384 0% /.dev /dev/sda1 192292124 69004 182455200 0% /mnt /dev/sdc1 192292124 69204 182455000 0% /mnt2 ~ # cd /mnt2 /mnt2 # ls -l drwx------ 2 root root 16384 Feb 17 15:08 lost+found -rw-r--r-- 1 root root 8388608 Feb 17 15:09 testfile -rw-r--r-- 1 root root 6324224 Feb 17 15:12 testfile.copy * Then I decided to poke at /mnt a bit to see what would happen... ~ # ls -l /mnt ls: /mnt/lost+found: Input/output error ls: /mnt/testfile: Input/output error ls: /mnt/testfile.copy: Input/output error ~ # umount /mnt ~ # df -k Filesystem 1k-blocks Used Available Use% Mounted on tmpfs 1037424 40 1037384 0% /dev tmpfs 1037424 40 1037384 0% /.dev /dev/sdc1 192292124 69204 182455000 0% /mnt2 I don't see anything wrong with the above scenario. Yes, testfile.copy is smaller than the original, but that's expected -- cp exited with an I/O error, which is good, because that's what I'd expect to happen under such circumstances. If it blindly kept working, that WOULD be bad. Anything else you want me to try? I seem pretty satisfied with the reliability of this, which is why I would like to see said issue in FreeBSD addressed. -- | Jeremy Chadwick jdc at parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | [ 465.905156] ata3: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0x2 frozen [ 465.905160] ata3: (irq_stat 0x00400000, PHY RDY changed) [ 468.115595] ata3: soft resetting port [ 468.115602] ata3: SATA link down (SStatus 11 SControl 300) [ 468.115609] ata3: failed to recover some devices, retrying in 5 secs [ 473.117426] ata3: hard resetting port [ 473.841112] ata3: SATA link down (SStatus 0 SControl 300) [ 473.841119] ata3.00: limiting speed to UDMA/133:PIO3 [ 473.841122] ata3: failed to recover some devices, retrying in 5 secs [ 478.842942] ata3: hard resetting port [ 479.566632] ata3: SATA link down (SStatus 0 SControl 300) [ 479.566637] ata3.00: disabled [ 479.566642] sd 2:0:0:0: rejecting I/O to offline device [ 479.566646] sd 2:0:0:0: rejecting I/O to offline device [ 479.566652] Buffer I/O error on device sda1, logical block 0 [ 479.566655] lost page write due to I/O error on sda1 [ 479.566660] Buffer I/O error on device sda1, logical block 1 [ 479.566663] lost page write due to I/O error on sda1 [ 479.566751] sd 2:0:0:0: rejecting I/O to offline device [ 479.566753] Buffer I/O error on device sda1, logical block 1025 [ 479.566755] lost page write due to I/O error on sda1 [ 479.566757] sd 2:0:0:0: rejecting I/O to offline device [ 479.566759] Buffer I/O error on device sda1, logical block 1027 [ 479.566761] lost page write due to I/O error on sda1 [ 479.566764] sd 2:0:0:0: rejecting I/O to offline device [ 479.566772] sd 2:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK [ 479.566776] end_request: I/O error, dev sda, sector 65695 [ 479.566778] Buffer I/O error on device sda1, logical block 8204 [ 479.566780] lost page write due to I/O error on sda1 [ 479.567404] sd 2:0:0:0: rejecting I/O to offline device [ 479.574627] sd 2:0:0:0: rejecting I/O to offline device [ 479.574630] Buffer I/O error on device sda1, logical block 0 [ 479.574631] lost page write due to I/O error on sda1 [ 479.574749] EXT2-fs error (device sda1): read_block_bitmap: Cannot read block bitmap - block_group = 0, block_bitmap = 1025 [ 479.574814] sd 2:0:0:0: rejecting I/O to offline device [ 479.574889] sd 2:0:0:0: rejecting I/O to offline device [ 479.574892] Buffer I/O error on device sda1, logical block 0 [ 479.574894] lost page write due to I/O error on sda1 [ 479.575009] EXT2-fs error (device sda1): read_block_bitmap: Cannot read block bitmap - block_group = 0, block_bitmap = 1025 [ 480.070416] ata3: EH complete [ 480.070420] ata3.00: detaching (SCSI 2:0:0:0) [ 480.070560] Buffer I/O error on device sda1, logical block 0 [ 480.070562] lost page write due to I/O error on sda1 [ 480.070564] EXT2-fs error (device sda1): ext2_get_inode: unable to read inode block - inode=13, block=1027 [ 480.070590] Buffer I/O error on device sda1, logical block 1 [ 480.070592] lost page write due to I/O error on sda1 [ 480.070595] Buffer I/O error on device sda1, logical block 1025 [ 480.070597] lost page write due to I/O error on sda1 [ 480.072352] sd 2:0:0:0: [sda] Synchronizing SCSI cache [ 480.072757] sd 2:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK [ 480.072761] sd 2:0:0:0: [sda] Stopping disk [ 480.072867] sd 2:0:0:0: [sda] START_STOP FAILED [ 480.072869] sd 2:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK [ 507.467114] scsi 2:0:0:0: rejecting I/O to dead device [ 507.474543] scsi 2:0:0:0: rejecting I/O to dead device [ 507.474546] printk: 2 messages suppressed. [ 507.474549] Buffer I/O error on device sda1, logical block 0 [ 507.474599] lost page write due to I/O error on sda1 [ 507.474603] EXT2-fs error (device sda1): ext2_get_inode: unable to read inode block - inode=12, block=1027 [ 507.474683] scsi 2:0:0:0: rejecting I/O to dead device [ 507.474713] scsi 2:0:0:0: rejecting I/O to dead device [ 507.474715] Buffer I/O error on device sda1, logical block 0 [ 507.474716] lost page write due to I/O error on sda1 [ 507.474718] EXT2-fs error (device sda1): ext2_get_inode: unable to read inode block - inode=13, block=1027 [ 538.888948] scsi 2:0:0:0: rejecting I/O to dead device [ 538.888981] scsi 2:0:0:0: rejecting I/O to dead device [ 538.888984] Buffer I/O error on device sda1, logical block 0 [ 538.888986] lost page write due to I/O error on sda1 [ 538.888988] EXT2-fs error (device sda1): ext2_get_inode: unable to read inode block - inode=2, block=1027 After reattach: [ 583.767068] ata3: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0x2 frozen [ 583.767071] ata3: (irq_stat 0x00000040, connection status changed) [ 584.481189] ata3: soft resetting port [ 589.690932] ata3: port is slow to respond, please be patient (Status 0x80) [ 592.937529] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300) [ 592.941346] ata3.00: ATA-6: WDC WD2000JD-00HBB0, 08.02D08, max UDMA/133 [ 592.941349] ata3.00: 390721968 sectors, multi 0: LBA48 [ 592.946372] ata3.00: configured for UDMA/133 [ 592.946375] ata3: EH complete [ 592.946511] scsi 2:0:0:0: Direct-Access ATA WDC WD2000JD-00H 08.0 PQ: 0 ANSI: 5 [ 592.946616] sd 2:0:0:0: [sdc] 390721968 512-byte hardware sectors (200050 MB) [ 592.946673] sd 2:0:0:0: [sdc] Write Protect is off [ 592.946676] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00 [ 592.946769] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 592.946851] sd 2:0:0:0: [sdc] 390721968 512-byte hardware sectors (200050 MB) [ 592.946906] sd 2:0:0:0: [sdc] Write Protect is off [ 592.946908] sd 2:0:0:0: [sdc] Mode Sense: 00 3a 00 00 [ 592.947001] sd 2:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 592.947005] sdc: sdc1 [ 592.950021] sd 2:0:0:0: [sdc] Attached SCSI disk [ 592.950047] sd 2:0:0:0: Attached scsi generic sg1 type 0 After umount /mnt (which was already dead): [ 837.935980] scsi 2:0:0:0: rejecting I/O to dead device [ 837.943410] scsi 2:0:0:0: rejecting I/O to dead device [ 837.943413] Buffer I/O error on device sda1, logical block 0 [ 837.943452] lost page write due to I/O error on sda1 [ 837.943456] EXT2-fs error (device sda1): ext2_get_inode: unable to read inode block - inode=11, block=1027 [ 842.438081] scsi 2:0:0:0: rejecting I/O to dead device [ 842.445461] scsi 2:0:0:0: rejecting I/O to dead device [ 842.445464] Buffer I/O error on device sda1, logical block 0 [ 842.445466] lost page write due to I/O error on sda1 [ 842.445588] EXT2-fs error (device sda1): ext2_get_inode: unable to read inode block - inode=2, block=1027 [ 842.445675] scsi 2:0:0:0: rejecting I/O to dead device [ 842.445678] Buffer I/O error on device sda1, logical block 0 [ 842.445680] lost page write due to I/O error on sda1
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20080217152243.GA51742>