Date: Sun, 01 Mar 2009 17:58:30 +0200
From: Alexander Motin <mav@mavhome.dp.ua>
To: Elliot Schlegelmilch <elliot+list@schlegelmilch.org>
Cc: FreeBSD-Current <freebsd-current@freebsd.org>
Subject: Re: SATA disks suddenly stop working
Message-ID: <49AAB0A6.3040304@mavhome.dp.ua>
In-Reply-To: <1235863381.00080963.1235851802@10.7.7.3>
References: <go44ht$2i6a$1@FreeBSD.cs.nctu.edu.tw> <1235602472.00079680.1235592003@10.7.7.3> <1235658185.00079898.1235647801@10.7.7.3> <1235863381.00080963.1235851802@10.7.7.3>

Elliot Schlegelmilch wrote:
> On Thu, Feb 26, 2009 at 12:22:12PM +0100, Gary Jennejohn wrote:
>> On Wed, 25 Feb 2009 21:56:38 +0200
>> Alexander Motin <mav@FreeBSD.org> wrote:
>>
>>> Gary Jennejohn wrote:
>>>> I've been having lots of problems with SATA drives attached to higher
>>>> port numbers, namely ata5 and ata6.
>>>>
>>>> I was installing Linux under qemu today and it had been running for
>>>> several hours and had installed multi-gigabytes of data when qemu
>>>> just stopped.
>>>>
>>>> I noticed that all I/O to the disk had ceased.
>>>>
>>>> Doing "atacontrol reinit" on the port (ata5) resulted in a message
>>>> that the device was not configured, which was patently false since
>>>> qemu had just been merrily writing to it.
>>>>
>>>> This with a kernel made from sources updated today at about 2 PM (GMT+1).
>>>>
>>>> I've also seen problems with a disk attached to ata6. It just sort
>>>> of disappears after a while.
>>>>
>>>> Disks attached to ata2, ata3 and ata4 don't exhibit any problems.
>>> You have told much and same time gave nothing that can be used.
>>>
>> I was only interested in whether others have seen this problem. I was
>> not looking for a solution.
>>
>>> What controller do you have? What drives on what channels? Is there any
>>> kernel messages about the problem? Have you tried to enable verbose
>>> messages to get additional details?
>>>
>> atapci0@pci0:0:17:0: class=0x010601 card=0xb0021458 chip=0x43911002 rev=0x00 hdr=0x00
>>     vendor = 'ATI Technologies Inc'
>>     class = mass storage
>>     subclass = SATA
>>
>> There were no kernel messages at all, the drive simply hung.
>>
>> I'll do a verbose boot and try to reproduce the disk hang later.
>>
>>> Reinit could return ENXIO if it already was in progress. Disappearing
>>> drives are also can be related to that reinit. Can't it be just a real
>>> hardware problem?
>>>
>> I should have mentioned that the error returned was about some IOCTL.
>> Can't remember which one right now, but the error message did include
>> that the device was not configured.
>>
>> I've also noticed several times in the past when the problem occurred
>> that the BIOS could not enumerate the AHCI disks anymore. I had to
>> do a POR. Seems that the controller was completely hosed such that
>> a simple reset didn't reinitialize it sufficiently for it to work.
>>
>> This morning I booted the box and started a cvsup. My repository is
>> on a ZFS mirror with the disks on ata3 and ata4. The system hung after
>> the data from the server were received, although all the data were
>> successfully written to the disks.
>>
>> I couldn't do anything at all - it looked like the root disk was not
>> responding and the disk light was on solid red. I had to do a hard
>> reset.
>>
>> This is the first time I've seen a problem with this port. The root
>> disk is on ata2.
>>
>> I rebooted and turned off MSI. I'll monitor the situation to see
>> whether that helps.
>
> I don't mean to hijack your thread, but I've had problems with one of
> my SATA disks falling off the bus. I could usually retrieve it with
> an atacontrol detach / retach. However, with a recent kernel all I'm
> getting is this:
>
> ata2: <ATA channel 0> on atapci1
> ata2: AHCI reset...: 2
> ata2: SATA connect time=0ms
> ata2: ready wait time=0ms52 (12272 MB)
> ata2: software reset port 15...
> ata2: ahci_issue_cmd timeout: 100 of 100ms, status=00000001
> ata2: software reset set timeout
> ata2: software reset port 0...
> ata2: ahci_issue_cmd timeout: 100 of 100ms, status=00000001
> ata2: software reset set timeout
> ata2: SIGNATURE: ffffffff
> ata2: Unknown signature, assuming disk device
> ata2: AHCI reset done: devices=00000001
> ata2: [MPSAFE]
> ata2: [ITHREAD]
>
> One for each channel, up to ata7.

Does it happen during boot or what do you mean by unable to reattach drive now?

> atapci0@pci0:0:31:1: class=0x01018a card=0x948115d9 chip=0x269e8086 rev=0x09 hdr=0x00
>     vendor = 'Intel Corporation'
>     device = '631xESB/632xESB/3100 Ultra ATA Storage Controller'
>     class = mass storage
>     subclass = ATA
>
> The last known kernel which works was Dec 17, but trying to rebuild a
> kernel from that date doesn't see the SATA disks either (as the kernel
> which sees the disks zfs doesn't work.) Or perhaps I'm csup'ing
> incorrectly.

Have you tried just booting the previous kernel from the kernel.old
directory? Or have you already overwritten it with

> I'm still trying to back up far enough so it will work.

Feb 14 should be fine. I touched the reset sequence on the 15th.

Once you manage to boot, can you try some experiments against HEAD?
Maybe one of them will fix the problem:

1) comment out this line inside ata_ahci_issue_cmd():

    ATA_OUTL(ctlr->r_res2, ATA_AHCI_P_FBS + offset, (port << 8) | 0x00000001);

2) comment out these lines inside ata_sata_phy_reset():

    if ((ATA_IDX_INL(ch, ATA_SCONTROL) & ATA_SC_DET_MASK) == ATA_SC_DET_IDLE)
        return ata_sata_connect(ch);

3) comment out the first occurrence of this line inside ata_ahci_softreset():

    return (-1);

Thanks.

-- 
Alexander Motin