Date: Thu, 18 Jul 2013 09:25:19 +0100
From: Bob Bishop <rb@gid.co.uk>
To: Dr Josef Karthauser <joe@karthauser.co.uk>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>, "freebsd-stable@freebsd.org" <freebsd-stable@freebsd.org>
Subject: Re: Drive failures with ada on FreeBSD-9.1, driver bug or wiring issue?
Message-ID: <281DBD06-81D5-4DDD-9464-B96C80C22C3F@gid.co.uk>
In-Reply-To: <60F7BE75-5E2F-471E-A9CE-AF4CD17D96E2@karthauser.co.uk>
References: <20130716225013.1C63B23A@babel.karthauser.co.uk> <60F7BE75-5E2F-471E-A9CE-AF4CD17D96E2@karthauser.co.uk>
Hi,

On 18 Jul 2013, at 08:29, Dr Josef Karthauser wrote:

> Hi there,
>
> I'm scratching my head. I've just migrated to a Supermicro chassis and at the same time gone from FreeBSD 9.0 to 9.1-RELEASE.
>
> The machine in question is running a ZFS mirror configuration on two ada devices (with an 8GB gmirror carved out for swap).
>
> Since doing so I've been having strange drop-outs on the drives; they just disappear from the bus like so:
>
> (ada2:ahcich2:0:0:0): removing device entry
> (aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error
> (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
> (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
> (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted
> (aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error
> (aprobe0:ahcich2:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error: 04 (ABRT )
> (aprobe0:ahcich2:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
> (aprobe0:ahcich2:0:0:0): Error 5, Retries exhausted
>
> At first I thought it was a failing drive: one of the drives did this, and I limped on a single drive for a week until I could get someone up to the rack to plug a third drive in. We resilvered the zpool onto the new device and ran with the failed drive still plugged in (but not responding to a reset on the ada bus with camcontrol) for a week or so.
>
> Then the new drive dropped out in exactly the same way, followed in short order by the remaining original drive!!!
>
> After rebooting the machine, and observing all three drives probing and available, I resilvered the gmirror and zpool again on the two devices that I thought were reliable, but before the resilvering was completed the new drive dropped out again.
>
> I'm scratching my head now. I can't imagine that it's a wiring problem, as they are all on individual SATA buses and individually cabled.
>
> SMART isn't reporting any drive issues either... :/
>
> So, I'm wondering, is it a driver issue with 9.1-RELEASE? If I upgrade to 9-RELENG would I expect that to resolve the problem? (Have there been any ada bus issues reported since last December?)
>
> The hardware in question is:
>
> ahci0: <Intel Cougar Point AHCI SATA controller> port 0xf050-0xf057,0xf040-0xf043,0xf030-0xf037,0xf020-0xf023,0xf000-0xf01f mem 0xdfb02000-0xdfb027ff irq 19 at device 31.2 on pci0
> ahci0: AHCI v1.30 with 6 3Gbps ports, Port Multiplier not supported
> ahcich0: <AHCI channel> at channel 0 on ahci0
> ahcich1: <AHCI channel> at channel 1 on ahci0
> ahcich2: <AHCI channel> at channel 2 on ahci0
> ahcich3: <AHCI channel> at channel 3 on ahci0
> ahcich4: <AHCI channel> at channel 4 on ahci0
> ahcich5: <AHCI channel> at channel 5 on ahci0
> ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
> ada0: <WDC WD1000FYPS-01ZKB0 02.01B01> ATA-8 SATA 2.x device
> ada0: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada0: Command Queueing enabled
> ada0: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
> ada0: Previously was known as ad4
> ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
> ada1: <WDC WD1000FYPS-01ZKB0 02.01B01> ATA-8 SATA 2.x device
> ada1: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada1: Command Queueing enabled
> ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
> ada1: Previously was known as ad6
> ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
> ada2: <WDC WD1000FYPS-01ZKB0 02.01B01> ATA-8 SATA 2.x device
> ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> ada2: Command Queueing enabled
> ada2: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
> ada2: Previously was known as ad8
>
> Any ideas would be greatly welcomed.
>
> Thanks,
> Joe

Me too (over a long period, with various hardware).

There is a general problem with energy-saving drives: controllers don't understand them. Typically the drive decides to go into some power-saving mode, the controller wants to do some operation, the drive takes too long to come ready, and the controller decides the drive has gone away.

You have to persuade the controller to wait longer for the drive to come ready, and/or persuade the drive to stay awake. This isn't necessarily easy, e.g. the controller's ready wait may not be programmable.

(Or avoid such drives like the plague; life's too short.)

--
Bob Bishop
rb@gid.co.uk
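
If it helps to see whether the drop-outs favour one drive or channel, or hit all of them about equally (which would fit the power-saving theory better than a single bad drive or cable), the removal events can be tallied from a saved copy of the kernel messages. The following is only a rough sketch in Python: the function name count_removals and the regular expression are my own, and it assumes the messages look exactly like the "removing device entry" line quoted above.

```python
import re
from collections import Counter

# Matches CAM messages of the form "(ada2:ahcich2:0:0:0): removing device entry".
# Assumption: the format is the one quoted in this thread; other releases may differ.
REMOVAL = re.compile(r"\((\w+):(\w+):\d+:\d+:\d+\): removing device entry")


def count_removals(log_text):
    """Return a Counter mapping (device, channel) to the number of removals seen."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REMOVAL.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Sample taken from the log excerpt quoted in this thread.
    sample = """\
(ada2:ahcich2:0:0:0): removing device entry
(aprobe0:ahcich2:0:0:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich2:0:0:0): CAM status: ATA Status Error
"""
    for (device, channel), n in count_removals(sample).items():
        print(f"{device} on {channel}: {n} removal(s)")
```

Feeding it the contents of the system log instead of the sample string gives one count per device/channel pair.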