Date: Tue, 10 Apr 2012 15:26:10 -0700
From: John Hickey <jjh@deterlab.net>
To: freebsd-scsi@freebsd.org
Subject: Re: Write Timeouts with MPS
Message-ID: <A82E913C-05F6-4770-A8BF-1193780ACE76@deterlab.net>
In-Reply-To: <4F848B93.10402@brockmann-consult.de>
References: <20120410015210.GI9589@deterlab.net> <4F848B93.10402@brockmann-consult.de>

I have 19 drives in my array, so changing them isn't that easy. ;-)
They are Seagate Constellation ES 2TB SAS drives (SEAGATE ST2000NM0001
0001) and according to LSI documents my whole setup should be supported.
The drives at least aren't being marked as failed. I believe a change
was made a while back to make FreeBSD less sensitive to these sorts of
timeouts. I have had a panic or two on the system, but haven't tracked
down the exact cause yet.

John

On Apr 10, 2012, at 12:35 PM, Peter Maloney wrote:

> I found this only happens with specific disks / disk firmware... but
> nobody seems to listen to me about it. They all seem to blame the
> driver. (I blame both, but changing disks is a simple fix.)
>
> And looking around, most reports are with various Seagates (including
> one that can cause this type of error with smartctl -a with a SAS
> Seagate, but cannot reproduce with the binary LSI driver) or Samsung
> Spinpoints. The only other disk I know of that does this is a Crucial
> SSD with old firmware. One guy said he can do a camcontrol rescan to
> get it back; I tried that and get either panics, hangs, or nothing.
>
> What HBA are you using? With my LSI 9211-8i HBAs, the new 3TB Seagate
> greens don't seem to have this problem. I have no idea if different
> disks behave differently with different controllers. I asked Seagate
> about it and they reply with marketing nonsense about buying
> enterprise disks instead, and say I should buy disks that are on the
> specific compatibility list for the HBA.
>
> I found that with the few disks that I have that fail randomly (and
> others), I can reproduce the issue (not exact same symptoms though) by
> hot pulling the disk while writing something, putting it back, wait a
> few seconds (<10; less than enough for the SCSI controller to rescan)
> pull and replace again. The old 2TB Seagate greens fail this test, but
> the 3TB ones pass.
> All 2 and 3 TB Hitachis I tried pass this test, as
> well as 3TB WD greens. (all enterprise disks I tried pass this test
> except the Toshiba 2TB ones I tried)
>
> If I put a "failed" disk back in, it does not work. If I put it in a
> different slot, same. But if I put any other disk in, it works fine.
> So it is the disk, but it is also FreeBSD not being able to
> reset/rescan it. But it is simple enough to blame both, and since you
> can't get rid of the driver, get different disks (eg. swap them with
> some different same sized ones in a different machine).
>
> Here is my forum thread about it, including disk product ids for ones
> I tested, and a huge list of things that don't fix it.
> http://forums.freebsd.org/showthread.php?t=28252
>
> Peter
>
>
> On 10.04.2012 03:52, John Hickey wrote:
>> I've seen people having this problem before, but I don't think anyone
>> has figured it out. I am running:
>>
>> FreeBSD zfs 10.0-CURRENT FreeBSD 10.0-CURRENT #5: Sat Apr 7 18:05:57 PDT 2012 root@zfs:/usr/obj/usr/src/sys/GENERIC amd64
>>
>> I have the latest LSI IT firmware 13 loaded:
>>
>> mps1: <LSI SAS2008> port 0xc000-0xc0ff mem 0xfe93c000-0xfe93ffff,0xfe940000-0xfe97ffff irq 16 at device 0.0 on pci5
>> mps1: Firmware: 13.00.01.00, Driver: 13.00.00.00-fbsd
>> mps1: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
>>
>> All disks are on a SuperMicro SAS II backplane:
>>
>> root@zfs:/usr/ports/sysutils/dmidecode# camcontrol devlist
>> <SEAGATE ST3300657SS 0008> at scbus0 target 0 lun 0 (da0,pass0)
>> <SEAGATE ST3300657SS 0008> at scbus0 target 1 lun 0 (da1,pass1)
>> <SEAGATE ST2000NM0001 0001> at scbus1 target 8 lun 0 (da2,pass2)
>> .... x16 more of the same
>> <SEAGATE ST2000NM0001 0001> at scbus1 target 46 lun 0 (da20,pass20)
>> <LSI CORP SAS2X36 0717> at scbus1 target 47 lun 0 (ses0,pass21)
>>
>> Essentially when putting the ZFS filesystem under load, I am getting
>> these sorts of errors:
>>
>> (da13:mps1:0:21:0): WRITE(10). CDB: 2a 0 19 29 32 f2 0 1 0 0 length 131072 SMID 213 terminated ioc 804b scsi 0 state c xfer 0
>> (da7:mps1:0:13:0): WRITE(10). CDB: 2a 0 19 3d fa ae 0 1 0 0 length 131072 SMID 386 terminated ioc 804b scsi 0 state c xfer 0
>> (da11:mps1:0:18:0): WRITE(10). CDB: 2a 0 18 a 24 ee 0 1 0 0 length 131072 SMID 542 terminated ioc 804b scsi 0 state c xfer 0
>> (da14:mps1:0:22:0): WRITE(10). CDB: 2a 0 19 2a c6 b1 0 1 0 0 length 131072 SMID 214 terminated ioc 804b scsi 0 state c xfer 0
>> (da16:mps1:0:25:0): WRITE(10). CDB: 2a 0 19 2b 83 aa 0 1 0 0 length 131072 SMID 879 terminated ioc 804b scsi 0 state c xfer 0
>> (da7:mps1:0:13:0): WRITE(10). CDB: 2a 0 19 40 d f9 0 1 0 0 length 131072 SMID 474 terminated ioc 804b scsi 0 state c xfer 0
>> (da9:mps1:0:15:0): WRITE(10). CDB: 2a 0 18 c 3 31 0 1 0 0 length 131072 SMID 578 terminated ioc 804b scsi 0 state c xfer 0
>> (da4:mps1:0:10:0): WRITE(10). CDB: 2a 0 19 41 6f ff 0 1 0 0 length 131072 SMID 703 terminated ioc 804b scsi 0 state c xfer 0
>> (da12:mps1:0:19:0): WRITE(10). CDB: 2a 0 18 c e5 2e 0 1 0 0 length 131072 SMID 684 terminated ioc 804b scsi 0 state c xfer 0
>> (da3:mps1:0:9:0): WRITE(10). CDB: 2a 0 19 41 b1 4b 0 1 0 0 length 131072 SMID 212 terminated ioc 804b scsi 0 state c xfer 0
>> (da9:mps1:0:15:0): WRITE(10). CDB: 2a 0 18 d 1e 5c 0 1 0 0 length 131072 SMID 63 terminated ioc 804b scsi 0 state c xfer 0
>> (da11:mps1:0:18:0): WRITE(10). CDB: 2a 0 18 d 56 1c 0 1 0 0 length 131072 SMID 412 terminated ioc 804b scsi 0 state c xfer 0
>> (da4:mps1:0:10:0): WRITE(10). CDB: 2a 0 19 42 2c f1 0 1 0 0 length 131072 SMID 1019 terminated ioc 804b scsi 0 state c xfer 0
>> (da11:mps1:0:18:0): WRITE(10). CDB: 2a 0 18 d 6d 22 0 1 0 0 length 131072 SMID 175 terminated ioc 804b scsi 0 state c xfer 0
>> (da7:mps1:0:13:0): WRITE(10). CDB: 2a 0 19 42 62 bc 0 1 0 0 length 131072 SMID 458 terminated ioc 804b scsi 0 state c xfer 0
>> (da10:mps1:0:16:0): WRITE(10). CDB: 2a 0 18 f 4b d2 0 1 0 0 length 131072 SMID 986 terminated ioc 804b scsi 0 state c xfer 0
>> (da3:mps1:0:9:0): WRITE(10). CDB: 2a 0 19 43 f4 50 0 1 0 0 length 131072 SMID 809 terminated ioc 804b scsi 0 state c xfer 0
>> (da2:mps1:0:8:0): WRITE(10). CDB: 2a 0 19 45 4 18 0 1 0 0 length 131072 SMID 998 terminated ioc 804b scsi 0 state c xfer 0
>> (da13:mps1:0:21:0): WRITE(10). CDB: 2a 0 19 30 e4 73 0 1 0 0 length 131072 SMID 489 terminated ioc 804b scsi 0 state c xfer 0
>> (da12:mps1:0:19:0): WRITE(10). CDB: 2a 0 18 10 8d 19 0 1 0 0 length 131072 SMID 275 terminated ioc 804b scsi 0 state c xfer 0
>> (da14:mps1:0:22:0): WRITE(10). CDB: 2a 0 19 32 e7 0 0 1 0 0 length 131072 SMID 666 terminated ioc 804b scsi 0 state c xfer 0
>> (da8:mps1:0:14:0): WRITE(10). CDB: 2a 0 18 13 2b 68 0 1 0 0 length 131072 SMID 463 terminated ioc 804b scsi 0 state c xfer 0
>> _______________________________________________
>> freebsd-scsi@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
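
For anyone comparing the failed writes across drives, the CDB bytes in the log lines above can be decoded by hand or with a short script. Below is a minimal sketch in Python, assuming the standard WRITE(10) layout (byte 0 opcode 0x2a, bytes 2-5 big-endian LBA, bytes 7-8 big-endian transfer length in blocks) and assuming 512-byte logical blocks, which is consistent with the logged length of 131072 for a 256-block transfer. The function name decode_write10 and the regular expression are illustrative only and are not part of the mps driver or any FreeBSD tool.

```python
import re

# Matches the device name, the ten CDB bytes, and the logged length in a
# console line such as:
#   (da13:mps1:0:21:0): WRITE(10). CDB: 2a 0 19 29 32 f2 0 1 0 0 length 131072 ...
# The optional "= " tolerates a stray soft line break left by mail wrapping.
LINE_RE = re.compile(
    r"\((?P<dev>da\d+):[^)]*\): WRITE\(10\)\. "
    r"CDB: (?P<cdb>(?:[0-9a-f]{1,2} ){10})"
    r"length (?:= )?(?P<length>\d+)"
)


def decode_write10(line, block_size=512):
    """Decode one WRITE(10) log line; return None if the line does not match.

    block_size is an assumption (512-byte logical blocks), not read from the log.
    """
    m = LINE_RE.search(line)
    if m is None:
        return None
    cdb = [int(b, 16) for b in m.group("cdb").split()]
    if cdb[0] != 0x2A:  # 0x2a is the WRITE(10) opcode
        return None
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2-5: starting LBA
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7-8: transfer length
    return {
        "dev": m.group("dev"),
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
        "logged_length": int(m.group("length")),
    }


if __name__ == "__main__":
    sample = (
        "(da13:mps1:0:21:0): WRITE(10). CDB: 2a 0 19 29 32 f2 0 1 0 0 "
        "length 131072 SMID 213 terminated ioc 804b scsi 0 state c xfer 0"
    )
    print(decode_write10(sample))
```

Every write in the log is a 256-block (128 KiB) transfer, so a decoder like this is mostly useful for checking whether the affected LBAs cluster in one region of the disks or are spread across them.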