Date: Wed, 25 Apr 2007 09:41:43 +0200
From: Johan Ström <johan@stromnet.se>
To: freebsd-stable@freebsd.org
Subject: ATA driver/gmirror problems, multiple boxes...
Message-ID: <A5C0BBF4-8954-445B-B691-90358A2DA819@stromnet.se>
Hello,

I have a few boxes, elfi, crus and gw-1, running gmirror. These are three completely different boxes, all running FreeBSD 6.1 or 6.2. They all have multiple disks which are gmirrored: two of them are SATA-only, and one has a mirror between one SATA and one ATA disk. Every now and then they all have problems with the mirrors, all three in different ways. elfi crashes most often, but it is also the one with the most disk I/O, so that might be "expected" (not that it crashes, but that it is the one crashing most often).

First, some HW specs:

elfi:
FreeBSD elfi.stromnet.se 6.2-RELEASE FreeBSD 6.2-RELEASE #9: Thu Jan 18 16:53:20 CET 2007 root@:/usr/obj/usr/src/sys/ELFI i386
atapci1: <nVidia nForce3 Pro SATA150 controller> port 0x9f0-0x9f7,0xbf0-0xbf3,0x970-0x977,0xb70-0xb73,0xdc00-0xdc0f,0xe000-0xe07f irq 21 at device 10.0 on pci0
ad4: 286187MB <Maxtor 7L300S0 BANC1G10> at ata2-master SATA150
ad6: 286187MB <Maxtor 7L300S0 BANC1G10> at ata3-master SATA150
Mirror gm0s1 consists of ad4+ad6

crus:
FreeBSD crus.stromnet.org 6.1-RELEASE FreeBSD 6.1-RELEASE #3: Tue May 9 20:40:23 CEST 2006 johan@elfi.stromnet.org:/usr/obj/usr/src/sys/GENERIC i386
atapci1: <Promise PDC40518 SATA150 controller> port 0x7480-0x74ff,0x7800-0x78ff mem 0xfebdb000-0xfebdbfff,0xfebe0000-0xfebfffff irq 22 at device 14.0 on pci1
ad8: 305245MB <Seagate ST3320620AS 3.AAE> at ata4-master SATA150
ad12: 305245MB <Seagate ST3320620AS 3.AAE> at ata6-master SATA150
Mirror gm1 consists of ad8+ad12

gw-1:
FreeBSD gw-1.stromnet.se 6.2-RELEASE-p1 FreeBSD 6.2-RELEASE-p1 #7: Tue Feb 13 18:24:34 CET 2007 johan@elfi.stromnet.se:/usr/obj/usr/src/sys/ROUTER.POLLING i386
atapci0: <nVidia nForce2 Pro UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 9.0 on pci0
atapci1: <nVidia nForce2 Pro SATA150 controller> port 0xec00-0xec07,0xe880-0xe883,0xe800-0xe807,0xe480-0xe483,0x7f00-0x7f0f,0x7c00-0x7c7f irq 20 at device 11.
ad2: 38166MB <WDC WD400BB-00CAA1 17.07W17> at ata1-master UDMA100
ad6: 152627MB <SAMSUNG HD160JJ ZM100-41> at ata3-master SATA150
Mirror gm0 consists of ad6s1+ad2

A typical crash on elfi looks like this:

Apr 24 05:20:27 elfi kernel: ad6: FAILURE - device detached
Apr 24 05:20:27 elfi kernel: subdisk6: detached
Apr 24 05:20:27 elfi kernel: ad6: detached
Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6 disconnected.
Apr 24 05:20:27 elfi kernel: g_vfs_done():mirror/gm0s1f[READ (offset=16972791808, length=16384)]error = 6

This can happen at any time of day; this one was from around 5 in the morning. To recover, I have to reboot the box (a soft reboot works), and the mirror then rebuilds once it is back up. Before the reboot, atacontrol cannot find the disk at all. I have tried reinit and detach/attach, but neither helps.

A crash on crus can look like this:

Apr 23 13:45:49 crus kernel: ad8: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=566657039
Apr 23 13:46:14 crus kernel: ad8: WARNING - READ_DMA48 UDMA ICRC error (retrying request) LBA=566657039
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
Apr 23 13:46:14 crus kernel: ad8: FAILURE - READ_DMA48 timed out LBA=566657039
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Request failed (error=5). ad8[READ(offset=290128403968, length=16384)]
Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.
On this box, a gmirror forget followed by a gmirror insert is enough, and it will happily rebuild the array.

The worst box is gw-1:

Apr 20 03:10:59 gw-1 kernel: ad2: timeout waiting to issue command
Apr 20 03:10:59 gw-1 kernel: ad2: error issuing WRITE_DMA command
Apr 20 03:10:59 gw-1 kernel: GEOM_MIRROR: Request failed (error=5). ad2[WRITE(offset=37578448384, length=16384)]
Apr 20 03:10:59 gw-1 kernel: GEOM_MIRROR: Device gm0: provider ad2 disconnected.
Apr 20 07:23:57 gw-1 syslogd: kernel boot file is /boot/kernel/kernel
Apr 20 07:23:57 gw-1 kernel: Copyright (c) 1992-2007 The FreeBSD Project.

Yes, it fails and then the whole box totally HANGS. No input is possible at all, and I had to hard-reboot it with the button. Not good at all. The disks that are now in elfi used to run in this machine, and back then I had the same problem: disk problems -> total hang. That was with SATA only; this appears to be a problem with the ATA disk too?

I have never managed to force these crashes. They appear now and then, at no regular intervals, and I can never reproduce them on demand.

For elfi:

Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6 disconnected.

(I actually can't find any other entry in the logs, but judging from IRC logs: March 28, March 12, Feb 13, Jan 22, Jan 18.)

For crus:

Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.
Apr 13 09:57:49 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.

I think it has happened once more, but that's it.

For gw-1 it has luckily happened only once so far, at least with the current install. It also had problems when the Maxtor disks were running in it (and I think it was on 6.0 back then).

So: three different boxes, with three different chipsets and three different crash scenarios, but they all have problems. So where is the actual problem? The HW? The chipset drivers? The gmirror code?
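The per-box disconnect history above was collected by hand from the logs. As a rough sketch only, a small script could do the same tally. The function name `collect_disconnects` is made up for illustration, and the regular expression assumes the syslog line format shown in the excerpts above. The sample lines are copied from this message; in practice the input would be the lines of a log file such as /var/log/messages (an assumption about where the kernel messages end up).

```python
import re
from collections import defaultdict

# Assumed line format, taken from the log excerpts in this message, e.g.:
#   Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.
DISCONNECT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kernel: GEOM_MIRROR: Device (?P<mirror>\S+): "
    r"provider (?P<provider>\S+) disconnected\.$"
)


def collect_disconnects(lines):
    """Group GEOM_MIRROR disconnect timestamps by (host, mirror, provider).

    Hypothetical helper: lines that do not match the pattern are ignored.
    """
    events = defaultdict(list)
    for line in lines:
        match = DISCONNECT_RE.match(line.strip())
        if match:
            key = (match["host"], match["mirror"], match["provider"])
            events[key].append(match["stamp"])
    return dict(events)


# Sample input copied from the log excerpts in this message.
sample = [
    "Apr 13 09:57:49 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.",
    "Apr 23 13:45:49 crus kernel: ad8: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=566657039",
    "Apr 23 13:46:14 crus kernel: GEOM_MIRROR: Device gm1: provider ad8 disconnected.",
    "Apr 24 05:20:27 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad6 disconnected.",
]

for (host, mirror, provider), stamps in sorted(collect_disconnects(sample).items()):
    print(f"{host} {mirror} {provider}: {len(stamps)} disconnect(s) at {', '.join(stamps)}")
```

Grouping by host, mirror and provider makes it easier to see whether it is always the same disk that drops out of a given mirror, which is the case for ad8 on crus in the entries above.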
I have run SMART tests on the crashing disks: no errors. I ran PowerMax (Maxtor's own test program) on the Maxtor disks a while back: no problems. I have tried changing SATA cables on some of the disks: no difference.

Does anyone have any clue about what can be causing this? What is most likely? How do we hunt this down?

Thank you.

Johan Ström
Stromnet
johan@stromnet.se
http://www.stromnet.se/help
