Date: Mon, 19 May 2008 01:11:54 +1000
From: Andrew Hill <lists@thefrog.net>
To: Jeremy Chadwick <koitsu@FreeBSD.org>
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS lockup in "zfs" state
Message-ID: <93F07874-8D5F-44AE-945F-803FFC3B9279@thefrog.net>
In-Reply-To: <20080518124217.GA16222@eos.sc1.parodius.com>
References: <683A6ED2-0E54-42D7-8212-898221C05150@thefrog.net> <20080518124217.GA16222@eos.sc1.parodius.com>
On 18/05/2008, at 10:42 PM, Jeremy Chadwick wrote:

> One thing: are the timeouts always on ad0 and ad2?

Firstly, some relevant output from my dmesg:

atapci0: <nVidia nForce CK804 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 6.0 on pci0
atapci1: <nVidia nForce CK804 SATA300 controller> port 0x9f0-0x9f7,0xbf0-0xbf3,0x970-0x977,0xb70-0xb73,0xcc00-0xcc0f mem 0xf4005000-0xf4005fff irq 21 at device 7.0 on pci0
atapci2: <nVidia nForce CK804 SATA300 controller> port 0x9e0-0x9e7,0xbe0-0xbe3,0x960-0x967,0xb60-0xb63,0xe000-0xe00f mem 0xf4000000-0xf4000fff irq 22 at device 8.0 on pci0
atapci3: <SiI SiI 3114 SATA150 controller> port 0x8400-0x8407,0x8800-0x8803,0x8c00-0x8c07,0x9000-0x9003,0x9400-0x940f mem 0xf1004000-0xf10043ff irq 17 at device 9.0 on pci1
<snip>
ad0: 238475MB <Hitachi HDS722525VLAT80 V36OA60A> at ata0-master UDMA100
ad2: 238475MB <WDC WD2500PB-98FBA0 15.05R15> at ata1-master UDMA100
ad3: 152627MB <Seagate ST3160812A 3.AAE> at ata1-slave UDMA100
ad4: 476940MB <Seagate ST3500320AS SD15> at ata2-master SATA300
ad6: 715404MB <Seagate ST3750330AS SD15> at ata3-master SATA300
ad8: 305245MB <Seagate ST3320620AS 3.AAK> at ata4-master SATA300
ad10: 305245MB <Seagate ST3320620AS 3.AAE> at ata5-master SATA300
ad12: 305245MB <Seagate ST3320620AS 3.AAE> at ata6-master SATA150

To answer the question: no, I get timeouts on ad0, 2, 4, 6, 8, 10 and 12, but when they occur it's always one or two disks at a time.

For various reasons (primarily focusing on space and low cost, not performance) I have a 7-disk raidz covering a 250GB slice on each of the above seven disks, and I've made two more zpools from the remaining space on the drives (there's a rough sketch of the layout below). Yes, I realise this is a bit of a mess, and anyone who's set up any kind of production RAID would be appalled, but the aim was to make use of some old disks more so than to have a fast/clean setup.

ad0, 2 and 3 are on the nVidia (southbridge) ATA controller; ad4, 6, 8 and 10 are on the nVidia (southbridge) SATA controller; ad12 is on the SiI 3114 controller.

So perhaps I can contribute something useful here because of my (odd) setup? My timeouts aren't limited to any one drive/controller/connector type: I've had timeouts on all seven of the drives in the raidz (I've yet to see a timeout on ad3, but that disk is rarely accessed, so I'm not entirely surprised).

I tend to find that the timeouts occur on one or two disks at once, e.g. ad0 and 2 will complain of timeouts and the system locks up shortly thereafter. The pairs seem to be grouped by ATA controller, which is to say I often get ad0 and 2 timeouts together, or two of ad4, 6, 8 and 10, or ad12 on its own. I'm not 100% sure, as I've not recorded the pairs each time, but there seems to be a strong correlation between the drives giving timeouts and the controller they're running on. This might imply it's a bug in the controller driver, or it might simply be an effect of how the writes are timed at some level. The correlation seems interesting, though, and I've only just noticed it, so I'll be keeping track of future timeouts to see if they consistently pair up within a controller.

There is the obvious power question (eight drives in a standard PC case; my initial guess was power), but I've hooked up a (Fluke 111) multimeter to log the 5V and 12V rails going to the drives, and they've been a steady 5.4V and 12.3V (including during a timeout and lockup). Both varied by less than 0.1V over fairly long test periods, so I don't think it's power, but I'm willing to keep testing anything...
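For what it's worth, the pools were created roughly along these lines; the pool names and slice numbers here are just placeholders to show the shape of the layout, not the exact commands I ran:

  # zpool create tank raidz ad0s1 ad2s1 ad4s1 ad6s1 ad8s1 ad10s1 ad12s1
  # zpool create extra1 ad4s2 ad6s2
  # zpool create extra2 ad8s2 ad10s2 ad12s2

i.e. a ~250GB slice from each of the seven disks goes into the raidz, and the leftover space on the bigger disks is lumped into two plain pools (which grouping of leftovers went into which pool is illustrative here).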
I've also run memtest86 on the RAM, fearing that might have been the cause...

> It is possible you have some bad hardware, but there are many of us who
> have seen the above (with or without ZFS) on perfectly good hardware.
> For some, changing cables fixed the problem, while for others absolutely
> nothing fixed it (changed cables, changed controller brands, changed to
> new disks).

I'm inclined to think that the disks/cables themselves are good (given the timeouts aren't specific to one disk), and since the RAM is okay (from the memtest at least) and the timeouts occur on multiple controllers, I think the controllers are probably okay too... (I guess it could still be the northbridge or the bus...)

> If the DMA timeouts are easily reproducible, please get in touch with
> Scott Long <scottl@samsco.org>, who is in the process of researching why
> these happen. Serial console access might be required.

Will do, thanks for the contacts/wiki page (:

Andrew