Date:      Mon, 19 May 2008 01:11:54 +1000
From:      Andrew Hill <lists@thefrog.net>
To:        Jeremy Chadwick <koitsu@FreeBSD.org>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: ZFS lockup in "zfs" state
Message-ID:  <93F07874-8D5F-44AE-945F-803FFC3B9279@thefrog.net>
In-Reply-To: <20080518124217.GA16222@eos.sc1.parodius.com>
References:  <683A6ED2-0E54-42D7-8212-898221C05150@thefrog.net> <20080518124217.GA16222@eos.sc1.parodius.com>


On 18/05/2008, at 10:42 PM, Jeremy Chadwick wrote:
> One thing: are the timeouts always on ad0 and ad2?

Firstly, some relevant output from my dmesg:
atapci0: <nVidia nForce CK804 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 6.0 on pci0
atapci1: <nVidia nForce CK804 SATA300 controller> port 0x9f0-0x9f7,0xbf0-0xbf3,0x970-0x977,0xb70-0xb73,0xcc00-0xcc0f mem 0xf4005000-0xf4005fff irq 21 at device 7.0 on pci0
atapci2: <nVidia nForce CK804 SATA300 controller> port 0x9e0-0x9e7,0xbe0-0xbe3,0x960-0x967,0xb60-0xb63,0xe000-0xe00f mem 0xf4000000-0xf4000fff irq 22 at device 8.0 on pci0
atapci3: <SiI SiI 3114 SATA150 controller> port 0x8400-0x8407,0x8800-0x8803,0x8c00-0x8c07,0x9000-0x9003,0x9400-0x940f mem 0xf1004000-0xf10043ff irq 17 at device 9.0 on pci1

<snip>

ad0: 238475MB <Hitachi HDS722525VLAT80 V36OA60A> at ata0-master UDMA100
ad2: 238475MB <WDC WD2500PB-98FBA0 15.05R15> at ata1-master UDMA100
ad3: 152627MB <Seagate ST3160812A 3.AAE> at ata1-slave UDMA100
ad4: 476940MB <Seagate ST3500320AS SD15> at ata2-master SATA300
ad6: 715404MB <Seagate ST3750330AS SD15> at ata3-master SATA300
ad8: 305245MB <Seagate ST3320620AS 3.AAK> at ata4-master SATA300
ad10: 305245MB <Seagate ST3320620AS 3.AAE> at ata5-master SATA300
ad12: 305245MB <Seagate ST3320620AS 3.AAE> at ata6-master SATA150

And to answer the question: no, I get timeouts on ad0, 2, 4, 6, 8, 10 and 12, but when they occur it's always one or two disks at a time.

For various reasons (primarily space and low cost, not performance) I have a 7-disk raidz covering a 250GB slice on each of the above 7 disks, and I've made two more zpools from the remaining space on the drives. Yes, I realise this is a bit of a mess, and anyone who's set up any kind of production RAID would be appalled, but the aim was to make use of some old disks more than to have a fast, clean setup.
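
For reference, the layout was built roughly like this (pool names and slice numbers here are illustrative; I don't have the exact command history in front of me):

# 250GB slice on each of the 7 raidz disks (ad3 stays out of the pool)
zpool create tank raidz ad0s1 ad2s1 ad4s1 ad6s1 ad8s1 ad10s1 ad12s1
# two more (non-redundant) pools striped across the leftover slices
zpool create scratch1 ad4s2 ad6s2
zpool create scratch2 ad8s2 ad10s2 ad12s2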

ad0,2,3 are on the nvidia (southbridge) ata controller
ad4,6,8,10 are on the nvidia (southbridge) sata controller
ad12 is on the SiI 3114 controller
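
That mapping is straightforward to double-check, since atacontrol lists each ATA channel with its master/slave devices, and dmesg shows which ataX channel hangs off which atapciX controller:

atacontrol list
dmesg | grep '^ata[0-9]'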

So perhaps I can contribute something useful here because of my (odd) setup?

My timeouts aren't limited to any one drive, controller or connector type: I've had timeouts on all 7 of the drives in the raidz. (I've yet to see a timeout on ad3, but that disk is rarely accessed, so I'm not entirely surprised.)

I tend to find that the timeouts occur on one or two disks at once; e.g. ad0 and ad2 will complain of timeouts, and the system locks up shortly thereafter.

The pairs seem to be grouped by ATA controller, which is to say I often get ad0 and ad2 timeouts together, or two of ad4/6/8/10, or ad12 on its own. I'm not 100% sure, as I haven't recorded the pairs each time, but there seems to be a strong correlation between the drives giving timeouts and the controller they're running on. This might imply it's a bug in the controller driver, or it might simply be an effect of the timing of the writes at some level. The correlation seems interesting, though, and I've only just noticed it, so I'll be keeping track of future timeouts to see if they consistently pair up within a controller.
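
To save doing that by eye, something like the following should tally the timeouts per drive, assuming the usual "adX: TIMEOUT - ..." kernel message format:

# count DMA timeouts per drive from the current log
grep -o 'ad[0-9]*: TIMEOUT' /var/log/messages | sort | uniq -c | sort -rn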

There is the obvious power question (8 drives in a standard PC case; my initial guess was power), but I've hooked up a multimeter (a Fluke 111) to log the 5V and 12V rails going to the drives, and they've been a steady 5.4V and 12.3V, including during a timeout and lockup. Both varied by less than 0.1V over fairly long test periods, so I don't think it's power, but I'm willing to keep testing anything.

I've also run memtest86 on the RAM, fearing that might have been the cause; it passed.

> It is possible you have some bad hardware, but there are many of us who
> have seen the above (with or without ZFS) on perfectly good hardware.
> For some, changing cables fixed the problem, while for others absolutely
> nothing fixed it (changed cables, changed controller brands, changed to
> new disks).

I'm inclined to think the disks and cables themselves are good, given the timeouts aren't specific to one disk; the RAM is okay (from the memtest at least); and since the timeouts are occurring on multiple controllers, the controllers are probably okay too. (I guess it could still be the northbridge or the bus...)
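
For what it's worth, the drives' own SMART counters might help settle the disk/cable question; smartmontools talks to these drives, e.g.:

smartctl -a /dev/ad0    # look at Reallocated_Sector_Ct and UDMA_CRC_Error_Count

A climbing UDMA_CRC_Error_Count would point at cabling, so flat counters across all the drives would back up the "disks and cables are good" theory.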

> If the DMA timeouts are easily reproducible, please get in touch with
> Scott Long <scottl@samsco.org>, who is in the process of researching
> why these happen.  Serial console access might be required.

Will do; thanks for the contacts/wiki page (:

Andrew


