Date: Fri, 25 Sep 2009 10:25:51 +0100
From: "Pegasus Mc Cleaft" <ken@mthelicon.com>
To: "James R. Van Artsdalen" <james-freebsd-current@jrv.org>
Cc: Alexander Motin <mav@freebsd.org>, FreeBSD Current <freebsd-current@freebsd.org>
Subject: Re: SIIS timeout with current r197392:
Message-ID: <659421F2E6F84E479FDEE3C9AB2E975F@PegaPegII>
In-Reply-To: <4ABC7D4F.3000202@jrv.org>
References: <200909212027.49752.ken@mthelicon.com> <4ABC7D4F.3000202@jrv.org>
----- Original Message -----
From: "James R. Van Artsdalen" <james-freebsd-current@jrv.org>

> Pegasus Mc Cleaft wrote:
>> Hello Current,
>>
>> Since my latest build of amd64-current kernel and world (r197392) I am
>> getting strange timeout errors in my dmesg and eventual system
>> instability.
>
> I believe mav has stated that error handling in SIIS isn't finished or
> is problematic in some way.
>
> I see similar problems: most of the time an error results in a hung
> device, requiring reboot. This usually happens within a TB or two of
> intense I/O: I have not yet seen a 6 TB ZFS pool complete a "scrub" due
> to this.

Hi James,

I believe I found the problem with my machine, and it _was_ my machine. The device that was hanging is an Asus CD-ROM drive. The error messages displayed were correct: I had a faulty SATA cable between the controller and the drive (funny how a SATA cable can go bad spontaneously). Reboots of the system did not clear the fault, but a full power-down and power-up would mask it for about an hour, after which it would start throwing the messages into the log every few seconds. It was this behaviour that led me to believe it was a problem with the SIIS driver. It wasn't until I noticed, on a reboot, that the system hung for a little while interrogating the drive during POST that I started to suspect the hardware. After a cable change and a lot of swearing, the computer booted fine and the error has never reappeared.

Some lessons learned:

1) Debug messages _MAY_ actually be telling the truth! :>
2) Reboots and software resets won't be heard by a SATA device whose port has been scrambled by bad cabling.
3) SATA cables may spontaneously disintegrate.
4) Modern computers respond less to threats than my older machines did :>

This being said, I have seen the other fault, where a device hangs during high load/activity. Mine will, if it is going to do it, hang somewhere between midnight and 3am, when I am running maintenance on the machine (find / -name "*.core" -exec rm {} \; ).
It does exactly as you said: a drive hangs, usually with the activity LED still lit. Sometimes the machine will continue on and ZFS will carry on in a degraded state. The odd thing is that it only started to do this when I was having problems with the CD-ROM on the SIIS card, even though the ZFS drives are on a completely different controller (JMicron). While the SIIS controller was waiting for the scrambled port to say hello, all sorts of weird things would happen. The mouse would lock up for 2 seconds and the keyboard would lock, and if a key was pressed when it happened, it would trigger the key-repeat (much to my amusement while reading email: I hit the delete key, the keyboard hung, and I watched it quickly run through deleting everything in my inbox :> ).

My question is: while the SIIS controller is hung waiting for the SATA port to respond, is it possible for the JMicron ports to miss an event, or get a double IRQ, and cause the device to lock?

Best wishes,
Peg