Date: Fri, 15 Aug 2008 01:19:13 +0100
From: Cian Hughes <Ci@nHugh.es>
To: Sebastiaan van Erk <sebster@sebster.com>
Cc: freebsd-stable@freebsd.org
Subject: Re: Stable SATA pci card for FreeBSD 6.x/7.0
Message-ID: <8B25287C-7336-492C-B62E-CB319B8B5DBB@nHugh.es>
In-Reply-To: <48A3FCF7.9030905@sebster.com>
References: <48982B58.4000406@sebster.com> <48992532.9080503@yandex.ru>
 <489970CC.4000103@sebster.com> <20080806095748.GA52551@eos.sc1.parodius.com>
 <20080806101941.GA52952@eos.sc1.parodius.com> <48A2DD60.7090702@sebster.com>
 <20080814090521.GB27942@groll.co.za> <48A3FCF7.9030905@sebster.com>

Sebastiaan,

Have you tried connecting your 250GB drives to the troublesome
controller? If so, does "stressing" them cause the system to panic?

~Cian Hughes
--
University of Bristol Medical School

On 14 Aug 2008, at 10:37, Sebastiaan van Erk wrote:

> Thanks Jonathan,
>
> I'm starting to suspect it has to be the controller as well. About 20
> minutes after I posted this message yesterday (and thus 20 minutes
> after ad6 got disconnected - atacontrol list showed "no device
> present" for it) the machine crashed while writing to the remaining
> ad4 drive (kernel panic). I attached the logs below. I also ran the
> long SMART self-test on both drives, and no errors were found on
> either drive (logs also attached).
>
> Unfortunately I could not attach the new disks to my mainboard SATA
> because my mainboard SATA somehow hangs trying to detect them. So I
> cannot test if *not* using the controller is going to solve the
> problems, though it seems logical at the moment that it has to be the
> controller, especially if other people have had similar issues.
>
> I guess I'll be buying another controller.
>
> Regards,
> Sebastiaan
>
> Jonathan Groll wrote:
>> On Wed, Aug 13, 2008 at 03:10:56PM +0200, Sebastiaan van Erk wrote:
>>> Hi,
>>>
>>> Just an update on this issue.
>>>
>>> Quick summary: I fixed the BIOS issues, the hardware monitor
>>> issues, and the rl0/rl1 watchdog timeout issues (it seems).
>>> However I'm still having problems with my SATA drives (or at least
>>> one of them). More info below.
>>>
>>> BIOS:
>>> I flashed my BIOS to the latest version about a year ago, and
>>> never noticed that there was any problem, but it turns out there
>>> was. I never reset the BIOS to default factory settings after the
>>> upgrade, and it seems the settings were corrupt.
>>> After having reset the BIOS to the "default optimized factory
>>> settings" it stopped crashing when I go into the H/W monitor and
>>> also when using healthd -d (output below):
>>>
>>> Temp.= 40.0, 36.0, 66.0; Rot.= 0, 0, 0
>>> Vcore = 1.44, 3.12; Volt. = 3.34, 5.00, 1.95, -0.11, -1.54
>>> Temp.= 40.0, 36.0, 66.0; Rot.= 0, 0, 0
>>> Vcore = 1.44, 3.14; Volt. = 3.33, 4.97, 1.95, -0.11, -1.54
>>> Temp.= 40.0, 36.0, 66.0; Rot.= 0, 0, 0
>>> Vcore = 1.44, 3.12; Volt. = 3.34, 4.97, 1.95, -0.11, -1.54
>>> Temp.= 40.0, 36.0, 66.0; Rot.= 0, 0, 0
>>> Vcore = 1.44, 3.12; Volt. = 3.34, 5.00, 1.95, -0.11, -1.54
>>> Temp.= 40.0, 36.0, 66.0; Rot.= 0, 0, 0
>>> Vcore = 1.44, 3.12; Volt. = 3.34, 5.00, 1.95, -0.11, -1.54
>>>
>>> This also seems to have fixed the rl0 watchdog timeout problems. I
>>> no longer see those in my logs.
>>>
>>> SATA DRIVES:
>>>
>>> I'm still having problems with the SATA drives.
>>>
>>> I tried connecting the 1TB Samsung drives to my mainboard, but
>>> then the box hangs when booting with the "Detecting IDE drives"
>>> message. The regular (PATA) IDE drives are detected first, and
>>> then it repeats the "Detecting IDE drives" message to detect the
>>> SATA drives, and hangs. When I connect my 250GB SATA drives to my
>>> mainboard they detect fine, and the box boots normally.
>>>
>>> I did another rsync of my old mirror (the 250GB disks) to the new
>>> mirror (1TB disks), but again one of the disks got detached. This
>>> time there are no other messages in the log, the only thing I see
>>> is the following:
>>>
>>> Aug 13 14:35:27 piglet su: sebster to root on /dev/ttyp5
>>> Aug 13 14:55:38 piglet kernel: ad6: FAILURE - device detached
>>> Aug 13 14:55:38 piglet kernel: subdisk6: detached
>>> Aug 13 14:55:38 piglet kernel: ad6: detached
>>> Aug 13 14:55:38 piglet kernel: GEOM_MIRROR: Device gm1: provider ad6 disconnected.
>>> Aug 13 15:00:00 piglet newsyslog[1800]: logfile turned over due to size>100K
>>>
>>> Unfortunately the log file just got rotated, but in the new
>>> log file there is nothing except the one expected line:
>>>
>>> Aug 13 15:00:00 piglet newsyslog[1800]: logfile turned over due to size>100K
>>>
>>> So, nothing after the disconnect...
>>>
>>> The questions I have now are:
>>> 1) Could an upgrade to FreeBSD 7-STABLE fix the issue (it's a LOT
>>> of work for me, but I'll do it if there are SATA driver issues
>>> fixed).
>> I suspect the problem may be the SiI driver in FreeBSD. As a reference
>> point, I've had a similar problem, even on 7-STABLE, but with sparc64
>> hardware (see earlier post in this thread).
>> It'll probably be simplest for you to just buy another controller of
>> another brand. On the other hand, it'll be worth knowing exactly what
>> is wrong with the SiI driver...
>> Cheers,
>> Jonathan
> Aug 13 15:00:00 piglet newsyslog[1800]: logfile turned over due to size>100K
> Aug 13 15:11:26 piglet su: sebster to root on /dev/ttyp4
> Aug 13 15:34:55 piglet kernel: mirror/gm1s1e[WRITE(offset=875450693632, length=2048)]error = 6
> Aug 13 15:34:55 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450695680, length=2048)]error = 6
>
> [snip 335750 similar lines]
>
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450931200, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450933248, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450935296, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450937344, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450939392, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450941440, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450943488, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450945536, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450947584, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450949632, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450951680, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450953728, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450955776, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450957824, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450959872, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450961920, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450963968, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450966016, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450968064, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450970112, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450972160, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450974208, length=2048)]error = 6
> Aug 13 15:42:23 piglet syslogd: kernel boot file is /boot/kernel/kernel
> Aug 13 15:42:23 piglet kernel: Copyright (c) 1992-2008 The FreeBSD Project.
>
> smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Device Model:     SAMSUNG HD103UJ
> Serial Number:    S13PJ1BQ606865
> Firmware Version: 1AA01112
> User Capacity:    1,000,204,886,016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 3b
> Local Time is:    Thu Aug 14 11:28:13 2008 CEST
>
> ==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.
>
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x02) Offline data collection activity
>                                         was completed without error.
>                                         Auto Offline Data Collection: Disabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                 (11811) seconds.
> Offline data collection
> capabilities:                    (0x7b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        ( 198) minutes.
> Conveyance self-test routine
> recommended polling time:        (  21) minutes.
> SCT capabilities:              (0x003f) SCT Status supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0007   076   076   011    Pre-fail  Always       -       8010
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       8
>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
>   8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10255
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       272
>  10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
>  13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
> 183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
> 184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   057   052   000    Old_age   Always       -       43 (Lifetime Min/Max 43/48)
> 194 Temperature_Celsius     0x0022   056   050   000    Old_age   Always       -       44 (Lifetime Min/Max 43/50)
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       195799724
> 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
> 201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 0
> Warning: ATA Specification requires self-test log structure revision number = 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Offline             Completed without error       00%       261         -
> # 2  Offline             Aborted by host               40%       251         -
> # 3  Short offline       Aborted by host               00%       250         -
>
> SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
> SMART Selective self-test log data structure revision number 0
> Warning: ATA Specification requires selective self-test log data structure revision number = 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
> smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Device Model:     SAMSUNG HD103UJ
> Serial Number:    S13PJ1BQ607102
> Firmware Version: 1AA01112
> User Capacity:    1,000,204,886,016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 3b
> Local Time is:    Thu Aug 14 11:28:39 2008 CEST
>
> ==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.
>
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x02) Offline data collection activity
>                                         was completed without error.
>                                         Auto Offline Data Collection: Disabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                 (12131) seconds.
> Offline data collection
> capabilities:                    (0x7b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        ( 203) minutes.
> Conveyance self-test routine
> recommended polling time:        (  22) minutes.
> SCT capabilities:              (0x003f) SCT Status supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0007   077   077   011    Pre-fail  Always       -       7810
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
>   8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       9978
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       272
>  10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
>  13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
> 183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
> 184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   059   054   000    Old_age   Always       -       41 (Lifetime Min/Max 41/46)
> 194 Temperature_Celsius     0x0022   058   053   000    Old_age   Always       -       42 (Lifetime Min/Max 41/47)
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       31616
> 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
> 201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 0
> Warning: ATA Specification requires self-test log structure revision number = 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Offline             Completed without error       00%       261         -
> # 2  Offline             Aborted by host               40%       251         -
> # 3  Short offline       Aborted by host               00%       250         -
>
> SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
> SMART Selective self-test log data structure revision number 0
> Warning: ATA Specification requires selective self-test log data structure revision number = 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
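
With several hundred thousand near-identical write errors in the quoted log, it may be easier to summarize them than to read them. The following is only a rough sketch, not something posted in the thread: it assumes each syslog entry sits on a single unwrapped line in the format quoted above, and the names LINE_RE and summarize are made up for illustration. For reference, error 6 in FreeBSD's errno list is ENXIO ("Device not configured").

```python
import re
from collections import Counter

# Assumed line format, taken from the kernel messages quoted in the thread:
#   g_vfs_done():mirror/gm1s1e[WRITE(offset=875450693632, length=2048)]error = 6
# LINE_RE and summarize() are hypothetical helper names.
LINE_RE = re.compile(
    r"(?P<provider>[\w/]+)\[(?P<op>READ|WRITE)\(offset=(?P<offset>\d+), "
    r"length=(?P<length>\d+)\)\]error = (?P<errno>\d+)"
)


def summarize(lines):
    """Summarize failed I/O requests found in an iterable of syslog lines."""
    hits = [m for m in map(LINE_RE.search, lines) if m]
    if not hits:
        return None
    offsets = [int(m["offset"]) for m in hits]
    ends = [int(m["offset"]) + int(m["length"]) for m in hits]
    return {
        "count": len(hits),
        "providers": sorted({m["provider"] for m in hits}),
        "first_offset": min(offsets),   # lowest failing byte offset
        "last_end": max(ends),          # end of the highest failing request
        "errnos": Counter(int(m["errno"]) for m in hits),
    }


# Two error lines from the quoted log plus one unrelated line.
sample = [
    "Aug 13 15:34:55 piglet kernel: mirror/gm1s1e"
    "[WRITE(offset=875450693632, length=2048)]error = 6",
    "Aug 13 15:34:55 piglet kernel: g_vfs_done():mirror/gm1s1e"
    "[WRITE(offset=875450695680, length=2048)]error = 6",
    "Aug 13 15:42:23 piglet syslogd: kernel boot file is /boot/kernel/kernel",
]
print(summarize(sample))
```

Feeding the full /var/log/messages through summarize() (for example with open(path) as the iterable) would show whether every failure carries the same error number and which byte range of the mirror was affected.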