Date:      Wed, 13 Aug 2003 13:49:09 -0500
From:      "P. Larry Nelson" <lnelson@uiuc.edu>
To:        aic7xxx@freebsd.org
Subject:   scsi errors only when writing to Promise disks
Message-ID:  <3F3A8825.9EE28ED3@uiuc.edu>

I have just joined this list hoping for some guidance as to what might be
wrong, or at least where to turn for help, as I don't seem to be getting
very far with RedHat.  Google searches on the errors or on this particular
Promise raid system lead nowhere.

I'm seeing thousands (last count was 26,000) of the following errors
in /var/log/messages when running some large write tests on an external
disk connected to an Adaptec 29160 (details of the ad hoc test further below):

[sample two line entry:]
 <date/time> <hostname> kernel: (scsi1:A:5:0): parity error detected in Data-out phase. SEQADDR(0x1a3) SCSIRATE(0xc2)
 <date/time> <hostname> kernel: ^INo terminal CRC packet received
[note that the address in SEQADDR is the only thing that changes in previous
 and subsequent messages]
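
To put a number on the messages and see which SEQADDR values occur, the log can be tallied with a short script. This is a minimal sketch, not anything the driver or RedHat provides: the regular expression is an assumption built only from the sample entry above, and the function name `tally_parity_errors` is made up for illustration.

```python
import re
from collections import Counter

# Pattern assumed from the single sample entry above; real log lines
# may differ in spacing or wording, in which case this needs adjusting.
PARITY_RE = re.compile(
    r"\(scsi(?P<host>\d+):(?P<chan>[A-Z]):(?P<target>\d+):(?P<lun>\d+)\):\s+"
    r"parity error detected in (?P<phase>\S+)\s+phase\.\s+"
    r"SEQADDR\((?P<seqaddr>0x[0-9a-fA-F]+)\)\s+"
    r"SCSIRATE\((?P<scsirate>0x[0-9a-fA-F]+)\)"
)


def tally_parity_errors(lines):
    """Count parity-error lines, keyed by (device, SEQADDR).

    `lines` is any iterable of log lines (for example an open file).
    Lines that do not match the assumed pattern are ignored.
    """
    counts = Counter()
    for line in lines:
        match = PARITY_RE.search(line)
        if match:
            device = "scsi{host}:{chan}:{target}:{lun}".format(**match.groupdict())
            counts[(device, match.group("seqaddr"))] += 1
    return counts
```

Passing an open handle on /var/log/messages to `tally_parity_errors` and summing the returned counts gives the total number of parity-error entries; the keys show how many distinct SEQADDR values appear per device.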

If you know what's going on, you can stop reading here and email me
the problem, solution, hints, workarounds, commiserations, whatever.
Otherwise, here are many more details.

System description: 
 Software: 
  Red Hat Linux release 9 (Shrike) 
  Linux version 2.4.20-18.9smp (bhcompile@porky.devel.redhat.com) 
  (gcc version 3.2.2 20030222 (Red Hat Linux 3.2.2-5)) #1 SMP Thu May 29 
  06:55:05 EDT 2003 
 [BTW, same problem occurs with RedHat 8]
 
 Drivers loaded: 
  Module                  Size  Used by    Not tainted
  soundcore               7044   0  (autoclean)
  lp                      9188   0  (autoclean)
  parport                39072   0  (autoclean) [lp]
  nfs                    84600   2  (autoclean)
  lockd                  59536   1  (autoclean) [nfs]
  sunrpc                 87516   1  (autoclean) [nfs lockd]
  e1000                  60704   2 
  microcode               5184   0  (autoclean)
  loop                   12888   0  (autoclean)
  keybdev                 2976   0  (unused)
  mousedev                5688   1 
  hid                    22404   0  (unused)
  input                   6208   0  [keybdev mousedev hid]
  usb-uhci               27468   0  (unused)
  usbcore                82816   1  [hid usb-uhci]
  ext3                   73376   4 
  jbd                    56368   4  [ext3]
  lvm-mod                64512   1 
  aic7xxx               142516   5 
  sd_mod                 13452  10 
  scsi_mod              110904   2  [aic7xxx sd_mod]

 I don't know which version of the aic7xxx driver is in use - how does one tell?
 
 Hardware: 
  Open Storage Solutions 2U rack mount server with Intel SE7500WV2 motherboard,
  dual Xeon 2GHz processors, 2 GB RAM, 18 GB and 73 GB internal SCSI disks,
  Adaptec AIC29160 SCSI card, external Promise Ultratrak RM 15000 RAID system
  connected to the AIC29160. [all disks are set up for journaling, i.e., ext3]

 Test details:
  The test consists of doing some relatively large copies of files to the 
  external disk, which mounts just fine and shows no errors at all with 
  smallish writes.  Seems like any write (file copy) over, say, 300,000 bytes, 
  will generate the error.  For example, the following command will generate
  two such occurrences of the pair of lines listed above:
   'cp /boot/vmlinuz /mnt2'  
  In this case, the file is a little over 1 MB.
 - same test does not generate any errors when writing to the internal disks.
 - moved internal disks to the Adaptec 29160 and tried the write test again - 
   no errors.
 - get same errors regardless of whether the test is done against a raid set on
   the external Promise or to a single jbod disk in the Promise.
 - when the exact same hardware setup had Win2k loaded, there were no errors 
   writing to the Promise.
 - when the Promise disk raid was attached to an Alpha running Tru64 unix,
   there were no errors when writing to the disks.
[in other words, this Promise Raid system has been checked out on other systems
with no problems at all]
 - a different scsi controller was not tried (I have no others; besides, this
   one worked fine when it was part of the Win2k setup).
 - neither was a different linux tried (like debian or suse, etc.)

In other words, the errors only come when doing writes larger than roughly
300 KB through the Adaptec 29160 controller, on RedHat, to a Promise Ultratrak
RM 15000 raid system.  There doesn't seem to be anything wrong with the files -
a diff of the original and copy shows no differences.
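
The copy-and-compare test can be sketched as a small script that tries several file sizes in one run. This is only an illustration under stated assumptions, not the procedure actually used: the function names `copy_and_verify` and `sweep` are hypothetical, and the test files are filled with random data rather than being real files such as /boot/vmlinuz.

```python
import filecmp
import os
import shutil
import tempfile


def copy_and_verify(size_bytes, dest_dir):
    """Write size_bytes of random data to a temp file, copy it into
    dest_dir, and compare the two files byte by byte.

    Returns True when the copy matches the original. Both files are
    removed afterwards.
    """
    with tempfile.NamedTemporaryFile(delete=False) as src:
        src.write(os.urandom(size_bytes))
        src_path = src.name
    dst_path = os.path.join(dest_dir, os.path.basename(src_path))
    try:
        shutil.copyfile(src_path, dst_path)
        # shallow=False forces a content comparison, like diff/cmp.
        return filecmp.cmp(src_path, dst_path, shallow=False)
    finally:
        os.remove(src_path)
        if os.path.exists(dst_path):
            os.remove(dst_path)


def sweep(sizes, dest_dir):
    """Run copy_and_verify for each size; return {size: matched}."""
    return {size: copy_and_verify(size, dest_dir) for size in sizes}
```

For instance, `sweep([100000, 300000, 1000000], "/mnt2")` would cover sizes on both sides of the rough 300 KB mark. As with the diff, the read-back may be served from the page cache rather than the disk, so a matching compare says nothing about the kernel messages; /var/log/messages still has to be checked after each run.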

This is all particularly bothersome as I need to set up a number of these
systems as large (multi-terabyte) file servers in order to handle massive
amounts of experimental data.  Another problem I discovered (as we migrate
away from Alphas) is that I'm limited (at present) to 2 TB logical volumes
in LVM, and I need to make LVs of upwards of 6 TB, but I digress and
that's another story....  (I understand that the 2.6 kernel can handle these)

One final note: I am bound to the use of RedHat because of software constraints
imposed by the national lab where the data is being generated (they're using
RedHat, so we have to, also).

Many thanks in advance!
- Larry 
-- 
P. Larry Nelson (217-244-9855) | Systems/Network Administrator
461 Loomis Lab                 | U of I, CITES Departmental Services
1110 W. Green St., Urbana, IL  | Consultant to: High Energy Physics Group
MailTo:lnelson@uiuc.edu        | http://www.uiuc.edu/ph/www/lnelson
-------------------------------------------------------------------------
 "Information without accountability is just noise."  - P.L. Nelson


