Date:      Mon, 2 Dec 2024 21:15:24 -0800
From:      David Christensen <dpchrist@holgerdanske.com>
To:        questions@freebsd.org
Subject:   Re: CAM status: SCSI Status Error
Message-ID:  <c5291a54-36e7-40fa-ac65-0e4347b05306@holgerdanske.com>
In-Reply-To: <3a9549fa-c8e1-479e-8492-6dd812462731@app.fastmail.com>
References:  <665ca364-6538-4ef7-bb8b-260dd86ca0bb@app.fastmail.com> <20721bcf-7c99-4918-bbb0-53d6c8e9cda7@holgerdanske.com> <3a9549fa-c8e1-479e-8492-6dd812462731@app.fastmail.com>

On 12/2/24 08:15, Dan Langille wrote:
> On Fri, Nov 22, 2024, at 1:14 PM, David Christensen wrote:
>> On 11/22/24 05:11, Dan Langille wrote:
>>> On FreeBSD 14.1, is this a server issue (e.g. cable/hardware) as opposed to a drive issue?
>>>
>>> Nov 21 05:28:48 r730-03 kernel: (da7:mrsas0:1:7:0): READ(10). CDB: 28 00 aa d9 5b 5f 00 00 20 00
>>> Nov 21 05:28:48 r730-03 kernel: (da7:mrsas0:1:7:0): CAM status: SCSI Status Error
>>> Nov 21 05:28:48 r730-03 kernel: (da7:mrsas0:1:7:0): SCSI status: OK
>>> Nov 21 05:28:54 r730-03 kernel: (da7:mrsas0:1:7:0): READ(10). CDB: 28 00 aa d9 6b 08 00 00 10 00
>>> Nov 21 05:28:54 r730-03 kernel: (da7:mrsas0:1:7:0): CAM status: SCSI Status Error
>>> Nov 21 05:28:54 r730-03 kernel: (da7:mrsas0:1:7:0): SCSI status: OK
>>> Nov 21 05:55:34 r730-03 smartd[17215]: Device: /dev/da7 [SAT], ATA error count increased from 4 to 8
>>
>>
>> I believe those errors are related to the connection between the drive
>> and the host -- e.g. cables, connectors, and/or interface chips.  I
>> would replace the cable with a known good cable.
> 
> This drive is in a drive bay. Perhaps a re-seat is called for.


Yes.  I might clean whatever electrical contacts are accessible with a
cotton swab and rubbing alcohol, then re-seat the connection a couple of
times to wipe the pins and sleeves.


>> A failing power supply can cause all sorts of problems.  I would check
>> the PSU with a hardware tester.
> 
> I don't have that option. It is a Dell R730 with dual PSU.


Understood.  Do the PSUs and/or server have PSU test buttons and/or
status LEDs?


>>> Followed by this from time to time:
>>>
>>> Nov 21 16:55:33 r730-03 smartd[17215]: Device: /dev/da7 [SAT], Self-Test Log error count increased from 0 to 1
>>> Nov 22 11:25:35 r730-03 smartd[17215]: Device: /dev/da7 [SAT], 1 Currently unreadable (pending) sectors
>>
>>
>> STFW I found a good explanation for pending sectors:
>>
>> https://superuser.com/questions/384095/how-to-force-a-remap-of-sectors-reported-in-s-m-a-r-t-c5-current-pending-sector
>>
>>
>> If you can identify the address (LBA) of the bad sector, you could use
>> dd(1) to overwrite the bad sector.  If the drive is in an operating
>> pool, this could be risky.  Shutting down and using live media would be
>> safer.  In either case, you will want to scrub afterwards.
> 
> Sounds like RMA is much easier. ;)


If the warranty covers "unreadable (pending) sectors", perhaps so.


Otherwise, I think failing sectors on magnetic HDDs have become a fact
of life, given that disk drives have become so large and contain so
many sectors.  With ZFS, sufficient redundancy, regular scrubs, and
system administrator intervention, there should be no data loss as long
as the quantity and frequency of failed sectors stay small.  Continued
use of such drives may be justified.  Of course, continue to back up
and archive regularly.


> There is a replacement drive here now. I'm just waiting for other hardware to arrive. All the drive bays are full. I'm going to move 2x 2.5" drives to the read via PCIe slots.


What is "read via PCIe slots"?  Please clarify.


> That will allow me to install the new drive, add it as a replacement to the mirror. When resilvered, the old drive will be dropped out of the filesystem.
> 
> Then I can play with zeroing the whole drive. 


I would add the replacement drive to the pool, allow it to resilver,
remove the drive in question from the pool, physically remove the drive
in question, and put the drive in question into a workbench machine for
testing and troubleshooting.  I would overwrite the problematic sector
and then run a SMART long test.
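
To find a candidate address for that sector, the READ(10) CDBs in the
kernel messages quoted above can be decoded.  In a READ(10) CDB, byte 0
is the opcode (0x28), bytes 2-5 are the starting LBA (big-endian), and
bytes 7-8 are the transfer length in blocks.  The following is only a
minimal sketch; the function name decode_read10 is made up for
illustration.  Note the assumptions: the failing sector lies somewhere
in the decoded range, not necessarily at its start, and because da7 sits
behind an mrsas controller, the LBA seen by CAM may not be identical to
the LBA on the physical drive.

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (starting LBA, block count) from a READ(10) CDB.

    cdb_hex is the space-separated hex string printed by the kernel,
    e.g. "28 00 aa d9 5b 5f 00 00 20 00".
    """
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: starting LBA
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    return lba, blocks


if __name__ == "__main__":
    # The two CDBs from the Nov 21 05:28 kernel messages.
    for cdb in ("28 00 aa d9 5b 5f 00 00 20 00",
                "28 00 aa d9 6b 08 00 00 10 00"):
        lba, blocks = decode_read10(cdb)
        print(f"LBA {lba} through {lba + blocks - 1} ({blocks} blocks)")
```

For the two logged reads, this gives a 32-block range starting at LBA
2866371423 and a 16-block range starting at LBA 2866375432.  I would
cross-check those against the LBA reported in the SMART self-test log
before overwriting anything.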


> If energetic, I may then add the drive back as a single drive filesystem (for testing purposes). Then fill it up with data and see how that goes.
> 
> Thank you.


YW.  Let us know how it turns out.


David



