From: "Rodney W. Grimes"
To: "O. Hartmann"
CC: FreeBSD CURRENT
Date: Tue, 12 Dec 2017 10:52:27 -0800 (PST)
Subject: Re: SMART: disk problems on RAIDZ1 pool: (ada6:ahcich6:0:0:0): CAM status: ATA Status Error
Message-Id: <201712121852.vBCIqRuZ087701@pdx.rh.CN85.dnsmgr.net>
In-Reply-To: <20171212192220.119ca2d3@thor.intern.walstatt.dynvpn.de>

> Hello,
>
> running CURRENT (recent r326769), I realised that smartd sends out some console
> messages when booting the box:
>
> [...]
> Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Currently unreadable (pending) sectors
> Dec 12 14:14:33 <3.2> box1 smartd[68426]: Device: /dev/ada6, 1 Offline uncorrectable sectors
> [...]
>
> Checking the drive's SMART log with smartctl (it is one of four 3TB disk drives), I
> gathered this information:
>
> [... smartctl -x /dev/ada6 ...]
> Error 42 [17] occurred at disk power-on lifetime: 25335 hours (1055 days + 15 hours)
>   When the command that caused the error occurred, the device was active or idle.
>
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 c2 7a 72 98 40 00  Error: UNC at LBA = 0xc27a7298 = 3262804632
>
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   60 00 b0 00 88 00 00 c2 7a 73 20 40 08     23:38:12.195  READ FPDMA QUEUED
>   60 00 b0 00 80 00 00 c2 7a 72 70 40 08     23:38:12.195  READ FPDMA QUEUED
>   2f 00 00 00 01 00 00 00 00 00 10 40 08     23:38:12.195  READ LOG EXT
>   60 00 b0 00 70 00 00 c2 7a 73 20 40 08     23:38:09.343  READ FPDMA QUEUED
>   60 00 b0 00 68 00 00 c2 7a 72 70 40 08     23:38:09.343  READ FPDMA QUEUED
> [...]
>
> and
>
> [...]
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    64
>   3 Spin_Up_Time            POS--K   178   170   021    -    6075
>   4 Start_Stop_Count        -O--CK   098   098   000    -    2406
>   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
>   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
>   9 Power_On_Hours          -O--CK   066   066   000    -    25339
>  10 Spin_Retry_Count        -O--CK   100   100   000    -    0
>  11 Calibration_Retry_Count -O--CK   100   100   000    -    0
>  12 Power_Cycle_Count       -O--CK   098   098   000    -    2404
> 192 Power-Off_Retract_Count -O--CK   200   200   000    -    154
> 193 Load_Cycle_Count        -O--CK   001   001   000    -    2055746
> 194 Temperature_Celsius     -O---K   122   109   000    -    28
> 196 Reallocated_Event_Count -O--CK   200   200   000    -    0
> 197 Current_Pending_Sector  -O--CK   200   200   000    -    1
> 198 Offline_Uncorrectable   ----CK   200   200   000    -    1
> 199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
> 200 Multi_Zone_Error_Rate   ---R--   200   200   000    -    5
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
>
> [...]

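As a sanity check on the quoted smartctl output, the hex LBA in the error log can be converted to the decimal block number and to a byte offset with plain shell arithmetic. This is only a minimal sketch: it assumes 512-byte logical sectors (the thread does not state the drive's sector size), and the function names are made up for illustration.

```sh
# Convert a hex LBA as printed by smartctl (e.g. 0xc27a7298) to decimal.
lba_hex_to_dec() {
    echo "$(( $1 ))"
}

# Byte offset of an LBA on the device.
# $1 = LBA (decimal), $2 = logical sector size in bytes (assumed 512 if omitted)
lba_to_byte_offset() {
    echo "$(( $1 * ${2:-512} ))"
}
```

`lba_hex_to_dec 0xc27a7298` prints 3262804632, matching the decimal value smartctl reports, and `lba_to_byte_offset 3262804632` prints 1670555971584, which is roughly 1.67 TB into the drive.
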
The data up to this point tells us that you have one bad sector on a 3TB
drive. That is actually an expected event: given the unrecoverable read
error rate of drives like this, you are going to see one now and again.
Since you have only a single event, I would not suspect that this drive
is dying, but it would be prudent to prepare for that possibility.

>
> The ZFS pool is RAIDZ1, comprised of 3 WD Green 3TB HDD and one WD RED 3 TB HDD. The
> failure occurred on one of the WD Green 3 TB HDD.

OK, so the data is redundantly protected. This helps a lot.

> The pool is marked as "resilvered" - I do scrubbing on a regular basis and the
> "resilvering" message has now appeared for the second time in a row. Searching the net
> on SMART attribute 197 errors (in my case it is one), in combination with the
> problems that occurred, the recommendation is that I should replace the disk.

It is probably putting the RAIDZ in that state because the scrub is
finding a block it cannot read.

>
> Well, here comes the problem. The box is comprised from "electronical waste" made by
> ASRock - it is a Socket 1150/IvyBridge board, which got its last Firmware/BIOS update
> in 2013 and since then UEFI booting FreeBSD from a HDD isn't possible (just to indicate
> that I'm aware of having issues with crap, but that is some other issue right now). The
> board's SATA connectors are all populated.
>
> So: Due to the lack of adequate backup space I can only selectively backup portions, most
> of the space is occupied by scientific modelling data, which I had worked on. So backup
> exists! In one way or the other. My concern is how to replace the faulty HDD! Most
> HowTo's indicate a replacement disk being prepared and then "replaced" via ZFS's replace
> command. This isn't applicable here.
>
> Question: is it possible to simply pull the faulty disk (implies I know exactly which one
> to pull!) and then prepare and add the replacement HDD and let the system do its job
> resilvering the pool?

That may work, but I think I have a simpler solution.

>
> Next question is: I'm about to replace the 3 TB HDD with a more recent and modern 4 TB
> HDD (WD RED 4TB). I'm aware of the fact that I can only use 3 TB as the other disks are 3
> TB, but I'd like to know whether FreeBSD's ZFS is capable of handling it?

Someone else?

>
> This is the first time I have issues with ZFS and a faulty drive, so if some of my
> questions sound naive, please forgive me.

One thing to try is to see if we can get the drive to fix itself. First
order of business: can you take this server out of service? If so, I
would simply try

    repeat 100 dd if=/dev/whicheverhdisbad of=/dev/null bs=512 count=1 conv=noerror,sync iseek=3262804632

That tries to read that block 100 times. If it is successful even once,
SMART should remap the block and you are all done.

If that fails we can try to zero the block. There is a risk here, but
raidz should just handle this as data corruption of a block. This could
possibly lead to data loss, so USE AT YOUR OWN RISK ASSESSMENT.

    dd if=/dev/zero of=/dev/whateverdrivehasissues bs=512 count=1 oseek=3262804632

That should forcibly overwrite the bad block with 0's. The SMART firmware
will see that this block is in the pending list, write the data, and read
it back. If that succeeds, it removes the block from the pending list; if
it fails, it reallocates the block, writes the 0's to the reallocated
sector, and adds 1 to the remapped block count.

You might google for "how to fix a pending reallocation".

> Thanks in advance,
> Oliver
> --
> O. Hartmann

-- 
Rod Grimes                                                 rgrimes@freebsd.org
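
For readers not using csh/tcsh (where `repeat` is a builtin), the repeated-read step can be sketched as a Bourne shell loop. This is only an illustration under stated assumptions: `retry_read` is a hypothetical helper name, `skip=` is the POSIX spelling of FreeBSD dd's `iseek=`, and `conv=noerror` is left out on purpose so that dd's exit status shows whether the read succeeded.

```sh
# retry_read DEVICE LBA [TRIES]
# Try to read one 512-byte block at LBA up to TRIES times (default 100).
# Returns 0 on the first successful read, 1 if every attempt fails.
retry_read() {
    dev=$1
    lba=$2
    tries=${3:-100}
    i=1
    while [ "$i" -le "$tries" ]; do
        # dd exits non-zero on a read error because conv=noerror is not set.
        if dd if="$dev" of=/dev/null bs=512 skip="$lba" count=1 2>/dev/null; then
            echo "read succeeded on attempt $i"
            return 0
        fi
        i=$((i + 1))
    done
    echo "all $tries attempts failed"
    return 1
}
```

It would be called as `retry_read /dev/whicheverhdisbad 3262804632`. Stopping at the first success avoids hammering the sector once the drive has managed to return it.
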