Date: Sun, 24 Jan 2010 07:25:09 -0600 (CST)
From: Wes Morgan <morganw@chemikals.org>
To: jhell <jhell@DataIX.net>
Cc: freebsd-fs@freebsd.org, Rich <rincebrain@gmail.com>
Subject: Re: Errors on a file on a zpool: How to remove?
Message-ID: <alpine.BSF.2.00.1001240635360.2160@ibyngvyr>
In-Reply-To: <alpine.BSF.2.00.1001240043350.19303@pragry.qngnvk.ybpny>
References: <5da0588e1001222223m773648am907267235bdcf882@mail.gmail.com>
 <alpine.BSF.2.00.1001231733570.2160@ibyngvyr>
 <5da0588e1001231541l246769eao410c5ea6ccca0de4@mail.gmail.com>
 <A43CB93C-06D6-406D-A8C0-4E10E85661A2@gmail.com>
 <5da0588e1001231615t37c22575uedaae938be40f530@mail.gmail.com>
 <4B5B94B8.7070509@modulus.org>
 <5da0588e1001231638i349f8f17t297e970b08825441@mail.gmail.com>
 <alpine.BSF.2.00.1001232307590.83451@pragry.qngnvk.ybpny>
 <5da0588e1001232017m6c67731fwaa1d71cd86800017@mail.gmail.com>
 <alpine.BSF.2.00.1001232341590.19303@pragry.qngnvk.ybpny>
 <5da0588e1001232128w5a551674od0805c2ff0b884ad@mail.gmail.com>
 <alpine.BSF.2.00.1001240043350.19303@pragry.qngnvk.ybpny>

On Sun, 24 Jan 2010, jhell wrote:

> On Sun, 24 Jan 2010 00:28, rincebrain@ wrote:
> > On Sun, Jan 24, 2010 at 12:15 AM, jhell <jhell@dataix.net> wrote:
> > > From what I see and what was already mentioned earlier in this thread is
> > > meta data corruption but the checksum errors do not span across the whole
> > > pool of vdevs. These are, correct me if I am wrong USB mass storage
> > > devices ? SSD ?
> >
> > 1.5T Seagate 7200RPM drives.
> >
> > > In the arrangement of the devices on the system are da2,4,5 on the same
> > > hub and da6,7 on another ? If this is the case you may have consolidated
> > > your errors down to being a USB problem and narrowed down to where they
> > > are connected to.
> >
> > ...no.
> >
> > All five are on the same SATA controller. These behaviors persist
> > independent of which SATA controller they are plugged into, and I've
> > tried all seven in the machine.
> >
> > > What happened to da1,3 ? Were these once connected to the system ? and
> > > if so did you start noticing this problem occur roughly about the same
> > > period they were removed ?
> >
> > da1,3 are being used in another disk pool, and were never a part of this
> > pool.
> >
> > This is not an issue of a faulty SATA controller or SATA drives.
> >
> > This is an issue of "there was a single faulty stick of RAM in the machine".
>
> Yeah I read this earlier, My apologies it slipped while I was writing "mind
> went into multi-write single read mode".
>
> > I have sixteen disks in this machine. These three are having issues
> > only on these particular files, and only on these files, not on random
> > portions of the disk. The disks never report read errors - the ZFS
> > layer is what reports them. SMART is not reporting any difficulties in
> > reading any sectors of these disks.
> >
> > I could be mistaken, but I do not believe there to be a faulty
> > controller in play at this time. I've rotated the drives among the
> > spares of the 24 ports on the SATA controller in question, as well as
> > the on-motherboard controller, and this behavior has persisted.
> >
> > - Rich
>
> As I was thinking earlier... you mentioned you scrubbed multiple times with no
> difference. When I was mentioning the attempt to remove/replace I was thinking
> this will cause a "re-silvering" of the drives possibly fixing meta-data for
> the effected disks if good meta-data still exists somewhere.
>
> Might be worth a shot but I would start with the replace of the devices that
> are showing the errors until you can clear the errors successfully without
> them showing up again and/or until you have replaced all disks.

This is a non-redundant pool. The remove command will not work. Replace
will, but for that pool to function at all, *every* device must be
present. If the metadata was recoverable, I think that the scrub would
have reported "xxx kb repaired".

From http://dlc.sun.com/osol/docs/content/ZFSADMIN/gbbwl.html:

  "If the object number to a file path cannot be successfully translated,
  either due to an error or because the object doesn't have a real file
  path associated with it, as is the case for a dnode_t, then the dataset
  name followed by the object's number is displayed. For example:

  monkey/dnode:<0x0>"

Which seems to be precisely your error. Continuing:

  "Then, try removing the file with the rm command. If this command
  doesn't work, the corruption is within the file's metadata, and ZFS
  cannot determine which blocks belong to the file in order to remove the
  corruption. If the corruption is within a directory or a file's
  metadata, the only choice is to move the file elsewhere. You can safely
  move any file or directory to a less convenient location, allowing the
  original object to be restored in place."

In other words, either move the files out of the way or restore the pool.
I'd wager that any other filesystem would have simply wiped out entire directory trees or possibly just panicked with this kind of corruption.
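
For anyone scripting around this, here is a minimal Python sketch of the distinction the Sun documentation quoted above draws between an error entry that resolves to a file path and one that only shows a dataset name and object number. It assumes the entries look like the quoted example (`monkey/dnode:<0x0>`); the function name `classify_error_entry` and the tuple it returns are made up for illustration and are not part of any ZFS tool.

```python
import re

# Assumed shape of an unresolved entry, taken from the documentation example:
#   <dataset name>:<0x<hexadecimal object number>>
_OBJECT_ENTRY = re.compile(r"^(?P<dataset>.+):<0x(?P<obj>[0-9a-fA-F]+)>$")


def classify_error_entry(entry):
    """Classify one permanent-error entry (hypothetical helper).

    Returns ("object", dataset, object_number) when the entry is a dataset
    name followed by an object number, meaning no file path could be
    resolved. Returns ("path", entry, None) for anything else.
    """
    entry = entry.strip()
    match = _OBJECT_ENTRY.match(entry)
    if match:
        return ("object", match.group("dataset"), int(match.group("obj"), 16))
    return ("path", entry, None)
```

Per the same documentation, an entry classified as "path" is one you can first try to remove with rm. An "object" entry has no real file path to act on.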