From: olli hauer
To: freebsd-current@freebsd.org
Cc: Ultima
Subject: Re: Possible zpool online, resilvering issue
Date: Wed, 10 Aug 2016 20:56:10 +0200

On 2016-08-04 07:22, Ultima wrote:
> Hello,
>
> I recently had some issues with a PSU and ran several scrubs on a pool
> with around 35T. Random drives would drop and require a zpool online,
> which found checksum errors (as expected). However, after all the scrubs
> I ran, I think I may have found a bug in the zpool online resilvering
> process.
>
> 24 disks total, 4 vdevs raidz2 (6 drives each).
>
> Before this next part... I had a backup PSU, however it was also going
> bad and waiting for RMA. The current one seemed to be dying but ran fine
> with fewer drives. So I decided I would run the server short 4 drives.
>
> I started by offlining 4 drives from different vdevs (or they were
> already removed from the PSU), then ran a scrub to verify everything.
> Many sum errors were present on some of the drives, but this was
> expected due to the faulty PSU. Then I offlined 4 different drives,
> onlined the other 4 and scrubbed once again. After the resilver, again,
> many sum errors on these drives, as expected.
>
> After the scrub completed, I decided to offline 4 different drives, then
> online the ones that were out of the pool for a while. During the
> resilver, checksum errors were once again found. I was surprised, given
> the recent scrub, so I decided to run another scrub, and it found even
> more checksum errors on these recently onlined drives.
> I didn't think much about it; however, after the replacement PSU
> arrived, I onlined all the drives that were out of the pool and again
> the resilver had checksum errors, and another scrub found more sum
> errors.
>
> Is this issue known? Is it common for a scrub to be required after
> onlining a disk that was out of the pool for some time?
>
> The drives are ST4000NM0033, and until recently they have never had a
> single checksum error in their lifetime (at least with zfs).
>
> FreeBSD S1 12.0-CURRENT FreeBSD 12.0-CURRENT #19 r303224: Sat Jul 23
> 10:41:12 EDT 2016
> root@S1:/usr/src/head/obj/usr/src/head/src/sys/MYKERNEL-NODEBUG
> amd64
>
> Sorry for the wall of text, but I hope this helps in tracking down this
> possible bug.
>

Perhaps one or more of the drives are running out of spare sectors for
reallocation?

I once had a case where smartctl showed no issues but a ZFS scrub showed
a defect. Some weeks later smartctl was showing some reallocated sectors,
and one week after that the HD was out of spare sectors.

Have you already tested every single HD for SMART issues?

--
olli
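
A minimal sketch of the per-drive SMART check suggested above, assuming smartmontools is installed and the drives report the usual ATA attribute table through `smartctl -A`. The helper names and the device names (`/dev/da0` and so on) are hypothetical and would need adjusting to the actual system. Only attribute 5 (Reallocated_Sector_Ct) is read here.

```python
import re
import subprocess


def reallocated_sectors(smartctl_output):
    """Return the raw value of SMART attribute 5, or None if it is absent.

    Expects the text printed by `smartctl -A <device>`, where each attribute
    row has ten columns and the raw value is the last one.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "5":
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


def check_drives(devices):
    """Run `smartctl -A` on each device and map it to its reallocated count."""
    results = {}
    for dev in devices:
        # smartctl uses its exit status as a bit mask, so a non-zero status
        # is not treated as a failure here.
        proc = subprocess.run(["smartctl", "-A", dev],
                              capture_output=True, text=True)
        results[dev] = reallocated_sectors(proc.stdout)
    return results


# Example (hypothetical device names, normally needs root):
#   check_drives(["/dev/da%d" % i for i in range(24)])
```

A raw value that grows between runs is the pattern described above. A value of 0 does not rule out a failing drive, since in that case smartctl showed nothing at first.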