From owner-freebsd-stable@FreeBSD.ORG  Mon Jan 30 03:03:30 2012
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E991F1065670
	for <freebsd-stable@freebsd.org>; Mon, 30 Jan 2012 03:03:29 +0000 (UTC)
	(envelope-from gpalmer@freebsd.org)
Received: from noop.in-addr.com (mail.in-addr.com [IPv6:2001:470:8:162::1])
	by mx1.freebsd.org (Postfix) with ESMTP id B4AF88FC12
	for <freebsd-stable@freebsd.org>; Mon, 30 Jan 2012 03:03:29 +0000 (UTC)
Received: from gjp by noop.in-addr.com with local (Exim 4.77 (FreeBSD))
	(envelope-from <gpalmer@freebsd.org>)
	id 1RrhWX-000GGb-1i; Sun, 29 Jan 2012 22:03:17 -0500
Date: Sun, 29 Jan 2012 22:03:16 -0500
From: Gary Palmer <gpalmer@freebsd.org>
To: Peter Maloney <peter.maloney@brockmann-consult.de>
Message-ID: <20120130030316.GB60637@in-addr.com>
References: <20120127024815.GD17973@in-addr.com>
	<20120127030906.GA67449@icarus.home.lan>
	<20120127031351.GA67596@icarus.home.lan>
	<20120127034352.GG17973@in-addr.com>
	<4F2298A3.4030204@brockmann-consult.de>
	<20120130014138.GA60637@in-addr.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120130014138.GA60637@in-addr.com>
X-SA-Exim-Connect-IP: <locally generated>
X-SA-Exim-Mail-From: gpalmer@freebsd.org
X-SA-Exim-Scanned: No (on noop.in-addr.com); SAEximRunCond expanded to false
Cc: freebsd-stable@freebsd.org
Subject: Re: Panic on 7.4-RELEASE-p5
X-List-Received-Date: Mon, 30 Jan 2012 03:03:30 -0000

On Sun, Jan 29, 2012 at 08:41:38PM -0500, Gary Palmer wrote:
> On Fri, Jan 27, 2012 at 01:29:23PM +0100, Peter Maloney wrote:
> > On 01/27/2012 04:43 AM, Gary Palmer wrote:
> > >
> > >   After scanning selected spans, do NOT read-scan remainder of disk.
> > > If Selective self-test is pending on power-up, resume after 0 minute delay.
> > >
> > > I noticed a while ago that there were some "bad" sectors on the disk, and
> > > at the time they were under the swap partition if my math was correct,
> > > and the box never swaps so it wasn't a problem.  I don't know if
> > > the errors above are the same ones I saw earlier or not.
> > >
> > > There were no read or write errors on the console prior to the panic
> > > earlier today.  In fact the previous output on the console relates to
> > > the last reboot for a software upgrade (fixing some packages) 11
> > > days prior.  The only thing in logs going back to November relating
> > > to ad1 are boot messages.
> > >
> > > Thanks,
> > >
> > > Gary
> > >
> > 
> > Unmount your swap, and then write zeros to it to relocate the bad sectors.
> > 
> > in one shell:
> > gstat -I 100ms -f da#p#
> > 
> > in another:
> > swapoff /dev/da#p#
> > sysctl kern.geom.debugflags=0x10
> > dd if=/dev/zero of=/dev/da#p# bs=1M
> > (eventually it stops saying end of device or no space left; at this
> > point I am not sure if you should then continue writing where it stopped
> > in 512 byte blocks, or if it wrote a partial 1M in the last 1M)
> > 
> > Watch first shell. If the speed goes up, settles at a certain number,
> > then wildly goes down low and back up to that number, it is possibly
> > working.
> > 
> > Then repeat. If the same wild fluctuations happen, then the drive didn't
> > relocate enough, because it is trying to keep some semi-bad ones, or
> > they are only bad when reading. If it is just settling at a speed and
> > staying there, then it is probably successful. I don't know how reliable
> > it is. I have found it to be 100% reliable in my testing though. But
> > some/most disks lie to you on the "relocated sector count".
> > 
> > And then remount the swap and change that kernel parameter back.
> > sysctl kern.geom.debugflags=0
> > swapon /dev/da#p#
> > 
> > 
> > Your relocated sector count:
> > 
> >   5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
> > 
> > 
> > 
> > However, this does not fix your disk. eg. If you have heads grinding the
> > platter, you have dust flying around, and your disk will get worse.
> > 
> > Be VERY careful using dd to write directly to disks. If you use the
> > wrong slice, or you use the main device without slices and miscalculate,
> > bad things happen. This is why that kernel parameter was set to stop you.
> 
> Hi Peter,
> 
> I did things a little differently.  When I checked swapinfo, apparently I
> set the swap partition up just purely to act as a dump device - it wasn't
> used as swap.  So I tested it:
> 
> # recoverdisk /dev/ad1s1b /dev/ad1s1b
>         start    size     block-len state          done     remaining    % done
>     628097024 1040384       1040384     0     629137408             0 100.00000
> Completed
> 
> smartctl still reports:
> 
>   5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
> 
> I then did a read test across the whole disk with no errors
> 
> # recoverdisk /dev/ad1 /dev/null
>         start    size     block-len state          done     remaining    % done
>  120033640448  483328        483328     0  120034123776             0 100.00000
> Completed
> 
> Reallocated_Sector_Ct is still the same
> 
> I dunno where the problems are/were, but apparently I cannot hit them now
> through just reading the disk or writing to swap.
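
For anyone who wants to script the before/after comparison of
Reallocated_Sector_Ct rather than eyeballing it, here is a minimal sketch.
The function name smart_raw_value is made up for illustration, and it
assumes the usual ten-column attribute table that "smartctl -A" prints,
with the raw value in the tenth column (as in the line quoted above).

```shell
#!/bin/sh
# Hypothetical helper: print the raw value of one named SMART attribute.
# Reads `smartctl -A` output on stdin. Assumes the standard layout where
# column 2 is the attribute name and column 10 is the raw value.
# Exits non-zero if the attribute is not present in the input.
smart_raw_value() {
    awk -v name="$1" '
        $2 == name { print $10; found = 1 }
        END        { exit !found }
    '
}

# Usage sketch (device name taken from this thread; adjust as needed):
#   before=$(smartctl -A /dev/ad1 | smart_raw_value Reallocated_Sector_Ct)
#   ... run the write or read pass ...
#   after=$(smartctl -A /dev/ad1 | smart_raw_value Reallocated_Sector_Ct)
#   [ "$before" = "$after" ] && echo "no new reallocations"
```

An unchanged count only says the drive did not report new reallocations;
as Peter noted, not every disk is honest about that number.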


FYI I just ran both

smartctl -t short /dev/ad1

and

smartctl -t long /dev/ad1

and neither found any problems:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     33819         -
# 2  Short offline       Completed without error       00%     33818         -
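
If you would rather have a script check that log than read it by hand,
something like the following should work. It is only a sketch: the
function name selftest_log_clean is made up, and it assumes the log
format shown above, where each entry starts with "#" and a number. It
is deliberately strict, so an aborted or interrupted test counts as
"not clean" too.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -l selftest` output on stdin and
# succeed only if at least one test is logged and every logged test
# reports "Completed without error".
selftest_log_clean() {
    awk '
        /^# *[0-9]+ / {
            seen = 1
            if ($0 !~ /Completed without error/) bad = 1
        }
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Usage sketch (device name taken from this thread):
#   if smartctl -l selftest /dev/ad1 | selftest_log_clean; then
#       echo "all logged self-tests passed"
#   else
#       echo "check the self-test log by hand"
#   fi
```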

Thanks,

Gary